By Andres Rodriguez and Niveditha Sundaram
Every day, the world generates more and more information — text, pictures, videos and more. In recent years, artificial intelligence and deep learning have improved several applications that help people better understand this information with state-of-the-art voice/speech recognition, image/video recognition, and recommendation engines.
Most deep learning workloads consist of both training and inference. Training usually requires many hours or days to complete. Inference usually requires milliseconds or seconds and is often a step of a larger process. While the computing intensity of inference is much lower than that of training, inference is often done on a much larger dataset. Therefore, the total computing resources spent on inference are likely to dwarf those spent on training. The overwhelming majority of all inference workloads run on Intel Xeon CPUs.
Over the past year, Intel rapidly added CPU support across several deep learning frameworks to optimize for a variety of training and inference applications. At the center of these optimizations is the Intel® Math Kernel Library (Intel MKL), which makes use of Intel Advanced Vector Extensions CPU instructions (e.g., Intel AVX-512) that provide enhanced support for deep learning applications.
Caffe2* is an open source deep learning framework created by Facebook and built with expression, speed, and modularity in mind. Caffe2 is deployed at Facebook to help researchers train large machine learning models and deliver AI on mobile devices. Now, developers will have access to many of the same tools, allowing them to run large-scale distributed training scenarios and build machine learning applications for mobile.
Intel and Facebook are collaborating to integrate Intel MKL functions into Caffe2 for optimal inference performance on CPUs. Table 1 shows inference performance numbers on AlexNet* using the Intel MKL library and the Eigen* BLAS library for comparison. In this table, OMP_NUM_THREADS indicates the number of physical cores used in these workloads (details in the table caption). These results show that Caffe2 is highly optimized on CPUs and offers competitive performance. For small-batch inference workloads, it is recommended to run each workload on its own CPU core and to run multiple workloads in parallel, one workload per core.
| Batch size | OMP_NUM_THREADS=44: Intel MKL (images/sec) | OMP_NUM_THREADS=44: Eigen BLAS (images/sec) | OMP_NUM_THREADS=1: Intel MKL (images/sec) | OMP_NUM_THREADS=1: Eigen BLAS (images/sec) |
|---|---|---|---|---|
| 1 | 173.4 | 5.2 | 28.6 | 5.1 |
| 32 | 1500.2 | 29.3 | 64.6 | 15.4 |
| 64 | 1596.3 | 35.3 | 66.0 | 15.5 |
| 256 | 1735.2 | 44.9 | 67.3 | 16.2 |
Table 1: Performance results on Caffe2 using the AlexNet topology with Intel® MKL and Eigen BLAS. Experiments were performed on Intel Xeon processor E5-2699 v4 (codename Broadwell) @ 2.20GHz with dual sockets, 22 physical cores per socket (total of 44 physical cores in both sockets), 122GB RAM DDR4, 2133 MHz, HT Disabled, on Linux 3.10.0-514.2.2.el7.x86_64 CentOS 7.3.1611, Intel MKL version 20170209, Eigen BLAS version 3.3.2, based on Caffe2 as of April 18, 2017.
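To make the comparison in Table 1 easier to read, the short Python sketch below computes the Intel MKL speedup over Eigen BLAS for each batch size, together with a rough estimate of what one single-threaded workload per core might deliver in aggregate. The throughput numbers are copied from Table 1. The function names are illustrative, and the aggregate estimate assumes perfect linear scaling across all 44 cores, which the article does not measure and which real systems are unlikely to reach because cores share memory bandwidth and cache.

```python
# Throughput in images/sec, copied from Table 1 and keyed by batch size.
# Tuple order: (MKL 44 threads, Eigen 44 threads, MKL 1 thread, Eigen 1 thread).
TABLE_1 = {
    1: (173.4, 5.2, 28.6, 5.1),
    32: (1500.2, 29.3, 64.6, 15.4),
    64: (1596.3, 35.3, 66.0, 15.5),
    256: (1735.2, 44.9, 67.3, 16.2),
}
PHYSICAL_CORES = 44  # dual socket, 22 physical cores each (from the caption)


def mkl_speedup_over_eigen(batch_size):
    """Return (speedup with 44 threads, speedup with 1 thread)."""
    mkl_44, eigen_44, mkl_1, eigen_1 = TABLE_1[batch_size]
    return mkl_44 / eigen_44, mkl_1 / eigen_1


def one_worker_per_core_estimate(batch_size, cores=PHYSICAL_CORES):
    """Idealized aggregate MKL throughput with one 1-thread worker per core.

    ASSUMPTION: perfect linear scaling, so this is an upper bound only.
    """
    mkl_single_thread = TABLE_1[batch_size][2]
    return mkl_single_thread * cores


if __name__ == "__main__":
    for batch in sorted(TABLE_1):
        s44, s1 = mkl_speedup_over_eigen(batch)
        est = one_worker_per_core_estimate(batch)
        print(f"batch {batch:>3}: MKL/Eigen = {s44:.1f}x (44 thr), "
              f"{s1:.1f}x (1 thr); ideal 44x1-thread = {est:.1f} images/sec")
```

At batch size 1, the idealized estimate is 28.6 × 44 = 1258.4 images/sec, compared with the measured 173.4 images/sec for a single 44-thread workload. Even if real scaling falls well short of linear, that gap is consistent with the recommendation to run one small-batch workload per core.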
Instructions to install and use Caffe2 can be found at http://Caffe2.ai.
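Once Caffe2 is installed, the one-workload-per-core recommendation for small-batch inference could be set up along the following lines. This is a minimal sketch and not part of Caffe2: it assumes Linux (for `os.sched_setaffinity`), the helper names `worker_env` and `launch_pinned_workers` are hypothetical, and the demo command is a placeholder that stands in for a real inference script.

```python
import os
import subprocess
import sys


def worker_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


def launch_pinned_workers(command, core_ids):
    """Start one copy of `command` per core, each pinned to that core.

    Linux only: relies on os.sched_setaffinity being available.
    """
    procs = []
    for core in core_ids:
        def pin_to_core(core=core):
            # Runs in the child just before exec; pid 0 means "this process".
            os.sched_setaffinity(0, {core})

        procs.append(
            subprocess.Popen(command, env=worker_env(), preexec_fn=pin_to_core)
        )
    return procs


if __name__ == "__main__":
    # Placeholder workload: each worker only reports its own settings.
    # Swap in the real inference command (hypothetical, not from the article).
    demo = [
        sys.executable,
        "-c",
        "import os; print(os.sched_getaffinity(0), os.environ['OMP_NUM_THREADS'])",
    ]
    cores = sorted(os.sched_getaffinity(0))  # cores this process may use
    for proc in launch_pinned_workers(demo, cores):
        proc.wait()
```

The IDs returned by `os.sched_getaffinity` are logical CPUs. The measurements in Table 1 were taken with Hyper-Threading disabled, so on a system with it enabled the core list would need to be narrowed to one logical CPU per physical core to match that setup.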
Later this year, the new generation of Intel Xeon processors (codename Skylake) will become available to the general market. Skylake introduces 512-bit wide Fused Multiply Add (FMA) instructions as part of the larger 512-bit wide vector engine, i.e., Intel AVX-512, providing a significant performance boost over the previous 256-bit wide AVX2 instructions in the Haswell/Broadwell processors for both training and inference workloads. The 512-bit wide FMAs essentially double the FLOPS that Skylake can deliver and significantly speed up the single-precision matrix arithmetic used in convolutional and recurrent neural networks. Inference workloads are massively parallel and will benefit from the larger core count offered by Skylake. In addition, the Skylake CPUs have a re-architected memory subsystem that supports faster system memory and a larger Mid-Level Cache (MLC) per core, which also contributes to the performance improvements over current-generation CPUs and to a significant enhancement over the common installed base of four-year-old systems.
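The "doubling" of FLOPS from wider FMAs can be checked with simple arithmetic on peak single-precision throughput per core per clock cycle. The sketch below is illustrative only: the function name is made up, and the default of two FMA units per core is an assumption that does not come from the article and may not hold for every processor model. Clock frequency, which can differ between AVX2 and AVX-512 code, is deliberately left out.

```python
def peak_sp_flops_per_cycle(vector_bits, fma_units=2):
    """Peak single-precision FLOPs per core per cycle from FMA instructions.

    ASSUMPTION: `fma_units` FMA execution units per core (default 2), each
    able to start one full-width FMA per cycle.
    """
    lanes = vector_bits // 32   # number of 32-bit floats in one vector
    flops_per_fma_lane = 2      # one multiply plus one add
    return lanes * flops_per_fma_lane * fma_units


avx2_peak = peak_sp_flops_per_cycle(256)    # 256-bit AVX2 vectors
avx512_peak = peak_sp_flops_per_cycle(512)  # 512-bit AVX-512 vectors

if __name__ == "__main__":
    print(f"AVX2:    {avx2_peak} FLOPs/cycle/core")
    print(f"AVX-512: {avx512_peak} FLOPs/cycle/core")
    print(f"ratio:   {avx512_peak / avx2_peak:.1f}x")
```

Under these assumptions, a 256-bit vector holds 8 single-precision values and a 512-bit vector holds 16, so the per-cycle peak rises from 32 to 64 FLOPs per core. Realized gains depend on memory bandwidth, clock frequency, and how well the workload vectorizes.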