
Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability


AT A GLANCE

  • Facebook and Intel collaborated to improve PyTorch performance on 3rd Gen Intel® Xeon® Scalable Processors.
  • Harnessing Intel® Deep Learning Boost’s new BFloat16 capability, the team substantially improved PyTorch performance across multiple training workloads: representative computer vision model training improved by up to 1.64x and DLRM model training by up to 1.40x over FP32, and INT8 inference for the DLRM model improved by up to 2.85x over FP32.

Intel and Facebook have previously demonstrated the benefits of BFloat16 (BF16) across multiple deep learning training workloads, with the same accuracy as 32-bit floating point (single precision, FP32) and with no changes to the training hyperparameters. Today, Intel is announcing the 3rd Gen Intel® Xeon® Scalable processors (formerly codenamed Cooper Lake) with Intel® Deep Learning Boost’s (Intel DL Boost) new BF16 technology to accelerate training and inference performance. This comes in addition to the support for Intel DL Boost INT8 technology introduced last year.

Intel and Facebook continue their collaboration to improve the performance of machine learning models on PyTorch, this time working together to enable BF16 technology and deliver up to 1.64x training performance improvement with BF16 over FP32 on 3rd Gen Intel Xeon Scalable processors. This collaboration will benefit the PyTorch community by enabling faster training and inference times on CPUs.

In this article, we detail Intel DL Boost’s BF16 capability in the new 3rd Gen Intel Xeon processor, and the joint collaboration between Intel and Facebook to bring the benefits of this feature to the PyTorch community.

Hardware advancements

Most machine learning applications today use 32-bit floating point (single precision, FP32) for training and inference workloads. Deep learning practitioners have demonstrated the effectiveness of lower numerical precisions for both training (BF16) and inference (INT8): 16-bit multipliers with 32-bit accumulators for training and inference with insignificant to no loss in accuracy, and 8-bit multipliers with 32-bit accumulators for some inference workloads with minimal to some loss in accuracy. These results have driven wider adoption of lower precision on workloads like DLRM.

Lower precision increases the performance in two ways: 1) the additional multiply-accumulate (MAC) throughput boosts compute-bound operations, and 2) the reduced footprint (from using 16-bits rather than 32-bits) boosts memory bandwidth bound operations by enabling faster data movement through the memory hierarchy.
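
As a small illustration of the footprint reduction, the PyTorch snippet below compares the per-element storage of FP32 and BF16 tensors (the tensors and sizes here are arbitrary examples):

```python
import torch

# BF16 stores each value in 2 bytes instead of FP32's 4 bytes, halving
# the data volume moved through the memory hierarchy for bandwidth-bound ops.
fp32 = torch.randn(1024, 1024, dtype=torch.float32)
bf16 = fp32.to(torch.bfloat16)

print(fp32.element_size(), bf16.element_size())   # 4 vs. 2 bytes per element
```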

Intel is introducing native BF16 support in 3rd Gen Intel Xeon Scalable processors through new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions: a BF16→FP32 fused multiply-add (FMA), shown in Figure 1, which doubles the theoretical compute throughput over FP32 FMAs, and an FP32→BF16 conversion instruction. In addition, this hardware natively supports the INT8 FMAs introduced in the previous generation, which quadruple the theoretical compute throughput over FP32 FMAs.

Figure 1: The Intel® AVX-512 VDPBF16PS instruction multiplies 32 BF16 value pairs and accumulates to 16 FP32 values, at a throughput of one instruction per clock cycle per core per FMA unit.

Multiple platform support

Major OEM system providers are expected to offer the 3rd Gen Intel Xeon Scalable processor primarily in 4- and 8-socket configurations, balancing cost and performance. The large memory capacity of a 4S or 8S system enables hosting large embedding tables in main memory for faster training, such as in SparseNNs. Intel DL Boost BF16 instructions drive better performance for training and serving models. Inference workloads can also benefit from Intel DL Boost with either BF16 or INT8 acceleration, depending on the model and accuracy requirements.

Software Advancements

Intel and Facebook have collaborated to bring BF16 performance improvements to CPUs with the latest hardware advancements, and to offer a simple programming interface for enabling BF16. PyTorch provides a simple API for users to enable the BF16 data type for deep learning as a drop-in replacement for FP32 models. Machine learning practitioners usually design and train DNNs in FP32 first before converting the model to a lower-precision data type for better performance. Unlike FP16, which typically requires loss-scaling tuning to achieve training accuracy comparable to FP32, BF16 works without tuning. Users only need to convert the model input data to BF16 to enable BF16 training from an existing FP32 model. PyTorch 1.3+ automatically propagates the BF16 data type through all operations of the model.
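
The snippet below is a minimal sketch of this drop-in usage: an FP32 model and its input data are cast to torch.bfloat16 and trained as usual. The model, data, and loss are placeholders, and the exact enabling mechanism (for example, automatic propagation from BF16 inputs alone) may vary with the PyTorch and Intel Extension for PyTorch versions.

```python
import torch
import torch.nn as nn

# Start from an existing FP32 model, then cast the model and its input
# data to BF16 so the operators run on the BF16 data type.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model = model.to(torch.bfloat16)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(32, 256).to(torch.bfloat16)   # convert the input data to BF16
targets = torch.randint(0, 10, (32,))              # placeholder labels

optimizer.zero_grad()
logits = model(inputs)
loss = criterion(logits.float(), targets)          # reduce the loss in FP32
loss.backward()
optimizer.step()
print(loss.item())
```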

The PyTorch ATen operators are implemented on top of basic scalar or vector operations associated with the data type in the operator template. For most operations, Intel added BF16 support as an operator overload of these basic operators; for example, BF16 ReLU is built on top of the BF16 min and max operators. Special care is taken for operations that contain an internal reduction, which is accumulated in higher precision (FP32) to preserve model accuracy.
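
The following sketch illustrates the pattern in plain PyTorch (it is illustrative only, not the actual ATen kernel code): BF16 data is upcast to FP32 before the reduction so the accumulation happens in higher precision, while element-wise operators such as ReLU can stay in BF16 and be built from min/max primitives.

```python
import torch

# Illustrative only: keep the data in BF16, upcast to FP32 before the
# reduction, accumulate in FP32, then cast the result back to BF16.
x_bf16 = torch.rand(1_000_000, dtype=torch.bfloat16)

acc_fp32 = x_bf16.to(torch.float32).sum()      # reduction accumulated in FP32
result_bf16 = acc_fp32.to(torch.bfloat16)      # result stored back in BF16

# Element-wise operators can remain in BF16, e.g. a ReLU built from max:
relu_bf16 = torch.maximum(x_bf16 - 0.5, torch.zeros_like(x_bf16))

print(acc_fp32.item(), result_bf16.item(), relu_bf16.dtype)
```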

Intel further optimized the popular torch.nn ops, such as convolution, matrix multiplication, batch normalization, ReLU, and pooling, using the oneAPI Deep Neural Network Library (oneDNN, formerly called the Intel® MKL-DNN library). PyTorch 1.5+ includes oneDNN with BF16 optimizations for popular operations using the 3rd Gen Intel Xeon processors’ native BF16 instructions. The Intel AVX-512 VDPBF16PS instruction multiplies 32 BF16 value pairs and accumulates to 16 FP32 values within one cycle per core per FMA unit, as shown in Figure 1, and the VCVTNE2PS2BF16 instruction converts 32 FP32 values to 32 BF16 values. BF16 also helps memory-bound layers that are not covered by oneDNN: for those operations, BF16 is used for data movement, and the computations are emulated with FP32 FMA instructions. These optimizations minimize data transfers and ensure effective use of the SIMD instructions, execution units, registers, and memory cache hierarchy.
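
As a rough sanity check of this path, the snippet below (assuming a recent PyTorch build with oneDNN/MKL-DNN support) verifies that the backend is available and runs a BF16 convolution; whether it maps to native BF16 instructions or falls back to emulation depends on the CPU and the PyTorch build.

```python
import torch
import torch.nn as nn

# Confirm the oneDNN (MKL-DNN) backend is built in, then run a BF16
# convolution that oneDNN can execute with native BF16 instructions
# on CPUs that support Intel DL Boost BF16.
print(torch.backends.mkldnn.is_available())

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).to(torch.bfloat16)
x = torch.randn(1, 64, 56, 56, dtype=torch.bfloat16)

with torch.no_grad():
    y = conv(x)
print(y.dtype)   # torch.bfloat16
```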

The 3rd Gen Intel Xeon Scalable processor’s built-in BF16→FP32 FMA instruction doubles the theoretical throughput over FP32 FMAs, and the BF16 format alone improves the memory access performance by up to 2x. The oneDNN library provides a highly optimized implementation for BF16 convolution and GEMM operations (as well as those fused with activations). During training, a copy of the weights is stored in FP32 for the weight updates. Therefore, the model execution also needs to handle the FP32 data type.
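
A sketch of this master-weight pattern is shown below (illustrative only, not the library’s internal implementation): the forward and backward passes run in BF16, the gradients are upcast, the optimizer updates an FP32 copy of the weights, and the BF16 compute copy is refreshed from it.

```python
import torch
import torch.nn as nn

# Compute in BF16, but keep an FP32 master copy of the weights for updates.
model = nn.Linear(128, 64).to(torch.bfloat16)                       # BF16 compute copy
master = [p.detach().clone().float() for p in model.parameters()]   # FP32 copy for updates
optimizer = torch.optim.SGD(master, lr=0.01)

x = torch.randn(32, 128, dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()    # forward in BF16, dummy loss reduced in FP32
loss.backward()                          # gradients land on the BF16 parameters

with torch.no_grad():
    for p_bf16, p_fp32 in zip(model.parameters(), master):
        p_fp32.grad = p_bf16.grad.float()              # upcast gradients to FP32
    optimizer.step()                                   # weight update in FP32
    for p_bf16, p_fp32 in zip(model.parameters(), master):
        p_bf16.copy_(p_fp32.to(torch.bfloat16))        # refresh the BF16 copy
```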

Facebook published the Deep Learning Recommendation Model (DLRM) (see code). Intel improved the performance of all the components of DLRM, including the multi-layer perceptron (MLP) layers, the interactions, and the embeddings. On top of a well-optimized FP32 DLRM training model, BF16 provides over 1.40x end-to-end training gains (keeping the optimizer in FP32), and INT8 provides 2.85x inference gains.

Results

Table 1 shows 1.64x and 1.60x gains using BF16 to train ResNet-50 and ResNeXt-101 32x4d, respectively, on a 4-socket 3rd Gen Intel Xeon Scalable processor, and a 1.40x gain using BF16 for DLRM training on a single-socket 3rd Gen Intel Xeon processor. These results take advantage of PyTorch’s native integration with oneDNN.

Training          | # Cores per instance | # Instances | BF16 (samples/s) | FP32 (samples/s) | Speedup Ratio
------------------|----------------------|-------------|------------------|------------------|--------------
DLRM              | 28                   | 1           | 99321            | 71061            | 1.40
ResNet-50         | 28                   | 4           | 399              | 243              | 1.64
ResNeXt-101 32x4d | 28                   | 4           | 193              | 120              | 1.60

Table 1. Single-instance BF16 training performance gains over baseline (FP32 with the Intel® Math Kernel Library (Intel® MKL)) using batch size 2K for the MLPerf DLRM model, and four-instance BF16 training performance gains over baseline (FP32 with Intel oneDNN) using batch size 128/instance for ResNet-50/ResNeXt-101 32x4d. The DLRM model hyperparameters use the MLPerf configuration.

Intel also optimized DLRM inference with the low-precision INT8 data type. The embeddings are quantized to INT8, which reduces the memory footprint of the large embedding tables by nearly 4x. The compute-intensive MLPs are accelerated with INT8 instructions and achieve over 3.5x speed-up. The end-to-end model speed-up is 2.85x, as shown in Table 2.

Inference | # Cores per instance | # Instances | INT8 (samples/s) | FP32 (samples/s) | Speedup Ratio
----------|----------------------|-------------|------------------|------------------|--------------
DLRM      | 1                    | 28          | 611082           | 214559           | 2.85

Table 2. INT8 inference performance gains over baseline (FP32 with Intel MKL) using batch size 16 and 28 instances of DLRM on a single-socket 3rd Gen Intel Xeon processor, integrated with the PyTorch multi-instance shared-weight solution and an 8-bit implementation for the MLPs and EmbeddingBag.
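
As a generic illustration of running compute-intensive MLP layers in INT8, the snippet below uses PyTorch’s dynamic quantization API to convert Linear layers to INT8 for inference. This is a stand-in for the idea only; the DLRM solution described above uses its own dedicated 8-bit implementation for the MLPs and EmbeddingBag.

```python
import torch
import torch.nn as nn

# Placeholder MLP, quantized dynamically so its Linear layers run in INT8.
mlp = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 1))

int8_mlp = torch.quantization.quantize_dynamic(mlp, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(16, 512)          # batch size 16, matching the inference benchmark
with torch.no_grad():
    out = int8_mlp(x)
print(out.shape)
```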

Conclusion

Intel and Facebook will continue to collaborate to accelerate PyTorch training and inference across multiple data types. We have enabled and optimized the BF16 data type for PyTorch, improving representative computer vision model training performance by up to 1.64x and DLRM model training performance by up to 1.40x over FP32. The 3rd Gen Intel Xeon Scalable processor’s acceleration for INT8 inference improved DLRM model inference performance by up to 2.85x over FP32.

 

About the authors

Andres Rodriguez, PhD, is a Sr. Principal Engineer and the lead AI Architect at Intel Data Platform Group (DPG) where he designs deep learning solutions for cloud and enterprise customers and provides technical leadership across Intel for deep learning products. He has 15 years of experience working in artificial intelligence. Andres received his PhD from Carnegie Mellon University for his research in machine learning. He holds over 20 peer-reviewed publications in journals and conferences and a book chapter on machine learning.

Jianhui Li, PhD, is a principal engineer in the Intel Architecture, Graphics and Software group and leads deep learning framework integration and workload optimization. He was a software developer for binary translation and JIT compilers and led the development of Houdini, which runs Android* ARM applications transparently, with a comparable user experience, on IA-based platforms. Jianhui received his PhD in computer science from Fudan University. He holds 21 US patents in binary translation and real-life application optimization.

Jiong Gong is a senior software engineer in the Intel Architecture, Graphics and Software group, where he is a software architect for PyTorch framework optimization on Intel Architecture and one of the major contributors to low-precision inference solutions on IA. He has 8 years of full-stack experience in artificial intelligence, from AI applications to framework, library, and compiler optimizations. Jiong received his master’s degree in computer science from Shanghai Jiao Tong University.

Hongzhen Liu is a senior software engineer in the Intel Architecture, Graphics and Software group and leads the development and optimization of low-precision (BFloat16, FP16, INT8) solutions for deep learning frameworks on Intel Architecture. He has rich experience in parallel computing, full-stack software optimization, and high-performance math kernel design for AI applications. Hongzhen received his master’s degree in Pattern Recognition and Intelligent Systems from Southeast University.

Shivani Sud is a system architect working on cloud technologies and ML system architecture. She has been a leading contributor to telco network transformation to software-defined infrastructure. Prior to that her research contributions have been in next-gen mobile devices, multi-device usages, and platform security. She holds 7 US patents and has authored numerous peer-reviewed publications.

Configuration Details

ResNet50/ResNext101 (FP32/BF16): batch size 128/instance, 4 instances.
ResNet50/ResNext101 dataset (FP32/BF16): ImageNet Dataset
DLRM batch size (FP32/BF16): 2K/instance, 1 instance
DLRM dataset (FP32/BF16): Criteo Terabyte Dataset
DLRM batch size (INT8): 16/instance, 28 instances, dummy data.
Tested by Intel as of 6/2/2020.
Intel Xeon Platinum 8380H Processor, 4 socket, 28 cores HT On Turbo ON Total Memory 768 GB (24 slots/ 32GB/ 3200 MHz), BIOS: WLYDCRB1.SYS.0015.P96.2005070242 (ucode: 0x700001b), Ubuntu 20.04 LTS, kernel 5.4.0-29-generic
PyTorch: https://github.com/pytorch/pytorch.git
Intel Extension for PyTorch: https://github.com/intel/intel-extension-for-pytorch.git
gcc: 8.4.0,
oneDNN version: v1.4
ResNet50:  https://github.com/intel/optimized-models/tree/master/pytorch/ResNet50
ResNext101 32x4d:  https://github.com/intel/optimized-models/tree/master/pytorch/ResNext101_32x4d
DLRM:  https://github.com/intel/optimized-models/tree/master/pytorch/dlrm

References

https://software.intel.com/content/www/us/en/develop/articles/intel-and-facebook-collaborate-to-boost-pytorch-cpu-performance.html

https://www.facebook.com/yann.lecun/posts/10155891487497143

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.  Any change to any of those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit:  http://www.intel.com/performance.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.
