Part I - Introduction
Whether you are a data scientist, AI developer, or application developer, or you work in another role involved with AI, improving the runtime performance of your AI models is crucial to:
- Efficiently use the compute resources that you already have.
- Reduce the cost of acquiring new compute resources.
- Improve turnaround time for everyone working on the model.
- Increase the complexity of what you can model with a given latency goal.
This guide is for anyone who wants to improve model performance on Intel® CPUs—whether you are trying to meet an aggressive baseline performance target or you just want to experiment to see if you can get more out of your system.
Part II - System Configuration
The first step is to simply check configurations at the operating system level. These are good places to start because they can quickly improve overall system performance.
Intel® System Health Inspector
To quickly gather these system-level parameters, you can use Intel® System Health Inspector.
Just run this utility to examine your system profile, test for performance issues, and get recommendations for improvements. It generates both an HTML report and a JSON file showing all the settings discussed in this section. It also provides recommendations for optimal settings, so you can quickly iterate over which settings to potentially change.
CPU Frequency Governors
The first system configuration to pay attention to is clock frequency. This is controlled by a driver, which in turn has various governors that dictate how the per-core frequency should scale.
The default driver on most Intel-powered systems is the Intel® P-State driver (intel_pstate). This example shows how to check your driver information:
$ cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: Cannot determine or is not supported.
hardware limits: 400 MHz - 4.50 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 400 MHz and 4.50 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 1.31 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes
The output shows that the available cpufreq governors are performance and powersave. Typically the default governor is powersave – you can check your system with the following command:
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
powersave
powersave
...
If you are using powersave, you can typically improve performance by simply setting the governor to performance using one of the following methods:
# echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
Or using the cpupower tool, if available:
# cpupower frequency-set -g performance
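After changing the governor, it can help to confirm that every logical CPU picked up the new setting. The following is a minimal sketch, assuming the standard Linux sysfs layout shown above; the function names governor_summary and all_cores_use are illustrative and not part of any Intel tool.

```python
from collections import Counter
from pathlib import Path


def governor_summary(cpu_root="/sys/devices/system/cpu"):
    """Count how many logical CPUs use each cpufreq governor."""
    counts = Counter()
    # Each logical CPU exposes its governor in cpuN/cpufreq/scaling_governor.
    for gov_file in Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[gov_file.read_text().strip()] += 1
    return dict(counts)


def all_cores_use(governor, cpu_root="/sys/devices/system/cpu"):
    """Return True only if at least one CPU was found and all use `governor`."""
    summary = governor_summary(cpu_root)
    return bool(summary) and set(summary) == {governor}
```

Calling all_cores_use("performance") before each benchmark run is one way to catch a governor that was reset by a reboot or by a power-management daemon.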
Learn more about CPU frequency scaling, governors, and drivers.
Intel® Hyper-Threading Technology
Hyper-Threading allows more than one thread to run on each core. But many TensorFlow* and PyTorch* models do not benefit from Hyper-Threading. For this reason, Part III shows how to use various tools to bind workloads to a specific subset of cores on the system.
However, if your workload contains other components besides the TensorFlow or PyTorch model, you can test the overhead of Hyper-Threading to help determine the best approach for your workload.
First, check if you have Hyper-Threading enabled:
$ cat /sys/devices/system/cpu/smt/control
on
The output shows that it is enabled. If you wish to disable it, use the following command at runtime. Note that this will reset after a reboot:
echo off | sudo tee /sys/devices/system/cpu/smt/control
Linux* Kernel Version
Differences in Linux* kernel versions can affect performance results. So if you are comparing performance between multiple systems, be sure to use identical kernel versions.
In general, there is no single kernel version that offers the best performance for all workloads. However, Linux kernel version 5.16 introduces support for Intel® Advanced Matrix Extensions (Intel® AMX). Therefore, for systems based on 4th Gen Intel® Xeon® Scalable processors (code-named Sapphire Rapids) or Intel® Xeon® CPU Max Series (code-named Sapphire Rapids HBM), be sure to use at least version 5.16.
Transparent Huge Pages (THP)
The Transparent Huge Pages (THP) feature causes the kernel to aggregate individual 4KB pages into larger (usually 2MB) pages. This may or may not improve the performance of your workload, and it usually requires some experimentation.
There are three possible values: always, madvise, and never. The default value, madvise, allows applications to use the madvise system call to request that specific page ranges should use THP. The always setting attempts to aggregate pages regardless of application hints, and never disables THP entirely.
The following command checks which setting you are using:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never
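The active value is the one shown in square brackets. If you record system settings alongside benchmark results, a small parser like the following sketch can extract it; the function name active_thp_mode is an illustrative choice.

```python
import re


def active_thp_mode(text):
    """Return the bracketed (active) mode from the THP 'enabled' file contents."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


print(active_thp_mode("always [madvise] never"))  # prints: madvise
```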
Usually, the best way to change this setting is in the kernel parameters, followed by a reboot. Edit the file /etc/default/grub, adding the setting to the kernel command-line invocation:
GRUB_TIMEOUT=5
GRUB_DEFAULT=saved
GRUB_CMDLINE_LINUX="transparent_hugepage=never" # HERE
GRUB_DISABLE_RECOVERY="true"
Then re-create your GRUB boot configuration with something like this (depending on your distribution):
grub-mkconfig -o /boot/grub/grub.cfg
Part III - Benchmarking Methodology
Many AI workloads offer plenty of opportunities to improve performance by changing environment variables or software libraries.
A general approach for performance testing is as follows:
- Choose a model and dataset that you want to optimize.
- Choose a use case that suits your application and write a script to run it (see the use cases and example commands later in this part).
- Establish a baseline run with no system or application changes.
- Make one (1) change.
- Rerun your benchmarking script and compare the result.
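The steps above can be sketched as a small timing harness. This is only an illustration under simple assumptions: the workload function is a placeholder for your own inference call, and the warm-up and iteration counts are arbitrary starting points rather than recommended values.

```python
import statistics
import time


def benchmark(fn, warmup=3, iterations=20):
    """Time fn() and return the median and mean latency in milliseconds."""
    for _ in range(warmup):  # warm-up runs are discarded
        fn()
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
    }


def speedup(baseline, candidate):
    """A ratio above 1.0 means the candidate run is faster than the baseline."""
    return baseline["median_ms"] / candidate["median_ms"]


def workload():
    """Placeholder for a real inference call."""
    sum(i * i for i in range(10_000))


baseline = benchmark(workload)
print(f"baseline median: {baseline['median_ms']:.3f} ms")
```

Run the harness once for the baseline, make a single change, run it again, and compare the two results with speedup. Changing one thing at a time keeps the cause of any difference unambiguous.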
General AI Performance Methodology
You can perform deep learning inference using two different strategies, or use cases, each with a different metric for measuring performance.
The first is max throughput:
- Process as many examples per second as possible, passing in batches of size > 1.
- Typically uses all physical cores within a socket.
- Parallelize as much as possible within a single instance of the model.
The second is real-time inference (RTI), also referred to as low-latency:
- Process a single example as quickly as possible, with batch size = 1.
- Tries to avoid overhead from excessive threading or contention between processes.
- Confine instances to a smaller number of cores - in general, 2 or 4.
AI use cases usually benefit from binding to a specific subset of physical cores with numactl according to the following guidelines:
- For maximum throughput, try to use all physical cores on a given socket. One instance per socket:
numactl --localalloc --physcpubind=0-N [workload]
- For real-time inference, usually 2-4 cores per instance works best, with multiple instances of the model to fill all available cores. For example:
numactl --localalloc --physcpubind=0-3 [workload]
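For the real-time inference case, filling a socket with several small instances means generating one numactl command per core range. The sketch below assumes that physical cores on the socket are numbered contiguously from 0, which you should verify with lscpu on your system; the script name infer.py is hypothetical.

```python
def numactl_commands(total_cores, cores_per_instance, workload):
    """Build one numactl command per instance, each pinned to its own core range."""
    if cores_per_instance < 1 or total_cores < cores_per_instance:
        raise ValueError("need at least one full instance worth of cores")
    commands = []
    # Leftover cores that do not fill a whole instance are left unused.
    for first in range(0, total_cores - cores_per_instance + 1, cores_per_instance):
        last = first + cores_per_instance - 1
        commands.append(
            f"numactl --localalloc --physcpubind={first}-{last} {workload}"
        )
    return commands


for cmd in numactl_commands(8, 4, "python infer.py"):
    print(cmd)
```

With 8 cores and 4 cores per instance, this prints two commands bound to cores 0-3 and 4-7.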
Many AI and machine learning frameworks use oneAPI software libraries to accelerate performance across a variety of hardware architectures. Deep learning frameworks use Intel® oneAPI Deep Neural Network Library (oneDNN) for multiarchitecture acceleration of key building blocks. The method by which you can take advantage of oneDNN may differ by framework and version.
TensorFlow
oneDNN is built into open source TensorFlow starting with version 2.5 and can be enabled with an environment variable. Starting with TensorFlow 2.9, oneDNN is enabled by default in all x86 Linux packages when running on 2nd Gen Intel Xeon Scalable processors and newer CPUs. First, install TensorFlow:
pip3 install tensorflow
Then, depending on your version of TensorFlow, you may need to set the TF_ENABLE_ONEDNN_OPTS environment variable. The following table shows how to control which backend TensorFlow uses:
TensorFlow Version | oneDNN Backend | Eigen Backend |
---|---|---|
< 2.9 | export TF_ENABLE_ONEDNN_OPTS=1 | Enabled by default |
>= 2.9 | Enabled by default for 2nd Generation Intel® Xeon® Scalable Processors (formerly Cascade Lake) and later | export TF_ENABLE_ONEDNN_OPTS=0 |
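The version rule in the table can be written as a small helper for launch scripts. This is a sketch only: the function name onednn_env is illustrative, and it does not model the CPU-generation condition for 2.9 and later, so on older CPUs you may still want to set the variable explicitly.

```python
def onednn_env(tf_version, use_onednn):
    """Return the environment variables needed to select a TensorFlow CPU backend.

    Follows the table above: before 2.9 oneDNN is opt-in, from 2.9 it is opt-out.
    """
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    onednn_is_default = (major, minor) >= (2, 9)
    if use_onednn == onednn_is_default:
        return {}  # the default already matches the request
    return {"TF_ENABLE_ONEDNN_OPTS": "1" if use_onednn else "0"}


print(onednn_env("2.8.0", use_onednn=True))
print(onednn_env("2.12.1", use_onednn=False))
```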
Two additional environment variables can unlock additional TensorFlow performance:
- TF_ONEDNN_ASSUME_FROZEN_WEIGHTS
- TF_ONEDNN_USE_SYSTEM_ALLOCATOR
You can check the TensorFlow onednn_env_vars.cc file for any potential updates to these performance-critical environment variables.
TensorFlow supports two methods of parallelization:
- Model-level: the maximum number of operators that can be executed in parallel
  - Environment variable: TF_NUM_INTEROP_THREADS
  - The recommended setting for both the max throughput and the real-time inference use cases is 1
- Operator-level: the maximum number of threads to use for an individual operator
  - Environment variable: TF_NUM_INTRAOP_THREADS
  - The recommended setting for the RTI use case is to sweep from 1-4 cores
  - The recommended setting for the max throughput use case is the number of physical cores on a single socket of your system
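One way to apply these recommendations is to compute the two variables in Python and export them before TensorFlow is imported. The sketch below is illustrative: the helper name tf_thread_env and the core count of 32 are assumptions, so substitute the physical core count of your own socket.

```python
import os


def tf_thread_env(use_case, physical_cores_per_socket, rti_threads=4):
    """Return TensorFlow threading variables for 'throughput' or 'rti'."""
    if use_case == "throughput":
        intra = physical_cores_per_socket
    elif use_case == "rti":
        intra = rti_threads  # sweep this value from 1 to 4 when benchmarking
    else:
        raise ValueError("use_case must be 'throughput' or 'rti'")
    return {
        "TF_NUM_INTEROP_THREADS": "1",
        "TF_NUM_INTRAOP_THREADS": str(intra),
    }


# Set the variables before TensorFlow is imported so that it picks them up.
os.environ.update(tf_thread_env("rti", physical_cores_per_socket=32, rti_threads=2))
```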
For a tutorial that walks through these steps, see Optimize TensorFlow Pre-trained Model for Inference.
PyTorch
Install PyTorch with the following command:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
Intel® Extension for PyTorch extends PyTorch with the most up-to-date optimizations that take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel AVX-512 Vector Neural Network Instructions (AVX-512-VNNI), and Intel AMX instruction sets on the latest Xeon CPUs. You can install it with the following command:
pip3 install intel_extension_for_pytorch -f https://developer.intel.com/ipex-whl-stable-cpu
Enable these optimizations by adding just a couple of lines of code in a PyTorch program:
# Original model
model = ...
# Optimize with IPEX
import intel_extension_for_pytorch as ipex
model = ipex.optimize(model)
Consult the Intel Extension for PyTorch documentation for more details.
TorchScript captures the structure of PyTorch models and converts them into a static representation. It applies operator fusions and constant folding to reduce the overhead of operations and execution time. Intel Extension for PyTorch amplifies these performance advantages.
To use TorchScript together with Intel Extension for PyTorch:
import torch
import intel_extension_for_pytorch as ipex

model = ...
data = ...
optimized_model = ipex.optimize(model)
with torch.no_grad():
    traced_model = torch.jit.trace(optimized_model, data)
    model = torch.jit.freeze(traced_model)
oneDNN Graph is an experimental backend within TorchScript, which further fuses operators within the model. An example of using oneDNN Graph:
import torch

model = ...
data = ...
torch.jit.enable_onednn_fusion(True)
with torch.no_grad():
    traced_model = torch.jit.trace(model, data)
    model = torch.jit.freeze(traced_model)
# use the model for inference
oneDNN Graph is available in the PyTorch nightly release as well as the main branch of the PyTorch source code.
Part IV - Model Analysis
If you find a performance issue that you think resulted from changing one of the settings previously discussed, debugging the issue requires collecting traces while running the model. An example scenario typically proceeds as follows:
- You identify that oneDNN is not showing any performance gains, or showing a performance regression, compared to another backend.
- Depending on your framework, collect traces of your model's runtime with and without oneDNN.
- Visualize the traces, observe the top-10 operator list or statistics. oneDNN operators will begin with _Mkl.
- Identify the issue by comparing the amount of time that the CPU spends on a particular operator.
You can use this approach to analyze a variety of scenarios, such as the modification of batch size, environment variables, or framework API calls. This is typically an iterative process and may require assistance.
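Once per-operator timings have been exported from two traces, the comparison step can be sketched as follows. The input dictionaries (operator name to total milliseconds) and their numbers are made up for illustration, and stripping the _Mkl prefix to pair operators is a simplifying assumption that will not match every rewritten or fused operator.

```python
def normalize(op_name):
    """Strip the _Mkl prefix so oneDNN operators line up with their counterparts."""
    return op_name[len("_Mkl"):] if op_name.startswith("_Mkl") else op_name


def totals(times_ms):
    """Sum times per normalized operator name."""
    merged = {}
    for op, ms in times_ms.items():
        key = normalize(op)
        merged[key] = merged.get(key, 0.0) + ms
    return merged


def compare_operator_times(baseline_ms, candidate_ms, top=10):
    """Rank operators by how much slower they are in the candidate run."""
    base, cand = totals(baseline_ms), totals(candidate_ms)
    rows = []
    for op in set(base) | set(cand):
        before, after = base.get(op, 0.0), cand.get(op, 0.0)
        rows.append((op, before, after, after - before))
    # Largest slowdown first; ties are broken by operator name.
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows[:top]


eigen = {"Conv2D": 12.0, "MatMul": 5.0}        # made-up numbers
onednn = {"_MklConv2D": 15.0, "MatMul": 3.0}   # made-up numbers
for op, before, after, delta in compare_operator_times(eigen, onednn):
    print(f"{op:10s} {before:6.1f} -> {after:6.1f} ms ({delta:+.1f})")
```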
The following sections describe framework-specific analysis tools, and for additional support you can post that information to the Intel® Optimized AI Frameworks forum.
TensorFlow
- Instrument code. Use the TensorFlow Profiler to collect performance data from your model.
  If you are interested in the resulting graph, you can generate it by running your model with the following environment variables set: TF_CPP_MAX_VLOG_LEVEL=1 TF_DUMP_GRAPH_PREFIX=./dump_graphs
- Analyze. Follow the steps in the TensorFlow Profiler tutorial to analyze performance for both inference and training.
- Visualize. Use either TensorBoard* or Chrome* tracing to visualize the TensorFlow Profiler output.
PyTorch
- Instrument code. Use the PyTorch profiler to collect performance data from your model.
- Analyze the execution time as shown in the profiler guide. You can also analyze memory consumption.
- Visualize profiler stack traces as text output, using Chrome tracing, or using TensorBoard with PyTorch.
oneDNN
- Generate verbose output. Use run-time controls to generate information on primitive calls and their execution time.
- Trace oneDNN kernel usage. Use the oneDNN verbose log parser to visualize usage of primitive types and JIT kernels.
- Deeply analyze performance. Get started with the examples shown on the oneDNN performance profiling guide.
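As a lightweight complement to these tools, verbose output captured to a log file can be summarized with a few lines of Python. This sketch assumes the comma-separated verbose format in which the primitive kind appears two fields after the exec marker and the last field is the execution time in milliseconds; the exact layout differs between oneDNN versions, and the sample lines below are simplified, so check a line from your own log first.

```python
from collections import defaultdict


def primitive_time_ms(verbose_lines):
    """Sum execution time per primitive kind from oneDNN verbose 'exec' lines."""
    totals = defaultdict(float)
    for line in verbose_lines:
        fields = line.strip().split(",")
        if not fields[0].endswith("_verbose") or "exec" not in fields:
            continue  # skip info lines and unrelated program output
        exec_idx = fields.index("exec")
        try:
            kind = fields[exec_idx + 2]     # e.g. convolution, inner_product
            elapsed_ms = float(fields[-1])  # last field: time in milliseconds
        except (IndexError, ValueError):
            continue
        totals[kind] += elapsed_ms
    return dict(totals)


sample_log = [  # simplified, illustrative lines
    "onednn_verbose,info,cpu,runtime:OpenMP",
    "onednn_verbose,exec,cpu,convolution,jit:avx512_core,forward_inference,0.25",
    "onednn_verbose,exec,cpu,convolution,jit:avx512_core,forward_inference,0.5",
    "onednn_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,0.125",
]
print(primitive_time_ms(sample_log))
```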
Part V - AI Model Precision
Usually, model parameters are stored as 32-bit floating-point values (FP32 format). For many models, using lower-precision formats such as bfloat16 or int8 can significantly improve throughput and latency while often maintaining accuracy.
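To make the trade-off concrete, bfloat16 keeps the sign bit and all 8 exponent bits of FP32 but only the top 7 mantissa bits. The sketch below emulates the conversion in plain Python by truncating the lower 16 bits; note that truncation is a simplification, since real hardware and frameworks typically round to nearest.

```python
import struct


def to_bfloat16(value):
    """Truncating conversion of a number to bfloat16, returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, and the top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_bfloat16(3.14159))  # prints 3.140625
print(to_bfloat16(1.0e38))   # still finite: bfloat16 keeps the FP32 exponent range
```

The first value loses digits after the second decimal place, while the second shows that very large magnitudes remain representable, which is why bfloat16 often works without the loss scaling that narrower-exponent formats can need.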
Lower-precision formats reduce memory consumption and bandwidth, which is often a model’s key performance limiter. And using these formats enables advanced Instruction Set Architectures (ISAs) available on Intel CPUs. This table shows which ISAs take advantage of each datatype:
Supported ISA | FP32 | int8 | bfloat16 |
---|---|---|---|
 | Yes | Yes | Yes |
 | No | Yes | Only on 3rd Gen and 4th Gen Intel Xeon Scalable processors |
 | No | Yes | Yes |
You can see if your hardware supports an ISA with this guide.
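On Linux, a quick local check is to look for the relevant feature flags in /proc/cpuinfo. The sketch below uses the kernel's flag names for AVX-512, AVX-512 VNNI, AVX-512 BF16, and AMX; the grouping into ISA_FLAGS is an illustrative choice, and the sample text stands in for the real file contents.

```python
ISA_FLAGS = {
    "AVX-512": ["avx512f"],
    "AVX-512 VNNI": ["avx512_vnni"],
    "AVX-512 BF16": ["avx512_bf16"],
    "AMX": ["amx_tile", "amx_int8", "amx_bf16"],
}


def supported_isas(cpuinfo_text):
    """Map each ISA name to True if all of its flags appear in the cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
            break  # the flags of the first logical CPU are enough
    return {
        isa: all(flag in flags for flag in needed)
        for isa, needed in ISA_FLAGS.items()
    }


sample = "processor\t: 0\nflags\t\t: fpu sse2 avx2 avx512f avx512_vnni\n"
print(supported_isas(sample))
```

On a real system, pass the contents of /proc/cpuinfo instead of the sample string.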
The method by which you convert your model to a lower-precision datatype depends on your framework.
PyTorch
Automatic Mixed Precision (AMP) modifies a model such that some operations use the FP32 datatype while others use a lower-precision datatype, such as bfloat16. Operations such as Linear or the Convolution operations are much faster in lower-precision computation. Other operations, like Reduction, often require the dynamic range of FP32. Use autocast as a Python* context manager (it can also be applied as a decorator) to enable regions of the script to use AMP:
import torch

model = ...
data = ...
# autocast takes the target dtype, not the model
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    model(**data)
You can use this together with Intel Extension for PyTorch by calling ipex.optimize(model) before calling autocast.
TensorFlow
Similar to PyTorch, TensorFlow mixed precision supports mixing FP32, FP16, and bfloat16 datatypes. Intel contributed support for the bfloat16 format. Mixed precision is supported in graph-based and Keras* models by adding a couple lines of code. The following snippets illustrate:
bfloat16 with Keras models:
from tensorflow import keras
from tensorflow.keras import layers
# Enable bfloat16
from tensorflow.keras import mixed_precision
policy = mixed_precision.Policy('mixed_bfloat16')
mixed_precision.set_global_policy(policy)
bfloat16 with graph-based models:
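import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

graph_options = tf.compat.v1.GraphOptions(
    rewrite_options=rewriter_config_pb2.RewriterConfig(
        auto_mixed_precision_mkl=rewriter_config_pb2.RewriterConfig.ON)
)
with tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(
        graph_options=graph_options)) as sess:
    ...  # build and run the graph with sess.run(...)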
bfloat16 with saved or TensorFlow Hub models:
import tensorflow as tf
import tensorflow_hub as hub

hub_model = hub.load(model_handle)
tf.config.optimizer.set_experimental_options(
    {'auto_mixed_precision_mkl': True})
results = hub_model(input)
For more detail and examples, see Getting Started with Mixed Precision Support in oneDNN Bfloat16.
Model Quantization
Quantization is the process of converting a model to use lower-precision datatypes while trying to meet an accuracy goal. This allows for a more compact model representation and usage of high-performance vectorized operations, as in the ISAs described above.
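The arithmetic behind int8 quantization can be illustrated without any framework. The sketch below shows one common scheme, asymmetric (affine) quantization with a scale and a zero point; it is not the specific algorithm used by any particular library, and the function names are illustrative.

```python
def quantization_params(min_val, max_val, qmin=-128, qmax=127):
    """Compute scale and zero point for asymmetric int8 quantization."""
    # The representable range must include 0.0 so that zero maps exactly.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate range: every value is zero
    zero_point = int(round(qmin - min_val / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to an int8 code, clamping to the representable range."""
    return max(qmin, min(qmax, int(round(x / scale)) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an int8 code back to an approximate real value."""
    return (q - zero_point) * scale


scale, zero_point = quantization_params(-64.0, 63.5)  # scale 0.5, zero point 0
q = quantize(1.3, scale, zero_point)                  # q is 3
print(q, dequantize(q, scale, zero_point))            # prints: 3 1.5
```

The round trip turns 1.3 into 1.5, an error of at most half the scale. Calibration in real quantization tools is largely about choosing the value range so that this error stays within the accuracy goal.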
You can perform basic quantization using the APIs in PyTorch and TensorFlow, or for more customization, a dedicated product such as Intel® Neural Compressor. Intel Extension for PyTorch uses Intel Neural Compressor for both static and dynamic quantization. Here is an example of static quantization:
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

model = ...
example_inputs = ...
calibration_data_set = ...

# Define the quantization configuration
qconfig = ipex.quantization.default_static_qconfig
# Prepare the model for quantization
prepared_model = prepare(model, qconfig, example_inputs=example_inputs)
# Calibrate the model
for x in calibration_data_set:
    prepared_model(x)
# Convert the model, then trace and freeze it with TorchScript
converted_model = convert(prepared_model)
with torch.no_grad():
    traced_model = torch.jit.trace(converted_model, example_inputs)
    model = torch.jit.freeze(traced_model)
# Use the model for inference
Intel Neural Compressor can quantize your model with a unified interface across deep-learning frameworks. Using technologies such as quantization, pruning, and knowledge distillation, the scripts provided by Intel Neural Compressor can generate highly optimized quantized models which meet your specified accuracy goals. See Intel Neural Compressor examples repository to learn how to get started with various frameworks and model types.
Part VI - Support Resources
If you have any further questions on the suggestions in this document, or if you would like further support in improving AI performance on CPUs, a good place to start is the Intel Optimized AI Frameworks developer forum.
To learn more about Intel’s full offering of AI tools and frameworks optimized for CPU and GPU, visit developer.intel.com/ai.