Guide to TensorFlow* Runtime Optimizations for CPU

ID 764550
Updated 11/29/2022
Version Latest
Public


Overview

Runtime settings can greatly affect the performance of TensorFlow* workloads running on CPUs, particularly settings that control threading and data layout.

OpenMP* and TensorFlow both have settings that should be considered for their effect on performance. The Intel® oneAPI Deep Neural Network Library (oneDNN) within Intel® Optimization for TensorFlow* reads OpenMP settings from environment variables that affect performance on Intel CPUs. TensorFlow has its own performance settings, exposed through the ConfigProto class in 1.x and the tf.config module in 2.x.

Most of the recommendations work on both official x86-64 TensorFlow and Intel® Optimization for TensorFlow. Some recommendations, such as OpenMP tuning, apply only to Intel® Optimization for TensorFlow.

This guide describes how to set these runtime variables to optimize TensorFlow* for CPU.

OpenMP* settings descriptions

  • OMP_NUM_THREADS
    • Maximum number of threads to use for OpenMP parallel regions if no other value is specified in the application.
    • Recommend: start with the number of physical cores per socket on the test system, then try increasing and decreasing.
  • KMP_BLOCKTIME
    • Time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.
    • Recommend: start with 1 and try increasing.
  • KMP_AFFINITY
    • Restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer. The recommended value below assumes Hyper-Threading is enabled.
    • Recommend: granularity=fine,verbose,compact,1,0
  • KMP_SETTINGS
    • Enables (TRUE) or disables (FALSE) printing of OpenMP run-time library environment variables during execution.
    • Recommend: Start with TRUE to ensure settings are being utilized, then use as needed.

How to apply OpenMP settings

These settings are applied as environment variables, using either of two methods:

  • Shell
    • Example:
export OMP_NUM_THREADS=16
  • Python code
    • Example:
import os
os.environ["OMP_NUM_THREADS"] = "16"
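
All four OpenMP variables can also be set together from Python. A minimal sketch (the values shown are illustrative starting points, and the variables must be set before TensorFlow is imported so that oneDNN picks them up):

import os

# Illustrative starting values; tune for your own system.
os.environ["OMP_NUM_THREADS"] = "16"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["KMP_SETTINGS"] = "TRUE"

import tensorflow as tf  # import after setting the variables so they take effect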

TensorFlow* settings

  • intra_op_parallelism_threads
    • Number of threads used for parallelism within an individual op.
    • Recommend: start with the number of physical cores per socket on the test system, then try increasing and decreasing.
  • inter_op_parallelism_threads
    • Number of threads used for parallelism between independent operations.
    • Recommend: start with the number of physical cores on the test system, then try increasing and decreasing.
  • device_count
    • Maximum number of devices (CPUs in this case) to use.
    • Recommend: start with the number of physical cores per socket on the test system, then try increasing and decreasing.
  • allow_soft_placement
    • Set to True so that operations fall back to the CPU when no GPU implementation is available.

How to apply TensorFlow settings

These settings are applied in Python* code using ConfigProto (TensorFlow 1.x) or tf.config (TensorFlow 2.x).

  • Example in TensorFlow 1.x:
import tensorflow as tf
config = tf.ConfigProto(intra_op_parallelism_threads=16,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': 16})
session = tf.Session(config=config)
  • Example in TensorFlow 2.x (using the same illustrative values):
import tensorflow as tf
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(16)
tf.config.set_soft_device_placement(True)
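
To confirm that the values took effect, the threading settings can be read back with the matching getter functions:

print(tf.config.threading.get_inter_op_parallelism_threads())  # 2
print(tf.config.threading.get_intra_op_parallelism_threads())  # 16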

Intel® oneDNN enabling settings

TensorFlow* is highly optimized with the Intel® oneAPI Deep Neural Network Library (oneDNN) on CPUs. Since v2.5, the oneDNN optimizations are available in both the official x86-64 TensorFlow binary and Intel® Optimization for TensorFlow*.

  • Users can enable the oneDNN optimizations by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1 for the official x86-64 TensorFlow v2.5-v2.8. Since v2.9, no environment setting is needed because oneDNN is the default DNN library in official x86-64 TensorFlow.
export TF_ENABLE_ONEDNN_OPTS=1
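
The same flag can also be set from Python, provided it is set before TensorFlow is imported:

import os

os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"  # must be set before importing tensorflow
import tensorflow as tf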

  • Users can enable or disable use of the oneDNN blocked data format in TensorFlow with the TF_ENABLE_MKL_NATIVE_FORMAT environment variable. Exporting TF_ENABLE_MKL_NATIVE_FORMAT=0 makes TensorFlow use the oneDNN blocked data format. See the oneDNN memory format documentation for more information about the blocked data format.

We recommend enabling the native format with the command below to achieve good out-of-the-box performance.
export TF_ENABLE_MKL_NATIVE_FORMAT=1
oneDNN-related environment variables used within TensorFlow 2.9:

Environment Variable | Default | Purpose
TF_ENABLE_ONEDNN_OPTS | True | Stock TensorFlow: enables/disables the oneDNN optimizations.
TF_ONEDNN_ASSUME_FROZEN_WEIGHTS | False | AreWeightsFrozen(): tells TensorFlow whether weights are frozen; better performance is achieved with frozen graphs. Set for inference only. Related ops: forward conv, fused matmul.
TF_ONEDNN_USE_SYSTEM_ALLOCATOR | False | UseSystemAlloc(): tells oneDNN whether to use the system allocator. Used by MklCPUAllocator; set to true for better performance when allocations are small.
TF_MKL_ALLOC_MAX_BYTES | 64 | MklCPUAllocator: sets an upper bound on memory allocation. Unit: GB.
TF_MKL_OPTIMIZE_PRIMITIVE_MEMUSE | False | Enables/disables optimization of primitive memory use. Disabling primitive caching reduces memory usage but impacts performance; set to false to enable primitive caching.
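
As a usage sketch, an inference run on a frozen graph might combine these variables as follows (the values are illustrative assumptions for a frozen-weights workload with many small allocations, not universal recommendations):

import os

# Illustrative settings for a frozen-graph inference workload.
os.environ["TF_ONEDNN_ASSUME_FROZEN_WEIGHTS"] = "true"  # inference only, weights frozen
os.environ["TF_ONEDNN_USE_SYSTEM_ALLOCATOR"] = "true"   # many small allocations

import tensorflow as tf  # import after setting the variables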


Notices and Disclaimers

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.