Improved Parallelization, Extended Deep Learning Capabilities in Intel® Distribution of OpenVINO™ Toolkit

ID 659594
Updated 3/25/2019
Version: Latest
Public

The latest release of the Intel® Distribution of OpenVINO™ toolkit (which stands for Open Visual Inference and Neural Network Optimization) unveils new features that improve parallelization, extend deep learning capabilities, and provide support for macOS*. Get a quick view of the major new enhancements, then learn more about the new parallelization capabilities that deliver optimal performance for multi-network scenarios.

New Features in 2019 R1

  • Supports 2nd generation Intel® Xeon® Processors (codenamed Cascade Lake) and provides performance speedup for inference through Intel® Deep Learning Boost (VNNI instruction set).
  • Extends support for macOS* on Intel CPUs for key toolkit components (model optimizer, inference engine, OpenCV*, and more).
  • Transitions parallelism schemes from OpenMP* to Threading Building Blocks (TBB) to provide increased performance in a multi-network scenario. The most common deployment pipelines run multiple network combinations, and TBB delivers optimal performance for these use cases. 
  • Adds support for many new operations in ONNX*, TensorFlow*, and Apache MXNet* frameworks. Provides optimized inference on topologies like Tiny YOLO* version 3, the full DeepLab* version 3, and bidirectional long short-term memory (LSTM) using the Deep Learning Deployment Toolkit.
  • Includes eight pretrained models for new use cases: gaze estimation, action recognition encoder, action recognition decoder, text recognition, and instance segmentation networks. (For fast, streamlined deep learning inference, the toolkit includes about 40 pretrained models.)
  • Supports binary weights to boost performance, and adds four binary models: ResNet 50, and face, person, and vehicle detection.
  • Updates the FPGA plugin to the Deep Learning Accelerator 2019 R1 with new bitstreams for the Intel® Vision Accelerator Design, Intel® Arria® 10 FPGA developer kit, and Intel® Programmable Acceleration Card with Intel® Arria® 10 GX FPGA. Supports automatic scheduling between multiple FPGA devices. 

Get more details in the Release Notes.

Parallelism Scheme Change Delivers Optimal Performance for Multi-network Scenarios

With the 2019 release of the Intel Distribution of OpenVINO toolkit, the parallelization library moved from OpenMP* to Intel Threading Building Blocks (TBB). The main reason for the change is that most of the toolkit’s users run complex multi-network pipelined applications (object detection followed by classification, for instance), where TBB has been shown to drastically improve performance over OpenMP. TBB also suits the toolkit naturally because the inference engine is implemented in C++ and relies on many of the language’s advanced programming idioms, such as templates. Handling of complex parallel patterns (not just loops), scalable nested parallelism, the flow graph interface, and data flow pipeline models are a few of the features that differentiate TBB from OpenMP.

With complex multi-network pipelined applications, TBB offers performance advantages through features such as the task scheduler’s work-stealing technique and its dynamic load scheduling algorithm. The end result for the customer is better processor utilization: cores do useful work without over-subscription (too many threads lead to overhead) and with minimal under-subscription (too few threads leave CPU cores sitting idle).

In the open source version of the OpenVINO™ toolkit, developers still have the option to compile with OpenMP. Moving forward, however, the toolkit embraces TBB as the default parallelization solution.

Performance Improvements Demonstrated for Multi-network Code Samples

Back at Intel, our developers observe that about half of the OpenVINO samples show substantial performance improvement, while the other half show degradation of as little as 1% and as much as 35%, with variations depending on hardware configuration. Samples that pipeline more than one network show a 30% to 200% performance increase for TBB over OpenMP, again depending on hardware configuration. The samples that degraded are single-network samples, which are the most prone to degradation after conversion to TBB; as mentioned above, single-network applications are the least commonly used by OpenVINO customers. TBB distributes work dynamically, which helps tremendously in truly parallel execution and keeps the CPU cores well used. OpenMP, on the other hand, distributes work statically once during program startup. OpenMP is therefore an ideal solution when all CPU cores are 100% available and dedicated to a task. That is why OpenMP is used for benchmarking, and why single-network (un-pipelined) OpenVINO samples perform better with OpenMP.
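To make the static versus dynamic distinction concrete, here is a minimal sketch in standard C++. It does not use TBB or OpenMP and is not how either runtime is implemented; it is a deterministic toy model in which each work item has a cost, and the helper names (`static_split`, `dynamic_split`, `makespan`) are hypothetical. The static version fixes each worker's slice up front, while the dynamic version hands the next item to whichever worker is least loaded at that moment.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model only: assumes workers >= 1 and non-negative costs.

// Static distribution: every worker gets one contiguous slice decided up front.
// Returns the total cost assigned to each worker.
std::vector<long> static_split(const std::vector<int>& costs, std::size_t workers) {
    std::vector<long> load(workers, 0);
    const std::size_t chunk = (costs.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(w * chunk, costs.size());
        const std::size_t end = std::min(begin + chunk, costs.size());
        for (std::size_t i = begin; i < end; ++i) load[w] += costs[i];
    }
    return load;
}

// Dynamic distribution: each item goes to the worker with the least work so far,
// which approximates "whoever becomes idle first takes the next item".
std::vector<long> dynamic_split(const std::vector<int>& costs, std::size_t workers) {
    std::vector<long> load(workers, 0);
    for (int cost : costs) {
        *std::min_element(load.begin(), load.end()) += cost;
    }
    return load;
}

// The slowest worker determines when the whole batch finishes.
long makespan(const std::vector<long>& load) {
    return *std::max_element(load.begin(), load.end());
}
```

With uneven costs such as four expensive items followed by four cheap ones, the static split leaves one worker with nearly all the work, while the dynamic split balances the two. With uniform costs both schemes produce the same balance, which is consistent with static distribution doing well when identical work runs on fully dedicated cores.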

Throughput Mode was introduced in OpenVINO 2018 R5, and TBB does not hurt its performance as long as the number of infer requests in flight matches the number of physical cores. TBB's thread startup costs are not pronounced in Throughput Mode, and its fine-grained threading plays well with Throughput Mode in most cases.

Enabling Throughput Mode in your app is as simple as:

        ExecutableNetwork executable_network = plugin.LoadNetwork(network, {{PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS, std::to_string(number_of_requests)}});

As we point out in our Performance Topics and CPU Plugin Configuration Parameters documentation, your app needs to provide enough parallel slack for Throughput Mode, and how much is enough is application specific, model specific, and CPU configuration specific. As mentioned above, you'll get the best results when the number of infer requests in flight matches the number of physical cores available. When every physical core processes a dedicated input, the amount of synchronization required with other threads is minimal.
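One way to apply this rule of thumb is sketched below. Both helpers are hypothetical and are not part of the Inference Engine API. The sketch assumes you already know the logical thread count and the number of hardware threads per core (for example, 2 with Hyper-Threading enabled), since standard C++ has no portable way to query physical cores.

```cpp
#include <algorithm>

// Hypothetical helper: estimate physical cores from the logical thread count.
// threads_per_core is an assumption supplied by the caller (often 1 or 2).
unsigned estimate_physical_cores(unsigned logical_threads, unsigned threads_per_core) {
    if (logical_threads == 0 || threads_per_core == 0) return 1;  // unknown: be conservative
    return std::max(1u, logical_threads / threads_per_core);
}

// Hypothetical helper: keep the number of in-flight infer requests between 1
// and the number of physical cores, so each request can have a dedicated core.
unsigned pick_request_count(unsigned requested, unsigned physical_cores) {
    if (physical_cores == 0) return 1;
    return std::max(1u, std::min(requested, physical_cores));
}
```

For example, `pick_request_count(8, estimate_physical_cores(8, 2))` yields 4, which could then be used as `number_of_requests` in the Throughput Mode snippet above. In practice the logical thread count might come from `std::thread::hardware_concurrency()`, which may return 0 when the value is unknown. Treat the result as a starting point and measure, because the best value remains model and application specific.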

Environment variables such as OMP_NUM_THREADS and KMP_AFFINITY, and the corresponding kmp_ and omp_ API calls, no longer have any effect and are not needed with TBB. A family of configuration parameters for the CPU plugin, such as KEY_CPU_THREADS_NUM, mimics the OpenMP tweaks, but of course they are TBB parameters, not OpenMP ones. Please refer to the TBB docs for details.
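As an illustration, the sketch below builds a string-to-string configuration map of the kind passed to LoadNetwork in the earlier snippet. The helper `build_cpu_config` is hypothetical, and the literal key strings are an assumption about what `PluginConfigParams::KEY_CPU_THREADS_NUM` and `PluginConfigParams::KEY_CPU_THROUGHPUT_STREAMS` expand to; in a real application, use the named constants from the Inference Engine headers rather than these literals.

```cpp
#include <map>
#include <string>

// Hypothetical helper: collect CPU plugin settings in one place.
// Key strings are assumed values of the KEY_CPU_* constants, used here only
// so the sketch compiles without the Inference Engine headers.
std::map<std::string, std::string> build_cpu_config(unsigned threads, unsigned streams) {
    std::map<std::string, std::string> config;
    // Replaces the role OMP_NUM_THREADS used to play: cap on CPU worker threads.
    config["CPU_THREADS_NUM"] = std::to_string(threads);
    // Number of throughput streams, as in the Throughput Mode snippet.
    config["CPU_THROUGHPUT_STREAMS"] = std::to_string(streams);
    return config;
}
```

A call such as `build_cpu_config(4, 2)` produces a two-entry map that could be passed as the second argument of `plugin.LoadNetwork(network, config)`. Keeping the settings in one helper makes it easier to compare configurations while benchmarking.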

The TBB switch is indeed a big change, but the OpenVINO team is fully committed to fixing any performance issues that arise in models that suffer degradation because of it.

Don’t hesitate to use the open source version of OpenVINO to run experiments with both OpenMP and TBB, and why not mix in Throughput Mode for good measure? Also use the OpenVINO benchmarking app to measure inference performance:

OpenVINO benchmark_app

As always, post your issues on the Intel Developer Zone OpenVINO forum or on the DLDT GitHub* and we’ll do our best to answer your questions!