Software AI Accelerators: AI Performance Boost for Free

Get the Latest on All Things CODE

author-image

作者

The exponential growth of data has fed AI's voracious appetite and led to its transformation from niche to omnipresent. An equally important aspect of this AI growth equation is the ever-expanding demands it places on computer system requirements to deliver higher AI performance. This has not only led to AI acceleration being incorporated into common chip architectures such as CPUs, GPUs, and FPGAs but it has also mushroomed a class of dedicated hardware AI accelerators specifically designed to accelerate artificial neural networks and machine learning applications. While these hardware accelerators can deliver impressive AI performance improvements, software AI accelerators are required to deliver even higher orders of magnitude AI performance gains across deep learning, classical machine learning, and graph analytics for the same hardware setup. What’s more is that this AI performance boost driven by software optimizations is free, requiring almost no code changes or developer time and no additional hardware costs.

Let us try to visualize the scope of the cost savings that can be realized through the 10x–100x performance gains that can be realized through software AI acceleration. For example, many of the leading streaming media services have tens of thousands of hours of available content. They might want to use image classification and object detection algorithms for content moderation, text identification, and celebrity recognition. The classification criteria might also be different by country based on local customs and government regulations, and the process might need to be repeated for approximately 10% of the content every month based on new programs and rule changes. Using the list prices of running these AI algorithms on the leading cloud service providers, even a 10x gain in performance through software AI accelerators can lead to approximate cost savings of millions of dollars a month.†

Similar cost savings can also be realized for other AI services, such as automatic caption generation and recommendation engines, and of course the savings are even higher for the use cases with 100x performance improvements. While your particular AI workload might be markedly smaller, your savings could still be rather significant.

Software determines the ultimate performance of computing platforms and software acceleration is therefore key to enabling AI Everywhere with applications across entertainment, telecommunication, automotive, healthcare, and more.

What Is a Software AI Accelerator and How Does It Compare to a Hardware AI Accelerator?

A software AI accelerator is a term used to refer to the AI performance improvements that can be achieved through software optimizations for the same hardware configuration. A software AI accelerator can make platforms over 10x–100x faster across a variety of applications, models, and use-cases.

The increasing diversity of AI workloads has necessitated a business demand for a variety of AI-optimized hardware architectures. These can be classified into three main categories: AI-accelerated CPU, AI-accelerated GPU, and dedicated hardware AI accelerators. We see multiple examples of all three of these hardware categories in the market, for example, Intel® Xeon® CPUs with Intel® Deep Learning Boost, Apple* CPUs with a neural engine, NVIDIA* GPUs with tensor cores, Google Tensor Processing Unit* integrated circuit, AWS Inferentia*, Intel® Gaudi® AI accelerators, and many others that are under development by a combination of traditional hardware companies, cloud service providers, and AI startups.

While AI hardware has continued to take tremendous strides, the growth rate of AI model complexity far outstrips hardware advancements. About three years ago, a natural language AI model like ELMo had "just" 94 million parameters whereas this year, the largest models reached over one trillion parameters. The exponential growth of AI means that even 1,000x increases in computing performance can be easily consumed to solve ever more complex and interesting use cases. Solving the world’s problems and getting to the holy grail of AI Everywhere is therefore only possible through the orders of magnitude performance enhancements driven by software AI accelerators.

While hardware acceleration is like updating your bike to have the latest and greatest features, software acceleration is more like having a completely re-envisioned mode of travel such as a supersonic jet.

Figure 1. An illustration of hardware acceleration compared to software acceleration.

This article specifically lays out the performance data of software AI accelerators on Intel Xeon CPUs. However, we believe that similar magnitudes of performance improvements can be achieved on other AI platforms from AI-accelerated CPUs and GPUs to dedicated hardware AI accelerators. We intend to share performance data for our other platforms in future articles but we also welcome other vendors to share their software acceleration results.

AI Software Ecosystem: Performant, Productive, and Open

As AI use cases and workloads continue to grow and diversify across vision, speech, recommender systems, and more, Intel’s goal has been to provide an unparalleled AI development and deployment ecosystem that makes it as seamless as possible for every developer, data scientist, researcher, and data engineer to accelerate their AI journey from the edge to the cloud.

Figure 2. An illustration of the Intel software ecosystem.

We believe that an end-to-end AI software ecosystem, built on the foundation of an open, standards-based, interoperable programming model, is key to scaling AI and data science projects into production. This core tenet forms the foundation of our three-pronged AI strategy:

  1. Build upon the broad AI software ecosystem: It is critical for us to embrace the current AI software ecosystem. We want everyone to use the software that they are familiar with in deep learning, machine learning, and data analytics, for example, from TensorFlow* and PyTorch*, scikit-learn* and XGBoost, to Ray framework and Apache Spark*. We have heavily optimized these frameworks and libraries to help increase their performance by orders of magnitude on Intel platforms designed to deliver drop-in 10x–100x software AI acceleration.
  2. Implement an end-to-end data science and AI workflow: We want to innovate and deliver a rich suite of optimized tools for all your AI needs, including data preparation, training, inference, deployment, and scaling. Examples include AI Tools to accelerate end-to-end data science and machine learning pipelines, Intel® Distribution of OpenVINO™ toolkit to deploy high-performance inference applications from device to cloud, and BigDL to seamlessly scale your AI models to big data clusters with thousands of nodes for distributed training or inference.
  3. Deliver unmatched productivity and performance: We provide tools for deployment across diverse AI hardware by being built on the foundation of an open, standards-based, unified oneAPI programming model and constituent libraries. The multitude of hardware AI architectures in the market today, each with a separate software stack, make for an inefficient and unscalable approach for the developer ecosystem. The oneAPI industry initiative encourages cross-industry collaboration on the oneAPI specification to deliver a common developer experience across all accelerator architectures.

Software AI Accelerators in Deep Learning, Machine Learning, and Graph Analytics

Let us delve deeper into the first prong of our three-pronged AI strategy: software AI accelerators. Our extensive software optimization work provides a simple way for data scientists to efficiently implement their algorithms, which consist of graphs of operations or kernels. Our libraries and tools provide both kernel optimizations for individual operations (for example, effective use of Single Instruction Multiple Data [SIMD] registers, vectorization, cache-friendly data access when implementing convolution) and graph-level optimizations (using techniques such as batchnorm folding, conv/ReLU fusion and Conv/Sum fusion) across operations.

While some of you may find the implementation details interesting, we’ve done the heavy work for abstracting these optimizations for developers, so they don’t need to deal with the intricacies. Whether in deep learning, machine learning, or graph analytics, these Intel optimizations are designed for vast performance gains.

Deep Learning

Intel software optimizations through the Intel® oneAPI Deep Neural Network Library (oneDNN) deliver orders of magnitude performance gains to several popular deep learning frameworks and most of the optimizations have already been up-streamed into the default framework distributions. However, for TensorFlow and PyTorch we also maintain separate extensions from Intel as a buffer for advanced optimizations not yet up-streamed.
 

  • TensorFlow: Intel optimizations deliver a 16x gain in image classification inference and a 10x gain for object detection. The baseline is stock TensorFlow with basic Intel optimizations, upstreamed to functions in the TensorFlow Eigen library.

    Figure 3. Intel® Xeon® Platinum 8380 processor: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots, 64 GB, 3200) total DDR4 memory, uCode 0xd000280, hyperthreading: on, turbo: on, Ubuntu* 20.04.1 LTS, 5.4.0-73-generic1, Intel® SSD Optimizer Software drive, 900 GB; ResNet-50* v1.5, FP32/int8, BS=128; SSD MobileNet v1, FP32/int8, BS=448. Software: TensorFlow 2.4.0 for FP32 and TensorFlow optimizations from Intel (with an Intel® C++ Compiler base) for both FP32 and int8, test by Intel on May 12, 2021.
    Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.
     
  • PyTorch: Intel optimizations deliver a 53x gain for image classification and nearly 5x gain for a recommendation system. We have upstreamed most of our optimizations with oneDNN into PyTorch while also maintaining a separate Intel® Optimization for PyTorch* as a buffer for advanced optimizations that are not yet upstreamed. So, for this comparison, we created a new baseline by keeping PyTorch with only basic Intel optimizations without oneDNN.

    Figure 4. Intel Xeon Platinum 8380 processor: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots, 64 GB, 3200) total DDR4 memory, uCode 0xd000280, hyperthreading: on, turbo: on, Ubuntu 20.04.1 LTS, 5.4.0-73-generic1, Intel® SSD Optimizer Software drive 900 GB; ResNet-50 v1.5, FP32/int8, BS=128; deep learning recommendation model (DLRM), FP32/int8, BS=16. Software: PyTorch v1.5 without a Deep Neural Network Library (DNNL) build for FP32 and PyTorch v1.5 with Intel® Extension for PyTorch* (with Intel C++ Compiler) for both FP32 and int8, test by Intel on May 12, 2021.
    Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.

  • Apache MXNet*: Intel optimizations deliver 815x and 500x gains for image classification. The situation for MXNet is also different from TensorFlow and PyTorch. We upstreamed all our optimizations with oneDNN. So, for this comparison we created a new baseline without any Intel optimizations.

    Figure 5. Intel Xeon Platinum 8380 processor: 1-node, 2x Intel Xeon Platinum 8380 processor with 1 TB (16 slots, 64 GB, 3200) total DDR4 memory, uCode 0xd000280, hyperthreading: on, turbo: on, Ubuntu 20.04.1 LTS, 5.4.0-73-generic1, Intel® SSD Optimizer Software drive 900 GB; ResNet-50 v1, FP32/int8, BS=128; MobileNet v2, FP32/int8, BS=128. Software: MXNet 2.0.0.alpha without a DNNL build for FP32 and MXNet 2.0.0.alpha for both FP32 and int8, tested by Intel on May 12, 2021.
    Results may vary. For workloads and configurations visit www.Intel.com/PerformanceIndex.

Machine Learning

Scikit-learn is a popular machine learning software library for Python*. It features various classification, regression, and clustering algorithms including support vector machines, random forests, gradient boosting, and K-means. We were able to enhance performance of these popular algorithms significantly by up to 100x-200x. These performance gains are available through the Intel® Extension for Scikit-learn* and the Intel® oneAPI Data Analytics Library (oneDAL).

Figure 6. Intel Xeon Platinum 8276L CPU at 2.20 GHz, 2 sockets, 28 cores per socket. For workloads and configurations visit www.Intel.com/PerformanceIndex. Details: Accelerate Your scikit-learn Applications, Save Time and Money with Intel Extension for Scikit-learn, and Use Intel Optimizations in scikit-learn.

Graph Analytics

Graph analytics refers to algorithms used to explore the strength and direction of relationships among entries in large graph databases such as social networks, the internet, X*, and Wikipedia. Examples of widely used graph analytics algorithms include single-source shortest path, breadth-first search, connected components, page rank, betweenness centrality, and triangle counting. As an example, Intel optimizations show significant improvement with the Triangle Counting Algorithm. The level of optimization increases as graphs get larger, with the largest performance gains of 166x for the largest graphs approaching 50 million vertices and 1.8 billion edges. For a more complete overview of Intel optimizations for several other graph analytics algorithms, see Measure Graph Analytics Performance.

Figure 7. Intel Xeon Platinum 8280 CPU at 2.70 GHz, 2 × 28 cores, hyperthreading: on; For workloads and configurations visit www.Intel.com/PerformanceIndex. Dataset: Stanford Large Network Dataset Collection

AI Everywhere: Applications for Software AI Acceleration

To solve a problem with AI requires an end-to-end workflow. We start from data and each use case has its unique AI data pipeline. An AI practitioner will have to ingest data, preprocess by feature engineering sometimes using machine learning, train the model using deep learning or machine learning, and then deploy the model. AI Tools provides high-performance APIs and Python packages to accelerate all phases of these pipelines and achieve big speed-ups through software AI acceleration. For an in-depth look at two real-world examples (US Census and the Photometric LSST Astronomical Time-series Classification Challenge [PLAsTiCC]) where AI Tools helps data scientists accelerate their AI pipelines, see Performance Optimizations for End-to-End Pipelines.

While we have seen that software AI accelerators are already delivering performance improvements critical to the growth of AI and its promulgation to every domain and use case, we have opportunities to do even more going forward. Intel is working on compiler technologies, memory optimizations, and distributed compute to drive further software AI acceleration. There are also opportunities for the entire AI software community to work together to truly unleash the power of software AI accelerators with Intel and other hardware vendors spearheading low-level software and framework optimizations and software vendors leading the higher-level optimizations, which can then come together with an industry-standard intermediate representation.

We would also like to encourage AI system builders to place a greater emphasis on software and for developers and practitioners to be relentless in their pursuit for AI performance acceleration opportunities:
 

  1. Always use the latest versions of deep learning and machine learning frameworks (TensorFlow, PyTorch, MXNet, XGBoost, scikit-learn, and others) that already have many of the intel optimizations upstreamed.
  2. For even greater performance, use the frameworks and tools from Intel that include all the latest optimizations and are fully compatible with your existing workflows.

Learn more about the drop-in framework optimizations and performance-optimized end-to-end tools that make up the AI software from Intel and supercharge your AI workflow.

Software AI accelerators in concert with continued hardware AI acceleration can finally get us to a future with AI Everywhere and a world that is smarter, more connected.

† This calculation is an approximation based on publicly available information regarding (1) Hours of streaming content and countries of operation for leading streaming providers such as but not limited to Netflix*, Amazon Prime*, Disney*, and others, and (2) Cost of using computer vision and natural language processing (NLP) AI services on leading US CSPs including but not limited to AWS, Microsoft Azure*, and Google Cloud*. This estimation is only intended to be used as an illustration of the scope of the problem and potential cost savings and Intel doesn’t guarantee its accuracy. Your costs and results may vary.

 

You May Also Like

Related Articles

Speeding Up the Databricks* Runtime for Machine Learning

One-Line Code Changes to Boost pandas, scikit-learn, and TensorFlow Performance

Related Videos

AI Analytics Part 1: Optimize End-to-End Data Science and Machine Learning Acceleration

AI Analytics Part 2: Enhance Deep Learning Workloads on 3rd Generation Intel Xeon Scalable Processors

AI Analytics Part 3: Walk through the Steps to Optimize End-to-End Machine Learning Workflows