Intel® Neural Compressor
Speed Up AI Inference without Sacrificing Accuracy
Deploy More Efficient Deep Learning Models
Intel® Neural Compressor performs model compression to reduce the model size and increase the speed of deep learning inference for deployment on CPUs or GPUs. This open source Python* library automates popular model compression technologies, such as quantization, pruning, and knowledge distillation across multiple deep learning frameworks.
Using this library, you can:
- Converge quickly on quantized models through automatic, accuracy-driven tuning strategies (see the quantization sketch after this list).
- Prune model weights by specifying predefined sparsity goals that drive pruning algorithms.
- Distill knowledge from a larger network (“teacher”) to train a smaller network (“student”) to mimic its performance with minimal precision loss.
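As a rough illustration of this accuracy-driven flow, the sketch below assumes the 2.x Python API (quantization.fit with a PostTrainingQuantConfig); the ResNet-18 model, synthetic calibration data, and placeholder evaluation function are illustrative stand-ins rather than part of the product documentation.

```python
import torchvision
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader, Datasets

# FP32 model to compress (random weights here; a real workload loads trained weights).
fp32_model = torchvision.models.resnet18(weights=None)

# Built-in "dummy" dataset stands in for a real calibration set.
calib_dataloader = DataLoader(
    framework="pytorch",
    dataset=Datasets("pytorch")["dummy"](shape=(32, 3, 224, 224)),
)

def eval_func(model):
    # Placeholder: a real evaluation would return top-1 accuracy on a validation set.
    return 1.0

# Accuracy-driven tuning: fit() tries quantization recipes and keeps the first
# one whose evaluation result stays within the configured accuracy criterion.
q_model = quantization.fit(
    model=fp32_model,
    conf=PostTrainingQuantConfig(),
    calib_dataloader=calib_dataloader,
    eval_func=eval_func,
)
q_model.save("./quantized_resnet18")
```

In a real workload, eval_func returns validation accuracy so the tuner can compare each candidate model against the FP32 baseline.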
Download as Part of the Toolkit
Intel Neural Compressor is available in the Intel® AI Analytics Toolkit (AI Kit), which provides accelerated machine learning and data analytics pipelines with optimized deep learning frameworks and high-performing Python libraries.
Download the Stand-Alone Version
A stand-alone download of Intel Neural Compressor is available. You can download binaries from Intel or choose your preferred repository.
Develop in the Free Intel® Cloud
Get what you need to build and optimize your oneAPI projects for free. With an Intel® Developer Cloud account, you get 120 days of access to the latest Intel® hardware—CPUs, GPUs, FPGAs—and Intel® oneAPI tools and frameworks. No software downloads. No configuration steps. No installations.
Features
Model Compression Techniques
- Quantize data and computation to INT8, BF16, or a mixture of FP32, BF16, and INT8 to reduce model size and speed up inference while minimizing precision loss. Quantize during training, after training, or dynamically based on the runtime data range.
- Prune parameters that have minimal effect on accuracy to reduce the size of a network. Discard weights in structured or unstructured sparsity patterns, or remove filters or layers according to specified rules (see the pruning sketch after this list).
- Distill knowledge from a teacher network to a student network to improve the accuracy of the compressed model.
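A minimal training-loop sketch of pruning toward a sparsity goal, assuming the 2.x prepare_compression callback API; the toy model, random data, sparsity target, and step counts are placeholders. Knowledge distillation follows a similar callback-based flow with a DistillationConfig.

```python
import torch
from neural_compressor.config import WeightPruningConfig
from neural_compressor.training import prepare_compression

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

# Predefined sparsity goal and pattern that drive the pruning algorithm.
config = WeightPruningConfig(
    target_sparsity=0.8, pattern="4x1", start_step=0, end_step=100
)
compression_manager = prepare_compression(model, config)
compression_manager.callbacks.on_train_begin()

for step in range(100):
    x = torch.randn(16, 128)          # random stand-in for a training batch
    y = torch.randint(0, 10, (16,))
    compression_manager.callbacks.on_step_begin(step)
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    compression_manager.callbacks.on_before_optimizer_step()
    optimizer.step()
    compression_manager.callbacks.on_after_optimizer_step()
    compression_manager.callbacks.on_step_end()

compression_manager.callbacks.on_train_end()
```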
Built-in Strategies
- Automatically optimize models using recipes of model compression techniques until the performance objectives and expected accuracy criteria are met, as sketched below.
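For example, the accuracy budget and search bounds that guide those strategies can be expressed with the 2.x config classes; the sketch below is illustrative, and the 1% relative-loss tolerance, trial limit, and timeout are made-up values.

```python
from neural_compressor.config import (
    AccuracyCriterion,
    PostTrainingQuantConfig,
    TuningCriterion,
)

conf = PostTrainingQuantConfig(
    # Accept at most a 1% accuracy drop relative to the FP32 baseline.
    accuracy_criterion=AccuracyCriterion(criterion="relative", tolerable_loss=0.01),
    # Bound the automatic search: at most 50 trials or 30 minutes of tuning.
    tuning_criterion=TuningCriterion(max_trials=50, timeout=1800),
)
# Pass `conf` to quantization.fit() together with an evaluation function,
# as in the earlier quantization sketch.
```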
APIs for TensorFlow*, PyTorch*, Apache MXNet*, and Open Neural Network Exchange Runtime (ONNXRT) Frameworks
- Get started quickly with built-in DataLoaders for popular industry datasets, or register your own dataset (see the data-loading sketch after this list).
- Preprocess input data using built-in methods such as resize, crop, normalize, transpose, flip, pad, and more.
- Configure model objectives and evaluation metrics without writing framework-specific code.
- Analyze the graph and tensor after each tuning run with TensorBoard*.
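A short sketch of the built-in data objects, assuming the 2.x neural_compressor.data API; the synthetic "dummy" dataset and the TensorFlow backend are illustrative choices, and a real pipeline would plug in its own dataset and transforms.

```python
from neural_compressor.data import DataLoader, Datasets

# Built-in synthetic dataset: 100 samples shaped like 224x224 RGB images, with labels.
dataset = Datasets("tensorflow")["dummy"](shape=(100, 224, 224, 3), label=True)

# Framework-aware DataLoader that can feed calibration or evaluation loops.
dataloader = DataLoader(framework="tensorflow", dataset=dataset, batch_size=10)

for images, labels in dataloader:
    print(images.shape, labels.shape)   # one batch of inputs and labels
    break
```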
Case Studies
Accelerating Alibaba* Transformer Model Performance
Alibaba Group* and Intel collaborated to explore and deploy INT8 AI models on platforms based on 3rd generation Intel® Xeon® Scalable processors.
CERN Uses Intel® Deep Learning Boost and oneAPI to Juice Inference without Accuracy Loss
Researchers at CERN accelerated inference nearly twofold by using reduced precision, without compromising accuracy.
A 3D Digital Face Reconstruction Solution Enabled by 3rd Generation Intel® Xeon® Scalable Processors
By quantizing the Position Map Regression Network from an FP32-based inference down to int8, Tencent Games* improved inference efficiency and provided a practical solution for 3D digital face reconstruction.
Demonstrations
Quantize ONNX* Models
Learn how to quantize MobileNet* v2 in the ONNX* format using Intel Neural Compressor, and see accuracy versus performance results for a variety of ONNX-based models.
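A minimal sketch of that workflow, assuming the 2.x API accepts a path to an ONNX model file; the "mobilenetv2.onnx" file name, the random calibration data, and the output path are placeholders.

```python
import numpy as np
from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader

class RandomCalibData:
    """Toy calibration set: 32 random images in NCHW layout."""
    def __init__(self):
        self.samples = np.random.rand(32, 3, 224, 224).astype(np.float32)
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx], 0   # (input, dummy label)

calib_dataloader = DataLoader(framework="onnxruntime", dataset=RandomCalibData())

q_model = quantization.fit(
    model="mobilenetv2.onnx",                     # path to the FP32 ONNX model
    conf=PostTrainingQuantConfig(approach="static"),
    calib_dataloader=calib_dataloader,
)
q_model.save("mobilenetv2_int8.onnx")
```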
AI Inference Acceleration on CPUs
Deploying a trained model for inference often requires modification, optimization, and simplification based on where it is being deployed. This overview of Intel’s end-to-end solution includes a downloadable neural style transfer demonstration.
Accelerate AI Inference without Sacrificing Accuracy
This webinar provides an overview of available model compression techniques and demonstrates an end-to-end quantization workflow.
Documentation & Code Samples
Code Samples
Specifications
Processor:
- Intel® Xeon® processors
- Intel Xeon Scalable processors
- Intel® Arc™ GPUs
Operating systems:
- Linux*
- Windows*
Languages:
- Python
Get Help
Your success is our success. Access this support resource when you need assistance.
For additional help, see the general oneAPI Support.
Related Products
Stay in the Know with All Things CODE
Sign up to receive the latest trends, tutorials, tools, training, and more to help you write better code optimized for CPUs, GPUs, FPGAs, and other accelerators—stand-alone or in any combination.