Performance Data for Intel® AI Data Center Products

Find the latest AI benchmark performance data for Intel Data Center products, including detailed hardware and software configurations.

Pretrained models, sample scripts, best practices, and tutorials

Measurements were taken using:

4th Generation Intel® Xeon® Scalable Processors

Intel® Xeon® Platinum 8480+ Processor (56 Cores)

Deep Learning Inference

Framework Version	Model	Usage	Precision	Throughput	Perf/Watt	Latency (ms)	Batch size	Config*
Intel PyTorch 2.1 DeepSpeed	GPT-J 6B Token size 1024/128	text-generation, Beam Search, Width=4	int8			40	1	1 instance per socket
Intel PyTorch 2.1 DeepSpeed	GPT-J 6B Token size 1024/128	text-generation, Beam Search, Width=4	int8	130.4 tokens/s		92	6	1 instance per socket
Intel PyTorch 2.1 DeepSpeed	GPT-J 6B Token size 1024/128	text-generation, Beam Search, Width=4	bf16			59.5	1	1 instance per socket
Intel PyTorch 2.1 DeepSpeed	GPT-J 6B Token size 1024/128	text-generation, Beam Search, Width=4	bf16	125 tokens/s		96	6	1 instance per socket
MLPerf Inference v3.1	GPT-J (offline, 99.0% acc)	Large Language Model	int8	2.05 samp/s			7	4 cores per instance
Intel PyTorch 2.1 DeepSpeed	LLaMA2-7B Token size 1024/128	text-generation, Beam Search, Width=4	int8			47	1	1 instance per socket
Intel PyTorch 2.1 DeepSpeed	LLaMA2-7B Token size 1024/128	text-generation, Beam Search, Width=4	int8	111.6 tokens/s		107.5	6	1 instance per socket
Intel PyTorch 2.1 DeepSpeed	LLaMA2-7B Token size 1024/128	text-generation, Beam Search, Width=4	bf16			68	1	1 instance per socket
Intel PyTorch 2.1 DeepSpeed	LLaMA2-7B Token size 1024/128	text-generation, Beam Search, Width=4	bf16	109.1 tokens/s		110	6	1 instance per socket
MLPerf Inference v3.1	ResNet50 v1.5 (offline)	Image Recognition	int8	20,565.5 samp/s			256	1 core per instance
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	int8	10,215.7 img/s	9.98		1	4 cores per instance
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	int8	13,862.96 img/s	14.09		116	1 instance per socket
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	bf16	6,210.69 img/s	6.13		1	4 cores per instance
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	bf16	7,295.63 img/s	7.33		116	1 instance per socket
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	fp32	1,319.52 img/s	1.27		1	4 cores per instance
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	fp32	1,360.05 img/s	1.28		116	1 instance per socket
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	bf32	1,659.37 img/s	1.65		1	4 cores per instance
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	bf32	1,985.26 img/s	2.02		116	1 instance per socket
Intel TensorFlow 2.14	ResNet50 v1.5	Image Recognition	int8	7,440.61 img/s	7.70		1	4 cores per instance
Intel TensorFlow 2.14	ResNet50 v1.5	Image Recognition	int8	12,345.54 img/s	11.80		116	1 instance per socket
Intel TensorFlow 2.14	ResNet50 v1.5	Image Recognition	bf16	5,053.76 img/s	5.01		1	4 cores per instance
Intel TensorFlow 2.14	ResNet50 v1.5	Image Recognition	bf16	6,704.17 img/s	6.34		116	1 instance per socket
Intel TensorFlow 2.14	ResNet50 v1.5	Image Recognition	fp32	1,282.77 img/s	1.17		1	4 cores per instance
Intel TensorFlow 2.14	ResNet50 v1.5	Image Recognition	fp32	1,342.91 img/s	1.27		116	1 instance per socket
Intel TensorFlow 2.14	ResNet50 v1.5	Image Recognition	bf32	1,529.49 img/s	1.41		1	4 cores per instance
Intel TensorFlow 2.14	ResNet50 v1.5	Image Recognition	bf32	2,017.54 img/s	1.89		116	1 instance per socket
OpenVINO 2023.2	ResNet50 v1.5	Image Recognition	int8	8,819.657 img/s	8.81		1	4 cores per instance
OpenVINO 2023.2	ResNet50 v1.5	Image Recognition	bf16	5,915.793 img/s	5.82		1	4 cores per instance
OpenVINO 2023.2	ResNet50 v1.5	Image Recognition	fp32	1,281.337 img/s	1.25		1	4 cores per instance
MLPerf Inference v3.1	BERT-Large (offline, 99.0% acc)	Natural Language Processing	int8	1,357.33 samp/s			1,300	4 cores per instance
Intel PyTorch 2.1	BERTLarge	Natural Language Processing	int8	335.1 sent/s	0.35		1	4 cores per instance
Intel PyTorch 2.1	BERTLarge	Natural Language Processing	int8	378.73 sent/s	0.36		56	1 instance per socket
Intel PyTorch 2.1	BERTLarge	Natural Language Processing	bf16	204.52 sent/s	0.21		1	4 cores per instance
Intel PyTorch 2.1	BERTLarge	Natural Language Processing	bf16	201.44 sent/s	0.21		16	1 instance per socket
Intel PyTorch 2.1	BERTLarge	Natural Language Processing	fp32	35.25 sent/s	0.03		1	4 cores per instance
Intel PyTorch 2.1	BERTLarge	Natural Language Processing	fp32	41.05 sent/s	0.04		56	1 instance per socket
Intel PyTorch 2.1	BERTLarge	Natural Language Processing	bf32	72.42 sent/s	0.07		1	4 cores per instance
Intel PyTorch 2.1	BERTLarge	Natural Language Processing	bf32	71.63 sent/s	0.07		16	1 instance per socket
Intel TensorFlow 2.14	BERTLarge	Natural Language Processing	int8	253.27 sent/s	0.24		1	4 cores per instance
Intel TensorFlow 2.14	BERTLarge	Natural Language Processing	int8	239.89 sent/s	0.25		16	1 instance per socket
Intel TensorFlow 2.14	BERTLarge	Natural Language Processing	bf16	181.02 sent/s	0.18		1	4 cores per instance
Intel TensorFlow 2.14	BERTLarge	Natural Language Processing	bf16	184.06 sent/s	0.17		128	1 instance per socket
Intel TensorFlow 2.14	BERTLarge	Natural Language Processing	fp32	44.73 sent/s	0.04		1	4 cores per instance
Intel TensorFlow 2.14	BERTLarge	Natural Language Processing	fp32	38.58 sent/s	0.04		16	1 instance per socket
Intel TensorFlow 2.14	BERTLarge	Natural Language Processing	bf32	72.78 sent/s	0.07		1	4 cores per instance
Intel TensorFlow 2.14	BERTLarge	Natural Language Processing	bf32	71.77 sent/s	0.07		16	1 instance per socket
OpenVINO 2023.2	BERTLarge	Natural Language Processing	int8	298.44 sent/s	0.30		1	4 cores per instance
OpenVINO 2023.2	BERTLarge	Natural Language Processing	int8	285.68 sent/s	0.28		48	1 instance per socket
OpenVINO 2023.2	BERTLarge	Natural Language Processing	bf16	202.48 sent/s	0.20		1	4 cores per instance
OpenVINO 2023.2	BERTLarge	Natural Language Processing	bf16	191.2533 sent/s	0.19		32	1 instance per socket
OpenVINO 2023.2	BERTLarge	Natural Language Processing	fp32	47.33667 sent/s	0.05		1	4 cores per instance
OpenVINO 2023.2	BERTLarge	Natural Language Processing	fp32	44.23333 sent/s	0.04		48	1 instance per socket
MLPerf Inference v3.1	DLRM-v2 (offline, 99.0% acc)	Recommender	int8	5,367.77 samp/s			300	1 core per instance
Intel PyTorch 2.1	DLRM Criteo Terabyte	Recommender	int8	23,444,587 rec/s	23611.92		128	1 instance per socket
Intel PyTorch 2.1	DLRM Criteo Terabyte	Recommender	bf16	10,646,560 rec/s	10238.88		128	1 instance per socket
Intel PyTorch 2.1	DLRM Criteo Terabyte	Recommender	fp32	2,278,228 rec/s	2220.37		128	1 instance per socket
Intel PyTorch 2.1	DLRM Criteo Terabyte	Recommender	bf32	4,530,200 rec/s	4427.38		128	1 instance per socket
Intel PyTorch 2.1	DistilBERT	Natural Language Processing	int8	4,726.15 sent/s	4.94		1	4 cores per instance
Intel PyTorch 2.1	DistilBERT	Natural Language Processing	int8	7,759.25 sent/s	8.42		168	1 instance per socket
Intel PyTorch 2.1	DistilBERT	Natural Language Processing	bf16	3,306.46 sent/s	3.35		1	4 cores per instance
Intel PyTorch 2.1	DistilBERT	Natural Language Processing	bf16	5,057.47 sent/s	5.50		120	1 instance per socket
Intel PyTorch 2.1	DistilBERT	Natural Language Processing	fp32	900.58 sent/s	0.85		1	4 cores per instance
Intel PyTorch 2.1	DistilBERT	Natural Language Processing	fp32	1,007.05 sent/s	1.04		56	1 instance per socket
Intel PyTorch 2.1	DistilBERT	Natural Language Processing	bf32	1,513.66 sent/s	1.49		1	4 cores per instance
Intel PyTorch 2.1	DistilBERT	Natural Language Processing	bf32	1,926.1 sent/s	1.77		288	1 instance per socket
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	int8	61.03 sent/s	0.06		1	4 cores per instance
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	int8	245.66 sent/s	0.24		448	1 instance per socket
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	bf16	41.44 sent/s	0.04		1	4 cores per instance
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	bf16	278.81 sent/s	0.28		448	1 instance per socket
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	fp32	20.27 sent/s	0.02		1	4 cores per instance
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	fp32	102.48 sent/s	0.10		448	1 instance per socket
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	bf32	20.28 sent/s	0.02		1	4 cores per instance
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	bf32	114.08 sent/s	0.11		448	1 instance per socket
OpenVINO 2023.2	3D-Unet	Image Segmentation	int8	24.68333 samp/s	0.02		1	4 cores per instance
OpenVINO 2023.2	3D-Unet	Image Segmentation	int8	21.85667 samp/s	0.02		6	1 instance per socket
OpenVINO 2023.2	3D-Unet	Image Segmentation	bf16	13.05333 samp/s	0.01		1	4 cores per instance
OpenVINO 2023.2	3D-Unet	Image Segmentation	bf16	11.87 samp/s	0.01		6	1 instance per socket
OpenVINO 2023.2	3D-Unet	Image Segmentation	fp32	2.883333 samp/s	0.00		1	4 cores per instance
OpenVINO 2023.2	3D-Unet	Image Segmentation	fp32	2.62 samp/s	0.00		6	1 instance per socket
OpenVINO 2023.2	SSD-ResNet34 COCO 2017 (1200 x1200)	Object Detection	int8	459.3633 img/s	0.44		1	4 cores per instance
OpenVINO 2023.2	SSD-ResNet34 COCO 2017 (1200 x1200)	Object Detection	bf16	218.4133 img/s	0.20		1	4 cores per instance
OpenVINO 2023.2	SSD-ResNet34 COCO 2017 (1200 x1200)	Object Detection	fp32	31.17333 img/s	0.03		1	4 cores per instance
Intel PyTorch 2.1	ResNeXt101 32x16d ImageNet	Image Classification	int8	1,289.95 fps	1.35		1	4 cores per instance
Intel PyTorch 2.1	ResNeXt101 32x16d ImageNet	Image Classification	int8	1,923.77 fps	1.83		116	1 instance per socket
Intel PyTorch 2.1	ResNeXt101 32x16d ImageNet	Image Classification	bf16	648.58 fps	0.66		1	4 cores per instance
Intel PyTorch 2.1	ResNeXt101 32x16d ImageNet	Image Classification	bf16	867.05 fps	0.87		64	1 instance per socket
Intel PyTorch 2.1	ResNeXt101 32x16d ImageNet	Image Classification	fp32	151.29 fps	0.14		1	4 cores per instance
Intel PyTorch 2.1	ResNeXt101 32x16d ImageNet	Image Classification	fp32	160.93 fps	0.15		64	1 instance per socket
Intel PyTorch 2.1	ResNeXt101 32x16d ImageNet	Image Classification	bf32	215.11 fps	0.21		1	4 cores per instance
Intel PyTorch 2.1	ResNeXt101 32x16d ImageNet	Image Classification	bf32	241.98 fps	0.22		116	1 instance per socket
MLPerf Inference v3.1	RetinaNet (offline)	Object Detection	int8	284.75 samp/s			2	4 cores per instance
MLPerf Inference v3.1	RNN-T (offline)	Speech-to-text	int8+bf16	5,782.18 samp/s			256	4 cores per instance
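
The throughput rows above are easiest to compare as ratios against the fp32 row for the same framework, model, and batch size. The following is a minimal sketch, not part of the published data: it copies the four Intel PyTorch 2.1 ResNet50 v1.5 rows at batch size 1 and computes each precision's throughput relative to fp32. The helper names `parse_throughput` and `speedups_vs_baseline` are hypothetical and chosen only for illustration.

```python
# Rows copied from the inference table above (Intel PyTorch 2.1, ResNet50 v1.5,
# batch size 1, 4 cores per instance). Columns: precision <TAB> throughput.
ROWS = """\
int8\t10,215.7 img/s
bf16\t6,210.69 img/s
bf32\t1,659.37 img/s
fp32\t1,319.52 img/s
"""


def parse_throughput(cell: str) -> float:
    """Turn a cell such as '10,215.7 img/s' into the float 10215.7."""
    number = cell.split()[0]           # drop the unit
    return float(number.replace(",", ""))  # drop thousands separators


def speedups_vs_baseline(rows: str, baseline: str = "fp32") -> dict:
    """Return throughput of each precision divided by the baseline precision."""
    table = {}
    for line in rows.strip().splitlines():
        precision, cell = line.split("\t")
        table[precision] = parse_throughput(cell)
    base = table[baseline]
    return {precision: value / base for precision, value in table.items()}


if __name__ == "__main__":
    for precision, ratio in speedups_vs_baseline(ROWS).items():
        print(f"{precision}: {ratio:.2f}x fp32 throughput")
```

Ratios computed this way only make sense within one framework, model, batch size, and instance configuration; mixing rows from different configurations would compare unlike measurements.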

Training

MLPerf* Training

Framework Version	Model/Dataset	Usage	Precision	Sockets / Nodes	Total Time to Train (min)	Throughput	Batch Size
MLPerf Training v3.1	BERT-Large (seq len=512) / Wikipedia 2020/01/01	Natural Language Processing	bf16	32 Sockets / 16 Nodes	47.93		3,072
MLPerf Training v3.1	DLRM-v2 / Criteo 4TB multi-hot	Recommendation	bf16	8 Sockets / 4 Nodes	227.14		65,536
MLPerf Training v3.1	ResNet-50 / ImageNet	Image Classification	bf16	32 Sockets / 16 Nodes	88.56	8,584.5 img/s	3,264
MLPerf Training v3.1	RetinaNet / Open Images	Object Detection	bf16	32 Sockets / 16 Nodes	232.4	351.2 img/s	256

Transfer Learning / Fine Tuning

Framework Version	Model	Usage	Precision	TTT (minutes)	Accuracy	Batch Size	Ranks
Transformers 4.31, Intel Extension for PyTorch 2.0.1, PEFT 0.4.0	GPT-J 6B (GLUE MNLI dataset)	Fine-tuning, Text generation task	bf16	230.40	81.6	8	1
Transformers 4.34.1, Intel PyTorch 2.1.0, PEFT 0.5.0, Intel® oneCCL v2.1.0	BioGPT (1.5 billion parameters) (PubMedQA dataset)	Fine-tuning, Response generation	bf16	48.70	79.4	8	8
Intel® TensorFlow 2.14, Horovod 0.28, Open MPI 4.1.2, Python 3.10.0	ResNet50 v1.5 (Colorectal histology dataset)	Fine-tuning, Colorectal cancer detection	fp32	8.83	94.3	32	64
Intel® TensorFlow 2.14, Horovod 0.28, Open MPI 4.1.2, Python 3.10.0	ResNet50 v1.5 (Colorectal histology dataset)	Fine-tuning, Colorectal cancer detection	bf16	4.65	94.3	32	64
Intel® TensorFlow 2.14, Horovod 0.28, Open MPI 4.1.2, Python 3.10.0	ResNet50 v1.5 (Colorectal histology dataset)	Fine-tuning, Colorectal cancer detection	fp32	6.04	93.8	32	128
Intel® TensorFlow 2.14, Horovod 0.28, Open MPI 4.1.2, Python 3.10.0	ResNet50 v1.5 (Colorectal histology dataset)	Fine-tuning, Colorectal cancer detection	bf16	4.02	94.6	32	128
Transformers 4.35.0, Intel PyTorch 2.0.100, Intel® oneCCL 2.0.100	BERTLarge Uncased (IMDb dataset)	Fine-tuning, Sentiment Analysis	fp32	61.72	93.59	64	4
Transformers 4.35.0, Intel PyTorch 2.0.100, Intel® oneCCL 2.0.100	BERTLarge Uncased (IMDb dataset)	Fine-tuning, Sentiment Analysis	bf16	18.86	93.88	64	4
Transformers 4.35.0, Intel PyTorch 2.0.100, Intel® oneCCL 2.0.100	BERTLarge Uncased (GLUE SST2 dataset)	Fine-tuning, Sentiment Analysis	fp32	14.06	92.2	256	4
Transformers 4.35.0, Intel PyTorch 2.0.100, Intel® oneCCL 2.0.100	BERTLarge Uncased (GLUE SST2 dataset)	Fine-tuning, Sentiment Analysis	bf16	3.68	92.09	256	4
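
Several workloads in the table above appear twice, once in fp32 and once in bf16, with the same batch size and rank count. As a rough illustration, the sketch below pairs those rows and divides the fp32 time to train (TTT) by the bf16 time. The `FineTuneRun` class and its field names are hypothetical; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneRun:
    """One fp32/bf16 pair from the fine-tuning table (times in minutes)."""
    workload: str
    ranks: int
    fp32_minutes: float
    bf16_minutes: float

    def bf16_speedup(self) -> float:
        # A value above 1.0 means the bf16 run finished sooner than fp32.
        return self.fp32_minutes / self.bf16_minutes


# TTT values copied from the table above.
RUNS = [
    FineTuneRun("ResNet50 v1.5, colorectal histology", 64, 8.83, 4.65),
    FineTuneRun("ResNet50 v1.5, colorectal histology", 128, 6.04, 4.02),
    FineTuneRun("BERTLarge Uncased, IMDb", 4, 61.72, 18.86),
    FineTuneRun("BERTLarge Uncased, GLUE SST2", 4, 14.06, 3.68),
]

if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.workload} ({run.ranks} ranks): "
              f"{run.bf16_speedup():.2f}x faster in bf16")
```

The accuracy column should be read alongside these ratios, since a shorter time to train is only comparable when the final accuracy is similar.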

Training Throughput

Framework Version	Model/Dataset	Usage	Precision	Throughput	Perf/Watt	Batch size
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	fp32	129.97 img/s	0.161257103	128
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	bf16	327.96 img/s	0.420294498	128
Intel PyTorch 2.1	ResNet50 v1.5	Image Recognition	bf32	146.18 img/s	0.180068983	128
Intel TensorFlow 2.14	ResNet50 v1.5 ImageNet (224 x224)	Image Recognition	fp32	137.36 img/s	0.163106335	1024
Intel TensorFlow 2.14	ResNet50 v1.5 ImageNet (224 x224)	Image Recognition	bf16	317.83 img/s	0.377479275	1024
Intel TensorFlow 2.14	ResNet50 v1.5 ImageNet (224 x224)	Image Recognition	bf32	152 img/s	0.180806014	1024
Intel PyTorch 2.1	DLRM Criteo Terabyte	Recommender	fp32	265,503.91 rec/s	323.9907136	32768
Intel PyTorch 2.1	DLRM Criteo Terabyte	Recommender	bf16	783,058.09 rec/s	980.36669	32768
Intel PyTorch 2.1	DLRM Criteo Terabyte	Recommender	bf32	369,848.15 rec/s	447.8448004	32768
Intel TensorFlow 2.14	SSD-ResNet34 COCO 2017 (1200 x1200)	Object Detection	fp32	52.49 img/s	0.069833963	896
Intel TensorFlow 2.14	SSD-ResNet34 COCO 2017 (1200 x1200)	Object Detection	bf16	190.53 img/s	0.251641022	896
Intel TensorFlow 2.14	SSD-ResNet34 COCO 2017 (1200 x1200)	Object Detection	bf32	68.08 img/s	0.089474168	896
Intel PyTorch 2.1	RNNT LibriSpeech	Speech Recognition	fp32	3.38 fps	0.00431651	32
Intel PyTorch 2.1	RNNT LibriSpeech	Speech Recognition	bf16	27.32 fps	0.032853123	64
Intel PyTorch 2.1	RNNT LibriSpeech	Speech Recognition	bf32	11.05 fps	0.013210908	32
Intel PyTorch 2.1	MaskR-CNN COCO 2017	Object Detection	fp32	3.76 img/s	0.004518796	112
Intel PyTorch 2.1	MaskR-CNN COCO 2017	Object Detection	bf16	10.04 img/s	0.011990064	112
Intel PyTorch 2.1	MaskR-CNN COCO 2017	Object Detection	bf32	3.94 img/s	0.004759719	112
Intel PyTorch 2.1	BERTLarge Wikipedia 2020/01/01 seq len=512	Natural Language Processing	fp32	3.76 sent/s	0.004518796	28
Intel PyTorch 1.13	BERTLarge Wikipedia 2020/01/01 seq len=512	Natural Language Processing	bf16	10.04 sent/s	0.011990064	56
Intel PyTorch 1.13	BERTLarge Wikipedia 2020/01/01 seq len=512	Natural Language Processing	bf32	3.94 sent/s	0.004759719	56
Intel TensorFlow 2.14	BERTLarge Wikipedia 2020/01/01 seq len=512	Natural Language Processing	fp32	4.28 sent/s	0.00517258	128
Intel TensorFlow 2.14	BERTLarge Wikipedia 2020/01/01 seq len=512	Natural Language Processing	bf16	9.75 sent/s	0.011582599	128
Intel TensorFlow 2.14	BERTLarge Wikipedia 2020/01/01 seq len=512	Natural Language Processing	bf32	4.79 sent/s	0.005754722	128
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	fp32	12,072.19 sent/s	11.53	42000
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	bf16	28,757.83 sent/s	28.89	42000
Intel TensorFlow 2.14	Transformer MLPerf	Language Translation	bf32	11,995.37 sent/s	11.78	42000

Hardware and software configuration (measured October 24, 2023):

Deep learning configuration:

Hardware configuration for Intel® Xeon® Platinum 8480+ processor (formerly code named Sapphire Rapids): 2 sockets for inference, 1 socket for training, 56 cores, 350 watts, 1024 GB (16 x 64 GB) DDR5 4800 MT/s memory, operating system CentOS* Stream 8. Using Intel® Advanced Matrix Extensions (Intel® AMX) int8 and bf16 with Intel® oneAPI Deep Neural Network Library (oneDNN) optimized kernels integrated into Intel® Extension for PyTorch*, Intel® Extension for TensorFlow*, and Intel® Distribution of OpenVINO™ toolkit. Measurements may vary.
If the dataset is not listed, a synthetic dataset was used to measure performance. Accuracy (if listed) was validated with the specified dataset.
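
The Config* column in the inference table ("4 cores per instance", "1 core per instance", "1 instance per socket") can be related to the hardware described above with simple arithmetic. The sketch below is an illustration under stated assumptions, not a description of how the benchmarks were launched: it assumes that the 56 cores are per socket, that every core is assigned to exactly one instance, and that instances do not share cores. The function name `instances_per_system` is hypothetical.

```python
def instances_per_system(cores_per_socket: int, sockets: int,
                         cores_per_instance: int) -> int:
    """Count concurrent instances when cores are split evenly.

    Assumption: instances never span sockets and never share cores, so any
    cores left over after integer division on a socket stay idle.
    """
    if cores_per_instance <= 0 or cores_per_instance > cores_per_socket:
        raise ValueError("cores_per_instance must be between 1 and cores_per_socket")
    return (cores_per_socket // cores_per_instance) * sockets


if __name__ == "__main__":
    CORES_PER_SOCKET = 56   # from the hardware configuration above
    INFERENCE_SOCKETS = 2   # "2 sockets for inference"

    # "4 cores per instance": 14 instances per socket, 28 in total.
    print(instances_per_system(CORES_PER_SOCKET, INFERENCE_SOCKETS, 4))
    # "1 core per instance": 56 instances per socket, 112 in total.
    print(instances_per_system(CORES_PER_SOCKET, INFERENCE_SOCKETS, 1))
    # "1 instance per socket": one instance owns all 56 cores, 2 in total.
    print(instances_per_system(CORES_PER_SOCKET, INFERENCE_SOCKETS, 56))
```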

Transfer learning configuration:

Hardware configuration for Intel® Xeon® Platinum 8480+ processor (formerly code named Sapphire Rapids): single-node DLSA fine-tuning and single-node Vision Transfer Learning, 56 cores, 350 watts, 16 x 64 GB DDR5 4800 MT/s memory, BIOS version EGSDREL1.SYS.8612.P03.2208120629, operating system: Ubuntu 22.04.1 LTS. Using Intel® Advanced Matrix Extensions (Intel® AMX) int8 and bf16 with Intel® oneAPI Deep Neural Network Library (oneDNN) v2.6 optimized kernels integrated into Intel® Extension for PyTorch* v1.12, and Intel® oneAPI Collective Communications Library v2021.5.2. Measurements and some software configurations may vary.

MLPerf* configuration:

Hardware configuration for MLPerf* Inference v3.1 measurements on Intel® Xeon® Platinum 8480+ processor (formerly code named Sapphire Rapids): 2 sockets for inference, 56 cores, 350 watts, 1024 GB 16 x 64 GB DDR5-4800 MT/s memory, operating system: CentOS* Stream 8. Using Intel® Advanced Matrix Extensions (Intel® AMX) int4, int8, and bf16 with Intel® oneAPI Deep Neural Network Library (oneDNN) optimized kernels integrated into Intel® Extension for PyTorch*. Measurements may vary. The model specifications and datasets used for MLPerf workloads are specified by MLCommons and viewable at MLPerf Inference: Datacenter Benchmark Suite Results.
