Media Processing Basics Tuning Guide on 4th Gen Intel® Xeon® Scalable Processors

Introduction

This guide is for users who are already familiar with media processing. It provides recommendations for configuring hardware and software that will yield reasonable baseline performance for generic media processing use cases on 4th Gen Intel® Xeon® Scalable processors.  However, media processing is a complex domain and optimal performance may require consideration beyond the scope of this tuning guide.

4th Gen Intel Xeon Scalable processors deliver workload-optimized performance with improved architecture and built-in acceleration for AI, encryption, HPC, storage, database systems, and networking. They feature unique security technologies to help protect data on-premises or in the cloud. 

Improvements that directly benefit basic media processing include increased core counts, memory performance with DDR5, and larger caches.

It’s not uncommon for applications to integrate basic media processing with artificial intelligence, content distribution, real-time streaming, or a variety of other functions.  Noteworthy features include:
 

  • New built-in accelerators for AI, HPC, networking, security, storage, and analytics
  • Intel® Ultra Path Interconnect (Intel® UPI)
  • Intel® Speed Select Technology
  • Hardware-enhanced security 
  • New flex bus I/O interface (PCIe* 5.0 + CXL) 
  • New flexible I/O interface up to 20 HSIO lanes (PCIe* 3.0)
  • Increased multisocket bandwidth with UPI 2.0 (up to 16 GT/s) 
  • Intel® Data Streaming Accelerator
     

Tuning guidance spans hardware, firmware, and software domains. Some parameters, like memory population, have only one mechanism of adjustment. Others, like scaling governors, can be adjusted through several mechanisms, such as BIOS settings or operating system APIs. A higher level of the solution stack often modifies settings made lower in the stack. To ensure the intended tuning settings are active during execution, we encourage instrumentation that reads critical parameters at runtime. If runtime settings do not reflect the specified tuning settings, it is likely that downstream firmware or software is changing them.
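As one illustration of runtime instrumentation, the sketch below reads the active CPU frequency governor of every logical CPU. It assumes a Linux host that exposes the standard cpufreq files under /sys/devices/system/cpu. The function name read_scaling_governors is hypothetical and is not part of any Intel tool.

```python
from collections import Counter
from pathlib import Path


def read_scaling_governors(cpu_root="/sys/devices/system/cpu"):
    """Return a Counter mapping governor name -> number of logical CPUs using it."""
    counts = Counter()
    # On Linux, each logical CPU reports its active governor in
    # cpuN/cpufreq/scaling_governor (assumes a cpufreq driver is loaded).
    for gov_file in Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[gov_file.read_text().strip()] += 1
    return counts


if __name__ == "__main__":
    governors = read_scaling_governors()
    if not governors:
        print("No cpufreq information found on this host.")
    for name, cpus in sorted(governors.items()):
        print(f"{name}: {cpus} logical CPUs")
```

Running this just before and just after a benchmark run is a simple way to notice whether something in the stack changed the governor in between.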

Reference Workload

Intel employs an FFmpeg-based media transcode benchmark as a reference for general media processing. Tuning recommendations in this guide seek to provide good performance across the aggregate benchmark results. Intel does not license the benchmark, but replication is possible using the guidance below.

Use Cases

The benchmark implements file-based, single stream in, single stream out for 24 use cases spanning four codecs as illustrated in the table below.

| Codec | Input Resolution | Xcode Type | Preset | ISA | GOP Length (seconds) | Frames to Encode | Output Res (matches input) | Output FPS (matches input) | Output Bitrate (Mb/s) | Max Bitrate (Mb/s) | Buffer Size (Mb) | Profile | Pre-video Switches | Other Switches | Encoder Params |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| svt av1 | FHD | 1:1 | 5 | AVX2 | 2 | all | FHD | 60 | 4 | 12 | 16 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt av1 | FHD | 1:1 | 5 | AVX3 | 2 | all | FHD | 60 | 4 | 12 | 16 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt av1 | FHD | 1:1 | 8 | AVX2 | 2 | all | UHD4Kc | 60 | 4 | 24 | 48 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt av1 | FHD | 1:1 | 8 | AVX3 | 2 | all | UHD4Kc | 60 | 4 | 24 | 20 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt av1 | FHD | 1:1 | 12 | AVX2 | 2 | all | UHD4Kc | 60 | 4 | 24 | 20 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt av1 | FHD | 1:1 | 12 | AVX3 | 2 | all | UHD4Kc | 60 | 4 | 24 | 20 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt av1 | UHD4Kc | 1:1 | 8 | AVX2 | 2 | all | UHD4Kc | 60 | 9 | 18 | 36 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt av1 | UHD4Kc | 1:1 | 8 | AVX3 | 2 | all | UHD4Kc | 60 | 9 | 18 | 36 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt av1 | UHD4Kc | 1:1 | 12 | AVX2 | 2 | all | FHD | 60 | 9 | 18 | 16 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt av1 | UHD4Kc | 1:1 | 12 | AVX3 | 2 | all | FHD | 60 | 9 | 10 | 16 | n/a | | -rc 1 -g 119 -sc_detection 0 | |
| svt hevc | FHD | 1:1 | 1 | AVX2 | 2 | all | UHD4Kc | 60 | 5 | 18 | 20 | Main | | -rc 1 -g 119 -sc_detection 0 | |
| svt hevc | FHD | 1:1 | 5 | AVX2 | 2 | all | UHD4Kc | 60 | 5 | 18 | 36 | Main | | -rc 1 -g 119 -sc_detection 0 | |
| svt hevc | FHD | 1:1 | 5 | AVX3 | 2 | all | UHD4Kc | 60 | 5 | 18 | 36 | Main | | -rc 1 -g 119 -sc_detection 0 | |
| svt hevc | FHD | 1:1 | 9 | AVX2 | 2 | all | UHD4Kc | 60 | 5 | 18 | 36 | Main | | -rc 1 -g 119 -sc_detection 0 | |
| svt hevc | UHD4Kc | 1:1 | 1 | AVX3 | 2 | all | FHD | 60 | 12 | 10 | 36 | Main10 | | -rc 1 -g 119 -sc_detection 0 | |
| svt hevc | UHD4Kc | 1:1 | 5 | AVX3 | 2 | all | FHD | 60 | 12 | 10 | 36 | Main10 | | -rc 1 -g 119 -sc_detection 0 | |
| svt hevc | UHD4Kc | 1:1 | 9 | AVX2 | 2 | all | FHD | 60 | 12 | 24 | 36 | Main10 | | -rc 1 -g 119 -sc_detection 0 | |
| svt hevc | UHD4Kc | 1:1 | 9 | AVX3 | 2 | all | FHD | 60 | 12 | 10 | 36 | Main10 | | -rc 1 -g 119 -sc_detection 0 | |
| x264 | FHD | 1:1 | fast | AVX2 | 2 | all | FHD | 60 | 6 | 8 | 24 | High | | -tune=psnr | keyint=120;min-keyint=120:sliced-threads=0;scene-cut=0;threads=4 |
| x264 | FHD | 1:1 | medium | AVX2 | 2 | all | FHD | 60 | 6 | 8 | 24 | High | | -tune=psnr | keyint=120;min-keyint=120:sliced-threads=0;scene-cut=0;threads=4 |
| x264 | FHD | 1:1 | very slow | AVX2 | 2 | all | FHD | 60 | 5 | 8 | 48 | High | | -tune=psnr | keyint=240;-min-keyint=240:sliced-threads=0;scene-cut=0;threads=8 |
| x265 | FHD | 1:1 | medium | AVX3 | 2 | all | FHD | 60 | 5 | 8 | 48 | Main | | -tune=psnr | keyint=120,min-keyint=120:pools=4 |
| x265 | FHD | 1:1 | medium | AVX2 | 2 | all | FHD | 60 | 5 | 8 | 48 | Main | | -tune=psnr | keyint=120,min-keyint=120:pools=4 |
| x265 | UHD4Kc | 1:1 | very slow | AVX2 | 2 | all | FHD | 60 | 12 | 8 | 48 | Main10 | | -tune=psnr | keyint=240;min-keyint=240;pools=8 |

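
The guide does not publish the exact command lines behind these use cases. As a hedged sketch only, the helper below shows how the columns of one x264 row might map onto standard FFmpeg and libx264 options (-preset, -tune, -profile:v, -b:v, -maxrate, -bufsize, -x264-params). The helper name, the file names, and the column-to-option mapping are all assumptions. The encoder parameters are written in the colon-separated form that -x264-params accepts, using x264's spelling scenecut where the table writes scene-cut.

```python
def x264_ffmpeg_args(src, dst, preset, bitrate_mbps, maxrate_mbps,
                     bufsize_mb, profile, x264_params):
    """Build an FFmpeg argument list for one x264 use case (illustrative mapping)."""
    # libx264 takes its private options as one colon-separated key=value string.
    params = ":".join(f"{key}={value}" for key, value in x264_params.items())
    return [
        "ffmpeg", "-y", "-i", src,
        "-c:v", "libx264",
        "-preset", preset,              # x264 spells presets without spaces, e.g. "veryslow"
        "-tune", "psnr",
        "-profile:v", profile,
        "-b:v", f"{bitrate_mbps}M",     # Output Bitrate (Mb/s)
        "-maxrate", f"{maxrate_mbps}M", # Max Bitrate (Mb/s)
        "-bufsize", f"{bufsize_mb}M",   # Buffer Size (Mb)
        "-x264-params", params,
        dst,
    ]


# Values from the "x264 FHD fast" row; input/output file names are hypothetical.
args = x264_ffmpeg_args(
    "input_fhd.mp4", "output_fhd.mp4",
    preset="fast", bitrate_mbps=6, maxrate_mbps=8, bufsize_mb=24, profile="high",
    x264_params={"keyint": 120, "min-keyint": 120, "sliced-threads": 0,
                 "scenecut": 0, "threads": 4},
)
print(" ".join(args))
```

Building the command as a list rather than a single string keeps it ready to pass to subprocess.run without shell quoting concerns.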


Input File

The input file aggregates 10 scenes from a variety of video content types including animated, live, and gaming. Each scene has a leading key frame and length of 240 frames, or 4 seconds at 60 fps. The entire input video is 2400 frames, or 40 seconds at 60 fps. 

The input file is rendered at both FHD and consumer 4K resolution and matched to the output resolution of the use case for performance measurement purposes.

CPU Core Loading Methodology

Loading cores to ensure effective processor utilization (90%+) without thrashing the scheduler is important for accurate results. Intel dispatches FFmpeg instances based on the core counts. The formula is captured in the following table.

| Codec | Resolution | MSO | Preset | FFmpeg Instances | Threads/Encode |
|---|---|---|---|---|---|
| x264 | FHD | 1:1 | very slow | Ceiling(Logical Cores/4) | 8 |
| x264 | FHD | 1:1 | medium | Ceiling(Logical Cores/2) | 8 |
| x264 | FHD | 1:1 | fast | Ceiling(Logical Cores/2) | 8 |
| x265 | UHD4Kc | 1:1 | very slow | Ceiling(Logical Cores/16) | 32 |
| x265 | FHD | 1:1 | medium | Ceiling(Logical Cores/4) | 16 |
| svt hevc | FHD | 1:1 | 1 | Ceiling(Logical Cores/8) | |
| svt hevc | FHD | 1:1 | 5 | Ceiling(Logical Cores/8) | |
| svt hevc | FHD | 1:1 | 9 | Ceiling(Logical Cores/4) | |
| svt hevc | UHD4Kc | 1:1 | 1 | Ceiling(Logical Cores/12) | |
| svt hevc | UHD4Kc | 1:1 | 5 | Ceiling(Logical Cores/12) | |
| svt hevc | UHD4Kc | 1:1 | 9 | Ceiling(Logical Cores/12) | |
| svt av1 | FHD | 1:1 | 12 | Ceiling(Logical Cores/8) | |
| svt av1 | FHD | 1:1 | 8 | Ceiling(Logical Cores/8) | |
| svt av1 | FHD | 1:1 | 5 | Ceiling(Logical Cores/8) | |
| svt av1 | UHD4Kc | 1:1 | 12 | Ceiling(Logical Cores/12) | |
| svt av1 | UHD4Kc | 1:1 | 8 | Ceiling(Logical Cores/12) | |

† Set via encoder params in FFmpeg command line
‡ Set automatically by the codec
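
The dispatch formula in the table can be written as a short lookup, sketched below. The divisors are transcribed from the table. The entries printed as "Div/2" and "Div 4" are read as division by 2 and 4, and the "UHD4K" rows are treated as UHD4Kc; both readings are assumptions. The names INSTANCE_DIVISOR and ffmpeg_instances are hypothetical.

```python
import math

# (codec, resolution, preset) -> divisor applied to the logical core count.
INSTANCE_DIVISOR = {
    ("x264", "FHD", "very slow"): 4,
    ("x264", "FHD", "medium"): 2,
    ("x264", "FHD", "fast"): 2,
    ("x265", "UHD4Kc", "very slow"): 16,
    ("x265", "FHD", "medium"): 4,
    ("svt hevc", "FHD", "1"): 8,
    ("svt hevc", "FHD", "5"): 8,
    ("svt hevc", "FHD", "9"): 4,
    ("svt hevc", "UHD4Kc", "1"): 12,
    ("svt hevc", "UHD4Kc", "5"): 12,
    ("svt hevc", "UHD4Kc", "9"): 12,
    ("svt av1", "FHD", "12"): 8,
    ("svt av1", "FHD", "8"): 8,
    ("svt av1", "FHD", "5"): 8,
    ("svt av1", "UHD4Kc", "12"): 12,
    ("svt av1", "UHD4Kc", "8"): 12,
}


def ffmpeg_instances(logical_cores, codec, resolution, preset):
    """Number of concurrent FFmpeg instances: Ceiling(logical cores / divisor)."""
    return math.ceil(logical_cores / INSTANCE_DIVISOR[(codec, resolution, preset)])


# Example: a server exposing 192 logical cores.
print(ffmpeg_instances(192, "x264", "FHD", "very slow"))   # 192 / 4
print(ffmpeg_instances(192, "svt av1", "UHD4Kc", "8"))     # 192 / 12
```

The ceiling matters when the core count is not a multiple of the divisor: 112 logical cores with a divisor of 12 yields 10 instances, not 9.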

Tuning

Platform Selection Considerations

Maximum Memory Speed

All processors in the 4th Gen Intel Xeon processor family are enabled for DDR5 memory, but maximum speed is a function of the specific processor SKU. SKUs are available that support maximum speeds of 4000 MT/s, 4400 MT/s, and 4800 MT/s. Because media transcode generally benefits from faster memory speeds, Intel recommends selecting SKUs that support 4800 MT/s for best performance.

System Settings

The sections below describe parameters that can be set via the BIOS and/or operating system. Recommended settings yield good performance for the general media processing use case. The appropriate configuration for your application may vary.

IMPORTANT: Not all machines have the same mechanisms for setting performance. Techniques can vary widely by brand, model, architecture, and BIOS. When not familiar with a system under test (SUT), users are strongly encouraged to get guidance from an informed performance engineer.

Safe & Known Default

To help ensure a safe and known starting point, reset default settings in the BIOS and host operating system. BIOS reset is typically available in the BIOS subsystem; consult OEM guidance. Operating system reset is typically documented as part of the operating system distribution; consult the developer.

General Parameters

The six parameters listed here have been widely recognized for several years. The recommended setting and a short description of each are provided.

| Tuning Parameter | Typical* Location | Recommended Setting | Description |
|---|---|---|---|
| Power & Policy | BIOS | Performance | Optimizes system for performance. |
| CPU Frequency Governors | OS | Performance | CPU set to highest available frequency. |
| Turbo Boost | BIOS | Enabled | Allows CPU to sustain max turbo frequency. |
| C-States | BIOS | Disabled | Prevents CPU transition to low power states. |
| Uncore Frequency | BIOS | Minimum | Ensure power to core is prioritized. |
| Hyperthreading | BIOS | Enabled | Enables one physical microprocessor to behave like two logical microprocessors. |

* Not all machines have the same mechanism for setting tuning parameters. Locations are shown as general guidance but are neither prescriptive nor exclusive. Refer to hardware and software reference material for clarification. Instrumenting workloads to capture actual performance parameters at runtime is strongly encouraged. 
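
As a hedged example of capturing some of these parameters at runtime, the sketch below compares three values visible to a Linux operating system against the recommended settings. It assumes the smt/active file, the intel_pstate/no_turbo file (present only when the intel_pstate driver is in use), and the cpufreq governor file are available. Values that cannot be read are skipped rather than guessed. The helper names are hypothetical.

```python
from pathlib import Path


def read_sysfs(path):
    """Return the stripped contents of a sysfs file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def interpret(smt_active, no_turbo, governor):
    """Compare raw sysfs strings with the recommended settings; return findings."""
    findings = []
    if smt_active is not None and smt_active != "1":
        findings.append("Hyperthreading appears to be disabled")
    if no_turbo is not None and no_turbo != "0":
        findings.append("Turbo Boost appears to be disabled")
    if governor is not None and governor != "performance":
        findings.append(f"CPU frequency governor is '{governor}', not 'performance'")
    return findings


if __name__ == "__main__":
    cpu = "/sys/devices/system/cpu"
    findings = interpret(
        read_sysfs(f"{cpu}/smt/active"),
        read_sysfs(f"{cpu}/intel_pstate/no_turbo"),
        read_sysfs(f"{cpu}/cpu0/cpufreq/scaling_governor"),
    )
    for line in findings or ["No deviations detected in the values that could be read"]:
        print(line)
```

BIOS-only settings such as C-states and uncore frequency are not covered by this sketch and need platform-specific tooling.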

 

 

Homeless Prefetcher

The homeless prefetcher manages demand misses into the mid-level cache. Disable it on 4-tile dies, commonly referred to as extreme core count (XCC). Enable it on monolithic medium core count (MCC) dies, which are typically preferred for professional media processing.

Sub-NUMA Cluster (SNC)

SNC provides localization benefits similar to Cluster-On-Die (COD), the feature it replaces from previous processor families, without some of COD’s downsides. SNC breaks up the last level cache (LLC) into disjoint clusters based on address range, with each cluster bound to a subset of the memory controllers in the system. This improves average latency to the LLC.
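
One way to observe the effect of SNC from the operating system is to count NUMA nodes: with SNC enabled, a Linux host typically reports more NUMA nodes than physical sockets. The sketch below is a minimal check under that assumption, reading the standard /sys/devices/system/node directory. The function name is hypothetical.

```python
import re
from pathlib import Path


def count_numa_nodes(node_root="/sys/devices/system/node"):
    """Count directories named node<N>, one per NUMA node the OS exposes."""
    root = Path(node_root)
    if not root.is_dir():
        return 0  # not Linux, or NUMA information is not exposed
    return sum(1 for entry in root.iterdir() if re.fullmatch(r"node\d+", entry.name))


if __name__ == "__main__":
    print(f"NUMA nodes visible to the OS: {count_numa_nodes()}")
```

Comparing this count with the socket count shows whether clustering is active at runtime, regardless of what the BIOS menu was set to.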

Memory Configuration

Media transcoding workloads are sensitive to memory speed and configuration. Select the fastest memory supported on the architecture. Populate each memory channel to minimize the distance data must travel to and from the CPU cores. Size the number and capacity of DIMMs to accommodate the buffering requirements of the encoder, resolution, and desired quality. Details are discussed in the sections that follow.

Memory Speed

As mentioned previously, for best performance select 4th Gen Intel Xeon processor SKUs that support a maximum speed of 4800 MT/s, and ensure your DDR5 DIMMs are rated for 4800 MT/s or faster. DIMMs faster than 4800 MT/s will not improve performance because the processor cannot run memory any faster, but DIMMs slower than 4800 MT/s will slow the memory system down to match the speed of the memory.
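
The rule in this paragraph reduces to taking a minimum, as the sketch below shows. Treating a mix of DIMM speeds as running at the slowest DIMM is a simplifying assumption, and the function name is hypothetical.

```python
def effective_memory_speed(cpu_max_mts, dimm_speeds_mts):
    """Memory runs no faster than the CPU limit or the slowest installed DIMM (MT/s)."""
    return min(cpu_max_mts, min(dimm_speeds_mts))


# Faster DIMMs do not help a 4800 MT/s SKU; slower DIMMs hold it back.
print(effective_memory_speed(4800, [5600] * 8))
print(effective_memory_speed(4800, [4400] * 8))
```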

DIMM Population

4th Gen Intel Xeon Scalable processor (formerly code named Sapphire Rapids) is an 8-channel memory architecture with DDR5 support up to 4800 MT/s. The memory controller supports up to 2 slots per channel. As a result, the typical mainboard will have 16 memory DIMM slots for each CPU (8 channels/CPU x 2 slots/channel=16 slots/CPU). For transcoding applications like the FFmpeg media benchmark, it is important to populate the computer with memory in each channel (NOT slot). Therefore, 4th Gen Intel Xeon Scalable processors should be populated with (a minimum of) 8 DIMMs per CPU meaning that every other slot can be empty. If you are working with a single CPU machine, you should have (at least) 8 DDR5 DIMMs. If you are working with a dual socket machine, you should have (at least) 16 DDR5 DIMMs.
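
The slot arithmetic above can be written out as a small sketch. The theoretical peak bandwidth figure is not from this guide: it assumes 8 bytes (64 data bits) per transfer per channel, and real workloads achieve less. Function names are hypothetical.

```python
CHANNELS_PER_CPU = 8    # from the paragraph above
SLOTS_PER_CHANNEL = 2   # from the paragraph above


def dimm_plan(sockets):
    """Total DIMM slots and the minimum DIMM count (one per channel) for a system."""
    return {
        "slots": sockets * CHANNELS_PER_CPU * SLOTS_PER_CHANNEL,
        "min_dimms": sockets * CHANNELS_PER_CPU,
    }


def peak_bandwidth_gb_per_s(speed_mts=4800, channels=CHANNELS_PER_CPU, bytes_per_transfer=8):
    """Theoretical per-socket peak in GB/s (assumption: 8 bytes per transfer per channel)."""
    return speed_mts * bytes_per_transfer * channels / 1000


print(dimm_plan(1))   # single-socket machine
print(dimm_plan(2))   # dual-socket machine
print(peak_bandwidth_gb_per_s())
```

Leaving a channel unpopulated removes its share of that peak, which is why the guidance counts channels rather than slots.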

Memory Sizing

Sizing the DIMMs can be more complicated. General guidance is to ensure at least 2 GB of free memory per logical core. With hyperthreading enabled, 4th Gen Intel Xeon Scalable processors will yield two logical cores for each physical core. For example, Intel Xeon 8468 processor is a 48C CPU intended for two-socket applications.  The 2S server provides 192 logical cores (2 Sockets * 48 physical cores/socket * 2 logical cores/physical core).  Allocating 2 GB of free memory per logical core will support most use cases up to 4K transcode.  The free memory target in this case is 384 GB (2GB/logical core * 192 logical cores).   
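
The worked example above can be reproduced with a few lines. The function name and its defaults are illustrative only; the 2 GB per logical core figure is the general guidance from this section.

```python
def free_memory_target_gb(sockets, cores_per_socket, threads_per_core=2, gb_per_logical_core=2):
    """Return (logical cores, free-memory target in GB) for a server configuration."""
    logical_cores = sockets * cores_per_socket * threads_per_core
    return logical_cores, logical_cores * gb_per_logical_core


# Two-socket server with 48 physical cores per socket and hyperthreading enabled.
logical, target_gb = free_memory_target_gb(sockets=2, cores_per_socket=48)
print(logical, target_gb)
```

Spread across the 16 DIMMs that a fully channel-populated dual-socket system requires, a 384 GB target works out to 24 GB per DIMM before firmware and operating system overhead is added.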

A few additional notes on memory sizing:

  • Many applications can get by with significantly less memory than 2 GB per logical core. Some applications may require more.
  • Firmware and operating systems carry memory overhead.
  • Memory is a significant cost driver. Profiling specific instances of end-to-end (E2E) platforms is required to minimize cost while ensuring maximum performance.

Storage, Disk Configuration, and Settings

For file-based transcode, server-class SSDs will deliver adequate I/O performance. Other applications may benefit from the implementation of RAM-disks.

Network Configuration and Setting

Generally, offline media processing applications are not network bandwidth limited. Live applications and adjacent use cases, like video production, will have varying requirements.

Related Tools and Information

There are a variety of mechanisms for setting or modifying tuning parameters. Sometimes operating systems, tools, or applications may change parameters carefully set at system start-up. To ensure that settings at workload execution match intentions, it is recommended to query the system configuration using the Intel® System Health Inspector, also known as svr-info, or the Intel® Power Thermal Utility.

 

Feedback

We value your feedback. If you have comments (positive or negative) on this guide or are seeking something that is not part of this guide, let us know.