Enable High-Bandwidth Memory in Future Intel® Processors

HPC software can achieve higher levels of performance from high-bandwidth memory (HBM) in the next-generation Intel® Xeon® Scalable processor. HBM is exposed to software using three memory modes: HBM-only, flat, and cache. To help you with adoption, this video covers software-enabling considerations associated with these memory modes.

Welcome everyone. I'm Ruchira, and in this talk we are going to look at how to enable high-bandwidth memory in the next-generation Intel® Xeon® Scalable processor code-named Sapphire Rapids plus HBM. This next-generation Intel® Xeon® processor targets high-bandwidth applications in HPC and AI, with four HBM2E stacks providing 64 gigabytes of total HBM capacity. That is in addition to eight channels of DDR5.

The main focus of this talk is how software can use HBM. HBM is exposed to software using three memory modes: HBM-only mode, where only HBM is present; flat mode, where HBM and DDR are exposed as two different NUMA nodes; and cache mode, where HBM operates as a cache for DDR. These modes are selected at boot time using the BIOS menu.

Now let's look at each of these memory modes in detail. First, HBM-only mode. HBM-only mode is available when no DDR is installed. In this mode, software sees a single flat memory region of 64 gigabytes per socket, or one NUMA node per socket. The advantage is that you don't need any code changes to use it; you run your application as usual, just as you would with DDR.

There is one thing you should be aware of: the application, the operating system, and any other services you are using all have to fit within the 64-gigabyte HBM capacity.

The second mode is flat mode. In this mode, both HBM and DDR are exposed as two separate NUMA nodes. For example, on the first socket, DDR plus the cores will be exposed as NUMA node 0 and HBM will be exposed as NUMA node 1. The default is DDR, so the operating system and background processes will use DDR, leaving the entire HBM space available to your application.

The advantage of exposing HBM as a NUMA node is that you can use standard Linux* utilities to place your application in HBM. For example, you can use the standard Linux utility numactl with the -m or -p flag to allocate your application in HBM. The difference between -m and -p is that with -m you cannot exceed the HBM capacity of 64 gigabytes. With -p, allocations go to HBM first, and any additional allocations that overflow go to DDR, so you can run applications larger than 64 gigabytes.
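
As a concrete sketch, assuming HBM is exposed as NUMA node 1 on the first socket (node numbering can differ on your system) and ./my_app is a placeholder for your application binary:

    # Bind all allocations to HBM (node 1); the run cannot use more than the 64 GB of HBM
    numactl -m 1 ./my_app

    # Prefer HBM (node 1); allocations overflow to DDR once HBM is full
    numactl -p 1 ./my_app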

If you're using Intel® MPI to run your application, then instead of numactl you can use an environment variable provided by Intel MPI to place your application in HBM.
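
A minimal sketch, assuming your Intel MPI version supports the high-bandwidth-memory policy variable I_MPI_HBW_POLICY (check the documentation for your release); ./my_app is again a placeholder:

    # Prefer HBM for allocations made by the MPI ranks, falling back to DDR when HBM is full
    I_MPI_HBW_POLICY=hbw_preferred mpirun -n 4 ./my_app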

The third mode is cache mode, where the HBM acts as a hardware-managed cache for DDR, which means the contents of DDR get cached into HBM. In this mode, software does not see HBM; it sees the entire DDR capacity, and therefore you can run applications with footprints larger than 64 gigabytes. Another advantage of this mode is that you don't need to change your software; you can run your program as usual. But there is one thing you should keep in mind.

That is, the HBM is organized as a direct-mapped cache, and because of that you can get conflict misses. If two locations in DDR map to the same place in HBM, only one of them can be held in the HBM cache at a time; that's a conflict. Unfortunately, this can happen even when the footprint of an application is less than 64 gigabytes, due to physical memory fragmentation: physical memory pages can wind up anywhere in DDR, and two of those pages can conflict.

These conflicts can lead to lower performance and performance variability. In the next couple of slides, we will look at a couple of methods to preserve performance and to reduce performance variability.

Suppose you are using cache mode to run a mix of applications, some with large memory footprints and some with small footprints. You can use this method, called fake NUMA, to completely avoid conflict misses for the applications with footprints of no more than 64 gigabytes. It is provided as a Linux kernel boot option. With it, you can divide the address space into smaller NUMA nodes. For example, in this picture we have a DDR space of 256 gigabytes, and we have divided that space into four.

Now each NUMA node is 64 gigabytes. The advantage of doing that is that addresses within a 64-gigabyte node are guaranteed to be conflict free. So if you can place an application within one of those 64-gigabyte NUMA nodes, you will not see any conflict misses. You can easily do that by using a utility like numactl to place your application within one of those 64-gigabyte NUMA nodes, as sketched below.
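
Here is a sketch of both steps, assuming a kernel built with NUMA emulation support (CONFIG_NUMA_EMU) and assuming the 64-gigabyte node you target ends up as node 2 (the numbering depends on your system):

    # Kernel boot parameter (for example, appended to the GRUB command line):
    # split system memory into fake NUMA nodes of 64 GB each
    numa=fake=64G

    # After rebooting, confine a small-footprint application to one 64 GB node
    numactl -m 2 -N 2 ./my_app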

The second method is used to reduce performance variability in cache mode. It is called page randomization, or page shuffling, and it is a boot option provided by the Linux kernel.

With this option, the kernel uses a random-placement policy when it allocates pages, which leads to a more uniform distribution of pages across the entire physical memory space and, in turn, to lower variability. You can enable it with the boot options shown on the slide.
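
As a minimal sketch, assuming a kernel built with CONFIG_SHUFFLE_PAGE_ALLOCATOR, the randomization is typically enabled with this boot parameter:

    # Kernel boot parameter to enable page allocator shuffling (randomized page placement)
    page_alloc.shuffle=1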

That's all for memory modes. Now let's look at some of the useful standard Linux NUMA utilities that are used with these memory modes. The first one is numactl. With option -H, numactl shows the NUMA configuration of the entire system, and with -m and -p, which we looked at before, you can place an application in a given NUMA node.

The second utility is numastat, which is useful for observing memory consumption. With option -m you can see the memory consumption of all the NUMA nodes on the system, and with -p you can see the memory consumption of an individual process.
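
For example (my_app is a placeholder process name):

    # Show the NUMA topology: nodes, CPUs, memory sizes, and free memory per node
    numactl -H

    # Per-NUMA-node memory usage for the whole system, in megabytes
    numastat -m

    # Per-NUMA-node memory usage of an individual process (by PID or name)
    numastat -p my_app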

Similarly, there are a couple of Intel® tools that are useful for HBM enabling. The first one is the Intel® MPI Library; as you saw before, it provides an environment variable to place an application in HBM. The second one is the VTune™ Profiler, with which you can profile the memory bandwidth and memory consumption of an application.

Before we conclude, let's look at one other feature that can improve the performance of HBM: sub-NUMA clustering. With sub-NUMA clustering, you can partition a socket into multiple NUMA nodes. The picture here shows four NUMA nodes, which is called SNC-4, and this is selected at boot with BIOS menu options. The advantage is that it provides higher bandwidth and lower latency, and it is available for all memory modes. To take advantage of it, you have to design applications to be NUMA aware and to use all four NUMA nodes. This is similar to the NUMA optimizations you would do for a four-socket system. For example, if you are using MPI, you have to run four MPI ranks, one MPI rank per NUMA node.
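
As a sketch of such a run with Intel MPI, assuming the I_MPI_PIN_DOMAIN pinning control is available in your Intel MPI version and ./my_app is a placeholder:

    # Four ranks on the socket, each pinned to one of the four SNC-4 NUMA domains
    I_MPI_PIN_DOMAIN=numa mpirun -n 4 ./my_app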

Let's conclude by summarizing best practices we looked at.

First, all modes perform best when the application footprint fits within the HBM capacity, that is, 64 gigabytes per socket. When you can't fit the memory footprint of an application on the socket, you can do a few things.

The first is that you can scale out to more NUMA nodes, for example by using MPI decomposition.

The second is that, on a given socket, you can try data sharing instead of data replication. For data sharing, you can use OpenMP*.

You can also try to reduce memory overheads by reducing the sizes of the operating system file caches or the communication buffers; this is especially useful for HBM-only memory mode.
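
As one hedged example of trimming the operating system file cache on Linux (this drops only clean page-cache data and requires root privileges):

    # Flush dirty data to disk, then ask the kernel to drop the page cache, dentries, and inodes
    sync
    echo 3 > /proc/sys/vm/drop_caches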

Second, you should know the memory consumption of your application, and you can use numastat or VTune Profiler to get an understanding of that memory consumption.

Then you should evaluate your application with SNC-4 to see whether it gives you a performance benefit. Finally, if you are using cache mode, you should use fake NUMA and page shuffling to get better performance. That's how you get the best performance out of HBM. Thank you all for attending.