High Bandwidth Memory

Overview

Each dense compute node on Cardinal contains 2 Intel Xeon CPU Max 9470 processors. In addition to the DDR5 memory that is available on all other nodes on our systems, these CPUs together provide 128 GB of high bandwidth memory (HBM2e) per node, which should especially speed up memory-bound codes.

HBM Modes

All nodes on Cardinal are configured with sub-NUMA clustering in SNC4 mode. This means that the 64 GB of HBM on a socket is further divided into 4 independent NUMA regions, each with 16 GB of HBM. The same is true of the DDR memory, which is also partitioned into 4 NUMA regions per socket. NUMA-aware applications in particular will benefit from this configuration.
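
You can inspect this layout interactively on a compute node with numactl (a quick check, assuming the numactl command is available in your environment; the exact number of NUMA nodes reported depends on the HBM mode described below):

    # Report the NUMA nodes on this machine, the CPUs assigned to each,
    # and the amount of memory in each node
    numactl --hardware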

The HBM on these nodes can be configured in two modes: flat mode or cache mode. Nodes in the cpu partition on Cardinal are configured with memory in flat mode. A few nodes in the ? partition are configured with memory in cache mode.

Flat mode

In flat mode, HBM is visible to applications as addressable memory. On each node, NUMA nodes 0-7 correspond to DDR memory, while nodes 8-15 correspond to HBM. To use the HBM, the numactl tool can be used to bind memory to the desired NUMA regions.
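
Once your application is running, you can check where its memory actually resides with numastat, which ships alongside numactl on most systems (its availability on Cardinal is an assumption; replace <pid> with your application's process ID):

    # Show per-NUMA-node memory usage for one process; pages counted under
    # nodes 8-15 are resident in HBM
    numastat -p <pid>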

Cache mode

In cache mode, HBM acts as a transparent level 4 cache for DDR memory, so no changes are required to your application or submission script in order to utilize the HBM. Unlike flat mode, you do not have explicit control over whether data resides in HBM or DDR. This convenience comes at the cost of slightly lower performance for most applications, due to the higher latency of cache misses. However, if your application has a high rate of data reuse within a working set that fits in HBM, it may be a good candidate for running in cache mode.

Using HBM

Flat mode

The simplest way to ensure that your application uses HBM is to use numactl. We recommend using the --preferred-many=8-15 flag to bind to the HBM NUMA nodes. This ensures that your application will attempt to use the HBM if it is available. If your application requests more than the available 128 GB of HBM, it will allocate as much in HBM as fits and then allocate the rest in DDR memory. To enable your application to use HBM, first load the numactl/2.0.18 module and then prepend the appropriate numactl command to your run command, as shown in the table below.

Execution Model    DDR             HBM
Serial             ./a.out         numactl --preferred-many=8-15 ./a.out
MPI                srun ./a.out    srun numactl --preferred-many=8-15 ./a.out
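
Putting this together, a minimal Slurm batch script for an MPI application on a flat-mode node might look like the following sketch (the job name, node count, task count, and walltime are placeholders to adjust for your own job):

    #!/bin/bash
    #SBATCH --job-name=hbm_flat        # placeholder job name
    #SBATCH --partition=cpu            # flat-mode nodes on Cardinal
    #SBATCH --nodes=1                  # placeholder node count
    #SBATCH --ntasks-per-node=8        # placeholder task count
    #SBATCH --time=01:00:00            # placeholder walltime

    # Make the numactl binary (and its --preferred-many option) available
    module load numactl/2.0.18

    # Prefer the HBM NUMA nodes (8-15); if the 128 GB of HBM fills up,
    # remaining allocations fall back to DDR
    srun numactl --preferred-many=8-15 ./a.out

The --preferred-many policy is used rather than a strict --membind so that allocations beyond the available HBM fall back to DDR instead of failing.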

For more fine-grained control, libraries such as libnuma can be used to modify your code and explicitly choose which memory stores particular data structures in your application.

Cache mode

If running on a node configured in cache mode, no modifications are necessary to your run script.
