High Bandwidth Memory

Overview

Each dense compute node on Cardinal contains 2 Intel Xeon CPU Max 9470 processors. In addition to the DDR5 memory available on all other nodes on our systems, these CPUs also provide 128 GB of high bandwidth memory (HBM2e) per node, which should especially speed up memory-bound codes.

HBM Modes

All nodes on Cardinal are configured with sub-NUMA clustering in SNC4 mode. This means that the 64 GB of HBM memory on a socket is further divided into 4 independent NUMA regions, each with 16 GB of HBM memory. The same is true of the DDR memory, which is partitioned into 4 NUMA regions per socket as well. NUMA-aware applications in particular will benefit from this configuration.
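
You can verify this layout by inspecting the NUMA topology, for example with lscpu (a quick sanity check; the exact output varies by node):

lscpu | grep -i numa

On an SNC4 node this reports 16 NUMA nodes along with the CPUs assigned to each.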

The HBM on these nodes can be configured in two modes: flat mode or cache mode. Nodes in the cpu partition on Cardinal are configured with memory in flat mode. A few nodes in the cache partition are configured with memory in cache mode.

Flat mode

In flat mode, HBM is visible to applications as addressable memory. On each node, NUMA nodes 0-7 correspond to DDR memory, while nodes 8-15 correspond to the HBM. To use the HBM, the numactl tool can be used to bind memory allocations to the desired NUMA nodes.
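
To see which NUMA nodes carry HBM, you can list each node and its size with numactl:

numactl --hardware

The 8 HBM nodes (8-15) should each show roughly 16 GB, while the DDR nodes (0-7) will be larger.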

All nodes in the cpu partition are configured in flat mode.

Cache mode

In cache mode, HBM is available to applications as a level 4 cache for DDR memory. This means that no changes are required to your application or submission script in order to utilize the HBM. Unlike flat mode, you do not have explicit control over when HBM is used versus DDR. This convenience comes at the cost of slightly lower performance for most applications, due to the higher latency of cache misses. However, if your application has a high rate of data reuse on a working set that fits in HBM, it may be a good candidate for running in cache mode.

There are currently 4 nodes configured in cache mode in the cache partition.

Using HBM

Flat mode

The simplest way to ensure that your application uses HBM is to use numactl. We recommend using the --preferred-many=8-15 flag to bind to the HBM memory. This ensures that your application will attempt to use the HBM memory if it is available; if your application requests more than the available 128 GB of HBM, it will allocate as much as fits on HBM and then allocate the rest on DDR memory. To enable your application to use HBM memory, first load the numactl/2.0.18 module and then prepend the appropriate numactl command to your run command as shown in the table below.

Execution Model | DDR          | HBM
Serial          | ./a.out      | numactl --preferred-many=8-15 ./a.out
MPI             | srun ./a.out | srun numactl --preferred-many=8-15 ./a.out
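
Putting this together, a minimal batch script for an MPI job using HBM in flat mode might look like the following sketch (the job name, node count, and walltime are placeholders; adjust them for your workload):

#!/bin/bash
#SBATCH --job-name=hbm_flat
#SBATCH --partition=cpu
#SBATCH --nodes=1
#SBATCH --time=1:00:00

# Make the numactl tool available
module load numactl/2.0.18

# Prefer the HBM NUMA nodes (8-15); allocations spill to DDR if HBM fills up
srun numactl --preferred-many=8-15 ./a.out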

For more fine-grained control, libraries such as libnuma (the library underlying numactl) can be used to modify your code and explicitly set which memory is used to store data in your application.

Cache mode

If running on a node configured in cache mode, no modifications are necessary to your run script.
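
For example, a job script targeting the cache partition needs only the usual directives (a sketch; the directive values are placeholders):

#!/bin/bash
#SBATCH --job-name=hbm_cache
#SBATCH --partition=cache
#SBATCH --nodes=1
#SBATCH --time=1:00:00

# No numactl binding needed: HBM transparently caches DDR in this mode
srun ./a.out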

Profiling HBM Usage

To check how much of the HBM memory is being used, we provide a wrapper script that generates logs of memory usage using numastat. The script is located at ~support/scripts/numastat_wrapper. To use it, prepend it to your run command before numactl (or before the executable if you are not using numactl). For example, if you run with

srun numactl --preferred-many=8-15 ./a.out

then to use the wrapper, run

srun numastat_wrapper numactl --preferred-many=8-15 ./a.out

This will generate a logfile for each parallel process in the current run directory. By default, the logs will be updated every 10 seconds with new numastat information; depending on the length of your job and the number of processes, this may produce a large amount of log data. To change the sampling frequency, set the environment variable NUMASTAT_SAMPLE_INTERVAL to the number of seconds between samples.
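
For example, to sample once per minute instead of every 10 seconds:

export NUMASTAT_SAMPLE_INTERVAL=60
srun numastat_wrapper numactl --preferred-many=8-15 ./a.out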

The script ~support/scripts/summarize-numastat-logs can be used to gather information from the logs. For instance, if you ran a job with the numastat_wrapper, you will get log files named <jobname>.<jid>.<pid1>.log, <jobname>.<jid>.<pid2>.log, <jobname>.<jid>.<pid3>.log, etc. You can then call summarize-numastat-logs <jobname>.<jid>.<pid1>.log, which will generate a file called <jobname>.<jid>.<pid1>.log.summary.txt. Other output file names can be selected with the -o flag; if the output file name ends in .mp4, a video showing memory usage over time will be generated instead. Note that you can use the summary script even before your job has completed.
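
For example:

# Text summary (writes <jobname>.<jid>.<pid1>.log.summary.txt)
~support/scripts/summarize-numastat-logs <jobname>.<jid>.<pid1>.log

# Video of memory usage over time (assumes -o precedes the log file)
~support/scripts/summarize-numastat-logs -o usage.mp4 <jobname>.<jid>.<pid1>.log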
