Cardinal Programming Environment

Compilers

The Cardinal cluster supports C, C++, and Fortran programming languages. The available compiler suites include Intel, oneAPI, and GCC. By default, the Intel development toolchain is loaded. The table below lists the compiler commands and recommended options for compiling serial programs. For more details and best practices, please refer to our compilation guide.

The Sapphire Rapids processors that make up Cardinal support the Advanced Vector Extensions (AVX512) instruction set, but you must set the correct compiler flags to take advantage of it. AVX512 has the potential to speed up your code by a factor of 8 or more, depending on the compiler and options you would otherwise use. However, bear in mind that clock speeds decrease as the level of the instruction set increases. So, if your code does not benefit from vectorization it may be beneficial to use a lower instruction set.

In our experience, the Intel compiler usually does the best job of optimizing numerical codes and we recommend that you give it a try if you’ve been using another compiler.

With the Intel or oneAPI compilers, use -xHost and -O2 or higher. With the GNU compilers, use -march=native and -O3.

This advice assumes that you are building and running your code on Cardinal. The executables will not be portable. Of course, any highly optimized builds, such as those employing the options above, should be thoroughly validated for correctness.

LANGUAGE	INTEL	GNU	ONEAPI
C	`icc -O2 -xHost hello.c`	`gcc -O3 -march=native hello.c`	`icx -O2 -xHost hello.c`
Fortran	`ifort -O2 -xHost hello.F`	`gfortran -O3 -march=native hello.F`	`ifx -O2 -xHost hello.F`
C++	`icpc -O2 -xHost hello.cpp`	`g++ -O3 -march=native hello.cpp`	`icpx -O2 -xHost hello.cpp`
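The flag recommendations in the table above can be captured in a small lookup. This is only a sketch: `recommended_flags` and `compile_line` are hypothetical helper names, not part of the Cardinal environment, and the script prints compile lines rather than invoking a compiler.

```shell
#!/bin/bash
# Hypothetical helper (not provided on Cardinal): print the optimization
# flags this page recommends for a given compiler command.
recommended_flags() {
  case "$1" in
    icc|icpc|ifort|icx|icpx|ifx) echo "-O2 -xHost" ;;         # Intel and oneAPI
    gcc|g++|gfortran)            echo "-O3 -march=native" ;;  # GNU
    *) echo "unknown compiler: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) the compile line for a compiler and a source file.
compile_line() {
  local compiler="$1" src="$2" flags
  flags="$(recommended_flags "$compiler")" || return 1
  echo "$compiler $flags $src"
}

compile_line icx hello.c
compile_line gfortran hello.F
```

Keeping the flags in one place makes it easier to rebuild with a lower instruction set if vectorization turns out not to help your code.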

Parallel Programming

MPI

By default, OSC systems use the MVAPICH implementation of the Message Passing Interface (MPI), which is optimized for high-speed InfiniBand interconnects. MPI is a standardized library designed for parallel processing in distributed-memory environments. OSC also supports OpenMPI and Intel MPI. For more information on building MPI applications, please visit the MPI software page.

MPI programs are started with the srun command. For example,

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

srun [ options ] mpi_prog

Note: The program to be run must either be in your path or have its full path specified.

The above job script will allocate 2 CPU nodes with 8 CPU cores each. By default, srun spawns one MPI process per task requested in the Slurm batch job. Pass --ntasks and --ntasks-per-node=n to srun to change that behavior. For example,

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Run 8 processes per node
srun ./mpi_prog

# Run 4 processes per node
srun --ntasks=8 --ntasks-per-node=4 ./mpi_prog

Note: The information above applies to the MVAPICH, Intel MPI and OpenMPI installations at OSC.

Caution: mpiexec or mpirun is still supported with Intel MPI and OpenMPI, but it may not be fully compatible with our Slurm environment. We recommend using srun in all cases.
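In the second srun line of the example above, the --ntasks value is the node count multiplied by the number of processes per node. A minimal sketch of that arithmetic follows; `total_tasks` is a hypothetical helper name, and the script only prints the srun line instead of launching anything.

```shell
#!/bin/bash
# Hypothetical helper: total MPI processes for a given node count and
# processes-per-node value, i.e. the number to pass to srun --ntasks.
total_tasks() {
  local nodes="$1" per_node="$2"
  echo $((nodes * per_node))
}

# Two nodes with 4 processes per node, as in the second srun line above.
echo "srun --ntasks=$(total_tasks 2 4) --ntasks-per-node=4 ./mpi_prog"
```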

OpenMP

The Intel, oneAPI and GNU compilers understand the OpenMP set of directives, which support multithreaded programming. For more information on building OpenMP codes on OSC systems, please visit the OpenMP documentation.

An OpenMP program by default will use a number of threads equal to the number of CPUs requested in a Slurm batch job. To use a different number of threads, set the environment variable OMP_NUM_THREADS. For example,

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8

# Run 8 threads
./omp_prog

# Run 4 threads
export OMP_NUM_THREADS=4
./omp_prog

To run an OpenMP job on an exclusive node:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive

./omp_prog

Hybrid (MPI + OpenMP)

An example of running a job for hybrid code:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --exclusive

# Each Cardinal node is equipped with 96 CPU cores
# Run 8 MPI processes on each node, each spawning 12 OpenMP threads
export OMP_NUM_THREADS=12
srun --ntasks=16 --ntasks-per-node=8 --cpus-per-task=12 ./hybrid_prog
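The example above fills each node because 8 MPI processes times 12 threads equals 96 cores. A sketch of how to derive the thread count for other process counts is shown here; `threads_per_rank` is a hypothetical helper name, and the 96-core figure is taken from the comment in the job script above.

```shell
#!/bin/bash
# Cores per Cardinal node, as stated in the job script above.
CORES_PER_NODE=96

# Hypothetical helper: OpenMP threads per MPI process such that
# (processes per node) x (threads per process) uses every core on a node.
threads_per_rank() {
  local ranks_per_node="$1"
  if [ "$ranks_per_node" -le 0 ] || [ $((CORES_PER_NODE % ranks_per_node)) -ne 0 ]; then
    echo "processes per node must divide $CORES_PER_NODE evenly" >&2
    return 1
  fi
  echo $((CORES_PER_NODE / ranks_per_node))
}

# 8 processes per node, as in the example above.
threads_per_rank 8
```

The result is the value to use for both OMP_NUM_THREADS and --cpus-per-task.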

Tuning Parallel Program Performance: Process/Thread Placement

To get the maximum performance, make sure that processes and threads are located as close as possible to their data, and as close as possible to each other if they work on the same piece of data. This matters because a node is arranged into sockets and cores that have different access to RAM and caches.

When cache and memory contention between threads or processes is an issue, it is best to use a scatter distribution.

Processes and threads are placed differently depending on the computing resources you request and on the compiler and MPI implementation used to build your code. For the former, see the above examples to learn how to run a job on exclusive nodes. For the latter, this section summarizes the default behavior and how to modify placement.

OpenMP only

For all three compilers (Intel, GCC and oneAPI), purely threaded codes do not bind to particular CPU cores by default. In other words, it is possible that multiple threads are bound to the same CPU core.

The following table describes how to modify the default placements for pure threaded code:

DISTRIBUTION	Compact	Scatter/Cyclic
DESCRIPTION	Place threads as closely as possible on sockets	Distribute threads as evenly as possible across sockets
INTEL/ONEAPI	KMP_AFFINITY=compact	KMP_AFFINITY=scatter
GNU	OMP_PLACES=sockets^[1]	OMP_PROC_BIND=true OMP_PLACES=cores

[1] Threads in the same socket might be bound to the same CPU core.
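The settings in the table above could be applied in a job script before the program is launched. The following is a sketch only: `set_thread_placement` is a hypothetical helper name, and it simply exports the variables listed in the table.

```shell
#!/bin/bash
# Hypothetical helper: export the thread-placement variables from the table above.
# $1 = compiler family (intel | oneapi | gnu), $2 = distribution (compact | scatter)
set_thread_placement() {
  # Clear settings left over from a previous call so they do not conflict.
  unset KMP_AFFINITY OMP_PROC_BIND OMP_PLACES
  case "$1:$2" in
    intel:compact|oneapi:compact) export KMP_AFFINITY=compact ;;
    intel:scatter|oneapi:scatter) export KMP_AFFINITY=scatter ;;
    gnu:compact)                  export OMP_PLACES=sockets ;;
    gnu:scatter)                  export OMP_PROC_BIND=true OMP_PLACES=cores ;;
    *) echo "unsupported combination: $1 $2" >&2; return 1 ;;
  esac
}

set_thread_placement gnu scatter
echo "OMP_PROC_BIND=$OMP_PROC_BIND OMP_PLACES=$OMP_PLACES"
```

To check what the runtime actually picked up, the standard OpenMP variable OMP_DISPLAY_ENV=true asks the runtime to print its settings when the program starts.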

MPI Only

For MPI-only codes, MVAPICH first binds as many processes as possible on one socket, then allocates the remaining processes on the second socket, so that consecutive tasks are near each other. Intel MPI and OpenMPI bind processes alternately to socket 1, socket 2, socket 1, socket 2, and so on, in a cyclic distribution.

For process distribution across nodes, all MPI implementations first bind as many processes as possible on one node, then allocate the remaining processes on the second node.

The following table describes how to modify the default placements on a single node for MPI-only code with the srun command:

DISTRIBUTION (single node)	Compact	Scatter/Cyclic
DESCRIPTION	Place processes as closely as possible on sockets	Distribute processes as evenly as possible across sockets
MVAPICH^[1]	Default	MVP_CPU_BINDING_POLICY=scatter
INTEL MPI	SLURM_DISTRIBUTION=block:block srun -B "2:*:1" ./mpi_prog	SLURM_DISTRIBUTION=block:cyclic srun -B "2:*:1" ./mpi_prog
OPENMPI	SLURM_DISTRIBUTION=block:block srun -B "2:*:1" ./mpi_prog	SLURM_DISTRIBUTION=block:cyclic srun -B "2:*:1" ./mpi_prog

[1] MVP_CPU_BINDING_POLICY will not work if MVP_ENABLE_AFFINITY=0 is set.
To distribute processes evenly across nodes, please set SLURM_DISTRIBUTION=cyclic.

Hybrid (MPI + OpenMP)

For hybrid codes, each MPI process is allocated a number of cores defined by OMP_NUM_THREADS, and the threads of each process are bound to those cores. All MPI processes, along with the threads bound to them, behave similarly to what was described in the previous sections.

The following table describes how to modify the default placements on a single node for hybrid code with the srun command:

DISTRIBUTION (single node)	Compact	Scatter/Cyclic
DESCRIPTION	Place processes as closely as possible on sockets	Distribute processes as evenly as possible across sockets
MVAPICH^[1]	Default	MVP_HYBRID_BINDING_POLICY=scatter
INTEL MPI^[2]	SLURM_DISTRIBUTION=block:block	SLURM_DISTRIBUTION=block:cyclic
OPENMPI^[2]	SLURM_DISTRIBUTION=block:block	SLURM_DISTRIBUTION=block:cyclic

Summary

The above tables list the most commonly used settings for process/thread placement. Some compilers and Intel libraries may have additional options for process and thread placement beyond those mentioned on this page. For more information on a specific compiler/library, check the more detailed documentation for that library.

Using HBM

326 dense compute nodes are available with 512 GB of DDR memory and 128 GB of High Bandwidth Memory (HBM). Memory-bound applications in particular are expected to benefit from HBM, but other codes may also show some benefit.

All nodes in the cpu partition have the HBM configured in flat mode, meaning that HBM is visible to your application as addressable memory. By default, your code will use DDR memory only. To enable your application to use HBM, first load the numactl/2.0.18 module and then prepend the appropriate numactl command to your run command as shown in the table below.

Execution Model	DDR	HBM
Serial	./a.out	numactl --preferred-many=8-15 ./a.out
MPI	srun ./a.out	srun numactl --preferred-many=8-15 ./a.out
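The four cells of the table above differ only in an optional srun launcher and an optional numactl prefix. A sketch that assembles them is shown here; `launch_line` is a hypothetical helper name, the NUMA range 8-15 is copied from the table, and the script prints the command rather than running it.

```shell
#!/bin/bash
# NUMA node range used for HBM in the table above.
HBM_NODES="8-15"

# Hypothetical helper: print (do not run) the launch line for a program.
# $1 = execution model (serial | mpi), $2 = memory (ddr | hbm), $3 = program
launch_line() {
  local launcher="" prefix=""
  if [ "$1" = "mpi" ]; then
    launcher="srun "
  fi
  if [ "$2" = "hbm" ]; then
    prefix="numactl --preferred-many=$HBM_NODES "
  fi
  echo "${launcher}${prefix}$3"
}

launch_line serial hbm ./a.out
launch_line mpi hbm ./a.out
```

On a compute node, `numactl --hardware` lists the NUMA nodes and their memory sizes, which is one way to check that the range matches the HBM nodes before relying on it.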

Please visit our HBM documentation for more information.

GPU Programming

132 NVIDIA H100 GPUs are available on Cardinal. Please visit our GPU documentation.

Reference

Supercomputer:

Cardinal

Fields of Science:

Programming
