OpenMPI 4 and NVHPC MPI Compatibility Issues with SLURM HWLOC
A pure MPI application using mpirun or mpiexec with more ranks than the number of NUMA nodes may encounter an error similar to the following:
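As a general sketch only (not a confirmed fix for this specific issue; ./my_app and the rank count are placeholders), binding-related launch failures of this kind can often be worked around by relaxing process binding at launch time:
mpirun --bind-to none -np 8 ./my_app
Launching through the scheduler with srun instead of mpirun is another common alternative when mpirun and the SLURM-provided hwloc disagree.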
You may encounter the following error while running mpp-dyna jobs with multiple nodes:
[c0054:22206:0:22206] ib_mlx5_log.c:179 Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0054:22206:0:22206] ib_mlx5_log.c:179 RC QP 0xef8 wqe[365]: RDMA_READ s-- [rva 0x32a5cb38 rkey 0x20000] [va 0x319d3bf0 len 10200 lkey 0x2e5f98] [rqpn 0xfb8 dlid=2285 sl=0 port=1 src_path_bits=0]
forrtl: error (76): Abort trap signal
The cause of this error is unknown. Affected versions: mpp-dyna 11 and 13, when running on multiple nodes.
Users may encounter the following errors when compiling a C++ program with GCC 13:
error: 'uint64_t' in namespace 'std' does not name a type
or similar "does not name a type" errors for other fixed-width integer types such as std::uint32_t.
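This is usually caused by GCC 13's libstdc++ headers no longer including <cstdint> transitively. As a quick sketch (myfile.cpp is a placeholder; adding #include <cstdint> to the affected source files is the cleaner permanent fix), the header can also be force-included from the compile command:
g++ -include cstdint -c myfile.cpp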
Several applications using OpenMPI, including HDF5, Boost, Rmpi, ORCA, and CP2K, may fail with errors such as
mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
or
Caught signal 11: segmentation fault
We have identified that the issue is related to HCOLL (Hierarchical Collectives) being enabled in OpenMPI.
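As a workaround sketch (using OpenMPI's standard MCA mechanism; ./my_app is a placeholder), HCOLL can be disabled for a run by setting an environment variable before launching:
export OMPI_MCA_coll_hcoll_enable=0
or equivalently on the command line:
mpirun --mca coll_hcoll_enable 0 -np 8 ./my_app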
STAR-CCM+ encounters errors when running MPI jobs with Intel MPI or OpenMPI, displaying the following message:
ib_iface.c:1139 UCX ERROR Invalid active_speed on mlx5_0:1: 128
This issue occurs because the UCX library (v1.8) bundled with STAR-CCM+ only supports Mellanox InfiniBand EDR, while Mellanox InfiniBand NDR is used on Cardinal. As a result, STAR-CCM+ fails to correctly communicate over the newer fabric.
Affected versions: 18.18.06.006, 19.04.009, and possibly later versions.
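As a diagnostic sketch only (it does not fix the fabric mismatch by itself), you can check which UCX version is visible in your environment and compare it with the copy bundled inside the STAR-CCM+ installation:
ucx_info -v
Since the bundled UCX 1.8 predates NDR support, the general direction of a fix is to make STAR-CCM+ use a newer MPI/UCX stack provided on the cluster rather than the bundled one.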
If you are getting the error:
UnavailableInvalidChannel: HTTP 403 FORBIDDEN for channel intel <https://conda.anaconda.org/intel>
while creating a Python environment or installing Python packages, you can resolve it by running the command:
conda config --remove channels intel
If you would like to use Intel-hosted packages in your Python environment, you can access them by running the following command:
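For example (the URL below is an assumption based on Intel moving its conda packages off anaconda.org; verify it against Intel's current documentation):
conda config --add channels https://software.repos.intel.com/python/conda/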
The newly released NumPy 2.0 includes substantial internal changes, including the migration of code from C to C++. These changes break backwards compatibility, introducing numerous breaking changes to both the Python and C APIs. As a consequence, packages built against NumPy 1.x may raise ImportError messages; to remain compatible, those packages must be rebuilt against NumPy 2.0.
Recommendation for Addressing the Issue:
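One common mitigation, offered here as a general suggestion rather than an official recommendation, is to pin NumPy below 2.0 in the affected environment until the dependent packages have been rebuilt:
pip install "numpy<2"
or, in a conda environment:
conda install "numpy<2"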
You may experience a multi-node job hang if the job runs a module that requires heavy I/O, e.g., MP2 or CCSD. This can also contribute to our GPFS performance issues. We have identified the cause as an MPI I/O issue in OpenMPI 4.1. To remedy this, we will take the following procedures:
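In the meantime, a workaround commonly suggested for MPI-IO problems in OpenMPI 4.1 (an assumption here, not necessarily the procedure we will apply; ./my_app is a placeholder) is to switch from the default OMPIO component to ROMIO at launch time:
mpirun --mca io romio321 -np 8 ./my_app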
There will be changes to the MPI libraries on Pitzer after May 20: we will upgrade MOFED from 4.9 to 5.6 and recompile all OpenMPI and MVAPICH2 installations against the newer MOFED version. Users with their own MPI libraries may see job failures and will need to rebuild their applications against the updated MPI libraries.
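As a starting point for rebuilding (assuming the site uses Lmod modules; exact module names may differ), users can list the recompiled MPI installations available after the upgrade:
module spider openmpi
module spider mvapich2
and then recompile their applications against one of those modules.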