HCOLL-related failures in OpenMPI applications

Category: 
Resolution: Resolved

Several applications that use OpenMPI, including HDF5, Boost, Rmpi, ORCA, and CP2K, may fail with errors such as:

mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed

or

Caught signal 11: segmentation fault

We have identified that the issue is related to HCOLL (Hierarchical Collectives) being enabled in OpenMPI. 

To address this issue, we have disabled HCOLL by default by setting OMPI_MCA_coll_hcoll_enable=0 in all OpenMPI modules. While disabling HCOLL could impact the performance of MPI applications that rely heavily on collective operations, our tests using osu_alltoall showed no performance degradation.
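
As a minimal sketch of how this setting can be inspected or overridden from a user's shell, the commands below print the current value and then re-enable HCOLL for the current session only. Overriding is an assumption about how you might want to test your own application, not a recommendation; ./my_app is a hypothetical executable name.

```shell
# Show the value set by the loaded OpenMPI module ("unset" if nothing set it).
echo "coll_hcoll_enable: ${OMPI_MCA_coll_hcoll_enable:-unset}"

# Re-enable HCOLL for the current shell session only, e.g. to test an application.
export OMPI_MCA_coll_hcoll_enable=1

# Alternatively, pass the MCA parameter on the command line for a single run
# (./my_app is a hypothetical executable; the line is commented out on purpose):
#   mpirun --mca coll_hcoll_enable 1 ./my_app
```

An MCA parameter given on the mpirun command line takes precedence over the environment variable, so the second form affects only that one run.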

The root cause of this issue is still unclear, as HCOLL works correctly on other clusters. We are continuing to gather more information to better understand the underlying cause. For now, to ensure stability in the software environment, we will keep HCOLL disabled by default on Cardinal.

Affected versions

OpenMPI 4.1.6, 5.0.2, and later versions.