MVAPICH2 and/or STAR-CCM+ MPI job failure and workaround

Category: 
Resolution: Resolved
Affected Software: 

Cardinal's NDR (Next Data Rate) InfiniBand hardware is not supported by some older software. This can cause problems for application software built with MVAPICH2. Typical symptoms are batch output messages like this:

[c0128.ten.osc.edu:mpi_rank_188][get_link_speed] Invalid link speed 128 

In this case we recommend that users switch to MVAPICH3. 

STAR-CCM+ encounters errors when running MPI jobs with Intel MPI or OpenMPI, displaying the following message:

ib_iface.c:1139 UCX ERROR Invalid active_speed on mlx5_0:1: 128

This issue occurs because the UCX library (v1.8) bundled with STAR-CCM+ only supports Mellanox InfiniBand EDR, while Mellanox InfiniBand NDR is used on Cardinal. As a result, STAR-CCM+ fails to correctly communicate over the newer fabric.
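To tell which of the two problems a failed job hit, it can help to search the batch output for the messages quoted above. The following is a minimal sketch, not an official tool: the function name check_ndr_symptoms and the labels it prints are made up for illustration, and the path of the output file is whatever your job actually wrote.

```shell
# Hypothetical helper: report which NDR-related symptom appears in a job output file.
# Prints "mvapich2", "ucx", or "none"; returns 1 if the file cannot be read.
check_ndr_symptoms() {
    log="$1"
    if [ ! -r "$log" ]; then
        echo "cannot read job output file: $log" >&2
        return 1
    fi
    # -F matches the text literally, so brackets and dots need no escaping.
    if grep -qF "Invalid link speed" "$log"; then
        echo "mvapich2"
    elif grep -qF "UCX ERROR Invalid active_speed" "$log"; then
        echo "ucx"
    else
        echo "none"
    fi
}

# Example (the file name is an assumption; use your own job output file):
#   check_ndr_symptoms slurm-1234567.out
```

A result of "mvapich2" points to the MVAPICH2 case, and "ucx" points to the STAR-CCM+ UCX case.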

Affected versions

18.18.06.006, 19.04.009 and possibly later versions

Workaround

The workaround for STAR-CCM+ is to bypass the UCX library for MPI communication by setting the environment variable that matches the MPI implementation in use:

For Intel MPI:

export FI_PROVIDER="verbs"

For OpenMPI:

export OMPI_MCA_btl_openib_allow_ib=1

Set the relevant variable in your job script before executing the starccm+ command.
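One way to keep a job script tidy is to select the variable from the MPI implementation you run STAR-CCM+ with. This is only a sketch under stated assumptions: the function name set_starccm_mpi_workaround and its "intel" / "openmpi" arguments are invented for this example, and the starccm+ command line itself is left as a placeholder because it depends on your case.

```shell
# Hypothetical helper: export the workaround variable for the chosen MPI implementation.
# Argument: "intel" or "openmpi" (labels chosen for this example only).
set_starccm_mpi_workaround() {
    case "$1" in
        intel)
            # Intel MPI: use the verbs provider instead of UCX.
            export FI_PROVIDER="verbs"
            ;;
        openmpi)
            # OpenMPI: allow the openib BTL on InfiniBand.
            export OMPI_MCA_btl_openib_allow_ib=1
            ;;
        *)
            echo "unknown MPI implementation: $1 (expected 'intel' or 'openmpi')" >&2
            return 1
            ;;
    esac
}

# Call the helper first, then run STAR-CCM+ as usual.
set_starccm_mpi_workaround intel
# starccm+ ...   (your existing command line goes here)
```

Because the variable is exported, it is inherited by the starccm+ process started later in the same script.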

For user software built with MVAPICH2, the workaround is to load MVAPICH3 and rebuild the software against it:

module load mvapich/3.0