Problems with MVAPICH2

Resolution: Resolved

Some MVAPICH2 installations on Oakley, Ruby, and Owens, including mvapich2/2.1 and the default module mvapich2/2.2, appear to have a bug that is triggered by certain programs.  The symptoms are either 1) the program hangs or 2) the program fails with an error related to Allreduce or Bcast.

To test whether a failure is caused by this issue rather than by an error in the application software, set the environment variable MV2_USE_SLOT_SHMEM_COLL=0 in the batch job (this option disables optimizations).  If the program then runs correctly, the failure is in the MVAPICH2 library.
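The test above could be scripted roughly as follows. This is only a sketch: the helper name run_without_slot_shmem, the time limit, and the program name are made up for illustration, and it assumes the GNU "timeout" utility is available and that the MPI launcher passes the job's environment on to the MPI processes.

```shell
# Run a command with MV2_USE_SLOT_SHMEM_COLL=0, giving up after a time limit
# so that a hang (symptom 1) is reported instead of blocking the whole job.
# Usage: run_without_slot_shmem LIMIT COMMAND [ARGS...]
run_without_slot_shmem() {
    limit=$1
    shift
    # The assignment applies to this one command only, not to the rest of the script.
    MV2_USE_SLOT_SHMEM_COLL=0 timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "completed with MV2_USE_SLOT_SHMEM_COLL=0"
    elif [ "$status" -eq 124 ]; then
        # GNU timeout exits with 124 when the time limit is reached.
        echo "still running after $limit, stopped"
    else
        echo "failed with exit status $status"
    fi
    return "$status"
}

# Hypothetical usage in a batch job; ./my_program is a placeholder.
# run_without_slot_shmem 10m mpiexec ./my_program
```

If the program completes here but hangs or fails without the variable, that points to the library issue described above rather than to the application.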

This issue may affect system-installed software, such as lammps/31Mar17, but this seems to be rare.

There are several workarounds to choose from.

1)  Keep MV2_USE_SLOT_SHMEM_COLL=0 set in your batch jobs.  This may slow down your code, but it is the easiest option.

2)  Switch to mvapich2/1.9 and rebuild your code.  You'll also have to move to an older compiler.  The easiest way to make the change is with "module load modules/au2014".

3)  If you're using Intel compilers, switch to Intel MPI with "module load intelmpi".  If you use fftw3 and/or scalapack, you should use the MKL versions of these libraries, not the separately loaded modules.  Some other libraries may not be available.  Contact oschelp@osc.edu for assistance.

4)  Switch to OpenMPI.  We have OpenMPI 1.10 installed on Oakley.
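As a rough sketch, the workarounds might look like this in a batch script. Only the variable and the module names come from the text above; the launch line and the program name are placeholders, and no OpenMPI module name is shown because none is given here.

```shell
# Workaround 1: keep the optimization disabled for every launch in this job.
# (Not needed if you switch MPI libraries with workaround 2 or 3 instead.)
export MV2_USE_SLOT_SHMEM_COLL=0

# Workaround 2 (alternative): older MVAPICH2 and compiler; rebuild your code afterwards.
# module load modules/au2014

# Workaround 3 (alternative, Intel compilers): Intel MPI; rebuild your code afterwards.
# module load intelmpi

# Placeholder launch line; ./my_program stands for your own executable.
# mpiexec ./my_program
```

Pick one workaround rather than combining them, and rebuild the application whenever the MPI library or compiler changes.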

The MVAPICH2 versions with the bug are outdated and are no longer available on our clusters.