OpenMPI 4 and NVHPC MPI Compatibility Issues with SLURM HWLOC
A pure MPI application using mpirun or mpiexec with more ranks than the number of NUMA nodes may encounter an error similar to the following:
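As a general sketch only (not a confirmed fix for this specific issue; ./my_app and the rank count are placeholders), binding-related launch failures of this kind can often be worked around by relaxing process binding at launch time:
mpirun --bind-to none -np 8 ./my_app
Launching through the scheduler with srun instead of mpirun is another common alternative when mpirun and the SLURM-provided hwloc disagree.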
You may encounter the following error while running mpp-dyna jobs with multiple nodes:
[c0054:22206:0:22206] ib_mlx5_log.c:179 Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[c0054:22206:0:22206] ib_mlx5_log.c:179 RC QP 0xef8 wqe[365]: RDMA_READ s-- [rva 0x32a5cb38 rkey 0x20000] [va 0x319d3bf0 len 10200 lkey 0x2e5f98] [rqpn 0xfb8 dlid=2285 sl=0 port=1 src_path_bits=0]
forrtl: error (76): Abort trap signal
The cause of this error is unknown. Affected versions: mpp-dyna 11 and 13, when running on multiple nodes.
Users may encounter the following errors when compiling a C++ program with GCC 13:
error: 'uint64_t' in namespace 'std' does not name a type
or similar "does not name a type" errors for other fixed-width integer types such as std::uint32_t.
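This is usually caused by GCC 13's libstdc++ headers no longer including <cstdint> transitively. As a quick sketch (myfile.cpp is a placeholder; adding #include <cstdint> to the affected source files is the cleaner permanent fix), the header can also be force-included from the compile command:
g++ -include cstdint -c myfile.cpp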
Several applications using OpenMPI, including HDF5, Boost, Rmpi, ORCA, and CP2K, may fail with errors such as
mca_coll_hcoll_module_enable() coll_hcol: mca_coll_hcoll_save_coll_handlers failed
or
Caught signal 11: segmentation fault
We have identified that the issue is related to HCOLL (Hierarchical Collectives) being enabled in OpenMPI.
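As a workaround sketch (using OpenMPI's standard MCA mechanism; ./my_app is a placeholder), HCOLL can be disabled for a run by setting an environment variable before launching:
export OMPI_MCA_coll_hcoll_enable=0
or equivalently on the command line:
mpirun --mca coll_hcoll_enable 0 -np 8 ./my_app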
STAR-CCM+ encounters errors when running MPI jobs with Intel MPI or OpenMPI, displaying the following message:
ib_iface.c:1139 UCX ERROR Invalid active_speed on mlx5_0:1: 128
This issue occurs because the UCX library (v1.8) bundled with STAR-CCM+ only supports Mellanox InfiniBand EDR, while Mellanox InfiniBand NDR is used on Cardinal. As a result, STAR-CCM+ fails to correctly communicate over the newer fabric.
Affected versions: 18.18.06.006, 19.04.009, and possibly later versions.
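As a diagnostic sketch only (it does not fix the fabric mismatch by itself), you can check which UCX version is visible in your environment and compare it with the copy bundled inside the STAR-CCM+ installation:
ucx_info -v
Since the bundled UCX 1.8 predates NDR support, the general direction of a fix is to make STAR-CCM+ use a newer MPI/UCX stack provided on the cluster rather than the bundled one.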
If you are getting the error:
UnavailableInvalidChannel: HTTP 403 FORBIDDEN for channel intel <https://conda.anaconda.org/intel>
while creating a Python environment or installing Python packages, you can resolve it by running the command:
conda config --remove channels intel
If you would like to use Intel-hosted packages in your Python environment, you can access them by running the following command:
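For example (the URL below is an assumption based on Intel moving its conda packages off anaconda.org; verify it against Intel's current documentation):
conda config --add channels https://software.repos.intel.com/python/conda/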
The newly released NumPy 2.0 includes substantial internal changes, including the migration of code from C to C++. These changes break backwards compatibility, introducing numerous breaking changes to both the Python and C APIs. As a consequence, packages built against NumPy 1.x may raise ImportError messages; to remain compatible, those packages must be rebuilt against NumPy 2.0.
Recommendation for Addressing the Issue:
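One common mitigation, offered here as a general suggestion rather than an official recommendation, is to pin NumPy below 2.0 in the affected environment until the dependent packages have been rebuilt:
pip install "numpy<2"
or, in a conda environment:
conda install "numpy<2"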
You may experience a multi-node job hang if the job runs a module that requires heavy I/O, e.g., MP2 or CCSD. This can also contribute to our GPFS performance issues. We have identified the cause as an MPI I/O issue in OpenMPI 4.1. To remedy this, we will take the following procedures:
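In the meantime, a workaround commonly suggested for MPI-IO problems in OpenMPI 4.1 (an assumption here, not necessarily the procedure we will apply; ./my_app is a placeholder) is to switch from the default OMPIO component to ROMIO at launch time:
mpirun --mca io romio321 -np 8 ./my_app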
There will be changes to the MPI libraries on Pitzer after May 20: we will upgrade MOFED from 4.9 to 5.6 and recompile all OpenMPI and MVAPICH2 installations against the newer MOFED version. Users with their own MPI libraries may see job failures and will need to rebuild their applications against the updated MPI libraries.
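As a starting point for rebuilding (assuming the site uses Lmod modules; exact module names may differ), users can list the recompiled MPI installations available after the upgrade:
module spider openmpi
module spider mvapich2
and then recompile their applications against one of those modules.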