This page documents the known issues for migrating jobs from Torque to Slurm.
$PBS_NODEFILE and $SLURM_JOB_NODELIST
Please be aware that $PBS_NODEFILE is a file, while $SLURM_JOB_NODELIST is a string variable. The Slurm analog of cat $PBS_NODEFILE is:
srun hostname | sort -n
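If a tool in the job still expects an actual node file on disk, one can be generated at the start of the job; a minimal sketch, where the path $TMPDIR/nodefile is only an illustration:
srun hostname | sort -n > $TMPDIR/nodefile
Alternatively, scontrol can expand the compact node list, giving one line per node rather than one per task:
scontrol show hostnames $SLURM_JOB_NODELIST > $TMPDIR/nodefile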
Environment variables are not evaluated in job script directives
Environment variables do not work in a Slurm directive inside a job script. For example, a job script job.txt containing the directive
#SBATCH --output=$HOME/jobtest.out
won't work in Slurm. Please use the following instead:
sbatch --output=$HOME/jobtest.out job.txt
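Here the shell expands $HOME before sbatch sees it. As an illustration, a minimal job.txt (its contents are assumed for this example) could look like:
#!/bin/bash
#SBATCH --time=5:00
# An "#SBATCH --output=$HOME/jobtest.out" directive here would not be expanded;
# pass --output on the sbatch command line instead, as shown above.
echo "jobtest"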
Using mpiexec with Intel MPI
Intel MPI (all versions through 2019.x) is configured to support the PMI and Hydra process managers. It is recommended to use srun as the MPI program launcher. A possible symptom of using mpiexec/mpirun instead is the warning:
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
If you prefer using mpiexec/mpirun with Slurm, please add the following code to the batch script before running any MPI executable:
unset I_MPI_PMI_LIBRARY
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0   # the -ppn option only works if this is set beforehand
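A minimal sketch of where this goes in a batch script (the module name and executable below are placeholders):
#!/bin/bash
#SBATCH --nodes=2 --ntasks-per-node=4
#SBATCH --time=10:00

module load intel                # placeholder; load whichever Intel MPI module you use

# workaround so mpiexec/mpirun (Hydra) behaves correctly under Slurm
unset I_MPI_PMI_LIBRARY
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0

mpiexec -ppn 4 ./my_mpi_program  # placeholder executable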
Executables with a certain MPI library using SLURM PMI2 interface
For example, mpi4py Python processes started during an interactive job session can only be stopped from a login node.
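A minimal sketch of doing this from a login node, assuming you look the job up by user name (the job ID below is a placeholder):
squeue -u $USER      # find the job ID of the interactive session
scancel 1234567      # placeholder job ID; cancelling the job also stops its mpi4py processes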
pbsdcp with Slurm
pbsdcp with the gather option sometimes does not work correctly. It is suggested to use sbcast for the scatter option and sgather for the gather option instead of pbsdcp. Please be aware that there is no wildcard (*) option for sbcast/sgather, and there is no recursive option for sbcast. In addition, the destination file/directory must exist.
Here are some simple examples:
sbcast <src_file> <nodelocaldir>/<dest_file>
sgather <src_file> <shareddir>/<dest_file>
sgather -r --keep <src_dir> <shareddir>/<dest_dir>
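A sketch of how these might be used inside a batch script, assuming an input file in the submit directory and a per-node result file (all file names are placeholders):
# scatter: copy the input file to node-local storage on every node of the job
sbcast $SLURM_SUBMIT_DIR/input.dat $TMPDIR/input.dat

# ... computation writes $TMPDIR/results.out on each node ...

# gather: copy the per-node results back to the shared directory;
# sgather appends the source hostname to each destination file name
sgather $TMPDIR/results.out $SLURM_SUBMIT_DIR/results.out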
Signal handling in Slurm
The script below needs to use a wait command for the user-defined signal USR1 to be received by the process. The sleep process is run in the background with &, and wait is used so that the bash shell can receive signals and execute the trap commands instead of ignoring the signals while the sleep process is running.
#!/bin/bash
#SBATCH --job-name=minimal_trap
#SBATCH --time=2:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=%x.%A.log
#SBATCH --signal=B:USR1@60

function my_handler() {
  echo "Catching signal"
  touch $SLURM_SUBMIT_DIR/job_${SLURM_JOB_ID}_caught_signal
  exit
}

trap my_handler USR1
trap my_handler TERM

sleep 3600 &
wait
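A sketch of how this plays out once submitted, assuming the script above is saved as minimal_trap.sh:
sbatch minimal_trap.sh
# About 60 seconds before the time limit (per --signal=B:USR1@60), Slurm sends USR1 to the
# batch shell; the handler creates job_<jobid>_caught_signal in the submit directory and exits.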
reference: https://bugs.schedmd.com/show_bug.cgi?id=9715
'mail' does not work; use 'sendmail'
The 'mail' command does not work in a batch job; use 'sendmail' instead, as in:
sendmail user@example.com <<EOF
subject: Output path from $SLURM_JOB_ID
from: user@example.com
...
EOF
'srun' with no arguments is to allocate a single task when using 'sinteractive'
srun with no arguments allocates a single task when using sinteractive to request an interactive job, even if you request more than one task. Please pass the needed arguments to srun:
[xwang@owens-login04 ~]$ sinteractive -n 2 -A PZS0712
...
[xwang@o0019 ~]$ srun hostname
o0019.ten.osc.edu
[xwang@o0019 ~]$ srun -n 2 hostname
o0019.ten.osc.edu
o0019.ten.osc.edu
Be careful not to overwrite a Slurm batch output file for a running job
Unlike a PBS batch output file, which lived in a user-non-writeable directory while the job was running, a Slurm batch output file resides under the user's home directory while the job is running. File operations, such as editing and copying, are permitted. Please be careful to avoid such operations while the job is running. In particular, batch script idioms that copy files to and from the submit directory wholesale are no longer correct for the default job output file $SLURM_SUBMIT_DIR/slurm-jobid.out, as shown in the sketch below.
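A sketch of the kind of stage-in/stage-out pattern meant here (assumed for illustration; the exact commands in a given script may differ):
cd $TMPDIR
cp -R $SLURM_SUBMIT_DIR/* .      # also copies the partially written slurm-<jobid>.out
# ... computation ...
cp -R * $SLURM_SUBMIT_DIR        # copies the stale slurm-<jobid>.out back, clobbering the live output file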