Slurm, which stands for Simple Linux Utility for Resource Management, is a widely used open-source HPC resource management and scheduling system that originated at Lawrence Livermore National Laboratory.
OSC will be implementing Slurm for job scheduling and resource management over the course of 2020, replacing the Torque resource manager and Moab scheduling system currently in use.
OSC will switch to Slurm on Pitzer with the deployment of the new Pitzer hardware in September 2020, and the Owens migration to Slurm will follow later this fall. By Jan 1, 2021, both the Pitzer and Owens clusters are expected to be using Slurm.
During the Slurm migration, OSC enables the PBS compatibility layer provided by Slurm in order to make the transition as smooth as possible. As a result, PBS batch scripts that worked in the previous Torque/Moab environment will mostly still work under Slurm. However, we encourage you to start converting your PBS batch scripts to Slurm scripts.
Please check the following pages on how to submit a Slurm job:
Using both the --ntasks and --ntasks-per-node options in a job script can cause unexpected resource allocation and placement due to a bug in Slurm 23. OSC users are strongly encouraged to review their job scripts for jobs that request both --ntasks and --ntasks-per-node. Jobs should request either --ntasks or --ntasks-per-node, not both.

As a first step, you can submit your PBS batch script as you did before to see whether it works. If it does not work, you can either follow this page for step-by-step instructions, or use the tables below to convert your PBS script to a Slurm script yourself. Once the job script is prepared, you can refer to this page to submit and manage your jobs.
Use | Torque/Moab | Slurm Equivalent |
---|---|---|
Script directive | #PBS | #SBATCH |
Job name | -N <name> | --job-name=<name> |
Project account | -A <account> | --account=<account> |
Queue or partition | -q queuename | --partition=queuename |
Wall time limit | -l walltime=hh:mm:ss | --time=hh:mm:ss |
Node count | -l nodes=N | --nodes=N |
Process count per node | -l ppn=M | --ntasks-per-node=M |
Memory limit | -l mem=Xgb | --mem=Xgb (the value is in MB if no unit is given) |
Request GPUs | -l nodes=N:ppn=M:gpus=G | --nodes=N --ntasks-per-node=M --gpus-per-node=G |
Request GPUs in default mode | -l nodes=N:ppn=M:gpus=G:default | |
Require pfsdir | -l nodes=N:ppn=M:pfsdir | --nodes=N --ntasks-per-node=M --gres=pfsdir |
Require 'vis' | -l nodes=N:ppn=M:gpus=G:vis | --nodes=N --ntasks-per-node=M --gpus-per-node=G --gres=vis |
Require special property | -l nodes=N:ppn=M:property | --nodes=N --ntasks-per-node=M --constraint=property |
Job array | -t <array indexes> | --array=<indexes> |
Standard output file | -o <file path> | --output=<file path>/<file name> (the path must exist, and the file name must be specified) |
Standard error file | -e <file path> | --error=<file path>/<file name> (the path must exist, and the file name must be specified) |
Job dependency | -W depend=after:jobID[:jobID...] | --dependency=after:jobID[:jobID...] |
Request event notification | -m <events> | --mail-type=<events> |
Email address | -M <email address> | --mail-user=<email address> |
Software flag | -l software=pkg1+1%pkg2+4 | --licenses=pkg1@osc:1,pkg2@osc:4 |
Require reservation | -l advres=rsvid | --reservation=rsvid |
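For example, a Torque request header converted using the table above might look like the following at the top of a Slurm script (a minimal sketch; the job name, resource values, and output path are placeholders to adapt to your own job):

#!/bin/bash
# job name, project account (replace PZS0712 with your own), and wall time
#SBATCH --job-name=myjob
#SBATCH --account=PZS0712
#SBATCH --time=01:00:00
# two whole nodes, 40 tasks per node, 8 GB of memory per node
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --mem=8gb
# the directory in the output path must already exist
#SBATCH --output=logs/myjob.out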
Info | Torque/Moab Environment Variable | Slurm Equivalent |
---|---|---|
Job ID | $PBS_JOBID | $SLURM_JOB_ID |
Job name | $PBS_JOBNAME | $SLURM_JOB_NAME |
Queue name | $PBS_QUEUE | $SLURM_JOB_PARTITION |
Submit directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
Node file | cat $PBS_NODEFILE | srun hostname \| sort -n |
Number of processes | $PBS_NP | $SLURM_NTASKS |
Number of nodes allocated | $PBS_NUM_NODES | $SLURM_JOB_NUM_NODES |
Number of processes per node | $PBS_NUM_PPN | $SLURM_TASKS_PER_NODE |
Walltime | $PBS_WALLTIME | $SLURM_TIME_LIMIT |
Job array ID | $PBS_ARRAYID | $SLURM_ARRAY_JOB_ID |
Job array index | $PBS_ARRAY_INDEX | $SLURM_ARRAY_TASK_ID |
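As an illustration, the script-body changes are usually a one-for-one swap of these variables; a minimal sketch using the Slurm equivalents above:

cd $SLURM_SUBMIT_DIR                               # was: cd $PBS_O_WORKDIR
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) running in partition $SLURM_JOB_PARTITION"
echo "Allocated $SLURM_JOB_NUM_NODES node(s) and $SLURM_NTASKS task(s)"
srun hostname | sort -n > nodes.$SLURM_JOB_ID      # was: cat $PBS_NODEFILE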
Environment variable | Description |
---|---|
$TMPDIR | Path to a node-specific temporary directory (/tmp) for a given job |
$PFSDIR | Path to the scratch storage; only present if the --gres request includes pfsdir |
$SLURM_GPUS_ON_NODE | Number of GPUs allocated to the job on each node (works with --exclusive jobs) |
$SLURM_JOB_GRES | The job's GRES request |
$SLURM_JOB_CONSTRAINT | The job's constraint request |
$SLURM_TIME_LIMIT | Job walltime in seconds |
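For example, $TMPDIR and $PFSDIR can be used for node-local and scratch I/O inside a job script; a minimal sketch (the program and file names are placeholders, and $PFSDIR is only set if the job requests --gres=pfsdir):

cp input.dat $TMPDIR/                  # stage input to the node-local /tmp directory
cd $TMPDIR
./my_program input.dat > output.dat    # hypothetical executable
cp output.dat $PFSDIR/                 # keep a copy on scratch storage
cp output.dat $SLURM_SUBMIT_DIR/       # copy results back to the submit directory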
Use | Torque/Moab Command | Slurm Equivalent |
---|---|---|
Launch a parallel program inside a job | mpiexec <args> | srun <args> |
Scatter a file to node-local file systems | pbsdcp <file> <nodelocaldir> | sbcast <file> <nodelocaldir>/<file> * |
Gather node-local files to a shared file system | pbsdcp -g <file> <shareddir> | sgather <file> <shareddir>/<file> |
* Note: sbcast does not have a recursive option, so it cannot scatter a directory in a single command; see the sbcast/sgather notes further below.
Use | Torque/Moab Command | Slurm Equivalent |
---|---|---|
Submit batch job | qsub <jobscript> | sbatch <jobscript> |
Submit interactive job | qsub -I [options] | sinteractive [options] or salloc [options] |
Users are encouraged to add the --mail-type=ALL option in their scripts to receive notifications about their jobs; please see the Slurm sbatch man page for more information. Add the --no-requeue option so that the job does not get resubmitted on node failure.
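For example, the relevant directives in a job script might look like this (a sketch; the email address is a placeholder):

#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@example.com
#SBATCH --no-requeue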
Submitting interactive jobs is a bit different in Slurm. When the job is ready, you are logged into the login node from which you submitted the job; from there, you can then log in to one of the reserved nodes.
You can use the custom tool sinteractive as:

[xwang@pitzer-login04 ~]$ sinteractive
salloc: Pending job allocation 14269
salloc: job 14269 queued and waiting for resources
salloc: job 14269 has been allocated resources
salloc: Granted job allocation 14269
salloc: Waiting for resource configuration
salloc: Nodes p0591 are ready for job
...
...
[xwang@p0593 ~]$ # can now start executing commands interactively
Or, you can use salloc as:

[user@pitzer-login04 ~]$ salloc -t 00:05:00 --ntasks-per-node=3
salloc: Pending job allocation 14209
salloc: job 14209 queued and waiting for resources
salloc: job 14209 has been allocated resources
salloc: Granted job allocation 14209
salloc: Waiting for resource configuration
salloc: Nodes p0593 are ready for job
# normal login display
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14210 serial-48 bash usee R 0:06 1 p0593
[user@pitzer-login04 ~]$ srun --jobid=14210 --pty /bin/bash
# normal login display
[user@p0593 ~]$ # can now start executing commands interactively
Use | Torque/Moab Command | Slurm Equivalent |
---|---|---|
Delete a job* | qdel <jobid> | scancel <jobid> |
Hold a job | qhold <jobid> | scontrol hold <jobid> |
Release a job | qrls <jobid> | scontrol release <jobid> |
* Users can delete their own jobs. A PI/project admin can delete jobs submitted to a project they administer.
Use | Torque/Moab Command | Slurm Equivalent |
---|---|---|
Job list summary | qstat or showq | squeue |
Detailed job information | qstat -f <jobid> or checkjob <jobid> | sstat -a <jobid> or scontrol show job <jobid> |
Job information by a user | qstat -u <user> | squeue -u <user> |
View job script (system admin only) | js <jobid> | jobscript <jobid> |
Show expected start time | showstart <job ID> | |
There are different ways to submit interactive jobs.
qsub

The qsub command is patched locally to handle interactive jobs, so in most cases you can use the qsub command as before:

[xwang@pitzer-login04 ~]$ qsub -I -l nodes=1 -A PZS0712
salloc: Pending job allocation 15387
salloc: job 15387 queued and waiting for resources
salloc: job 15387 has been allocated resources
salloc: Granted job allocation 15387
salloc: Waiting for resource configuration
salloc: Nodes p0601 are ready for job
...
[xwang@p0601 ~]$ # can now start executing commands interactively
sinteractive

You can use the custom tool sinteractive as:

[xwang@pitzer-login04 ~]$ sinteractive
salloc: Pending job allocation 14269
salloc: job 14269 queued and waiting for resources
salloc: job 14269 has been allocated resources
salloc: Granted job allocation 14269
salloc: Waiting for resource configuration
salloc: Nodes p0591 are ready for job
...
...
[xwang@p0593 ~]$ # can now start executing commands interactively
salloc

It is a little complicated if you use salloc. Below is a simple example:

[user@pitzer-login04 ~]$ salloc -t 00:30:00 --ntasks-per-node=3 srun --pty /bin/bash
salloc: Pending job allocation 2337639
salloc: job 2337639 queued and waiting for resources
salloc: job 2337639 has been allocated resources
salloc: Granted job allocation 2337639
salloc: Waiting for resource configuration
salloc: Nodes p0002 are ready for job
# normal login display
[user@p0002 ~]$ # can now start executing commands interactively
Since we have the compatibility layer installed, your current PBS scripts may still work as they are, so you should test them by submitting them as you did before and checking whether they run successfully. Below is a simple PBS job script, pbs_job.txt, that calls for a parallel run:
#PBS -l walltime=1:00:00
#PBS -l nodes=2:ppn=40
#PBS -N hello
#PBS -A PZS0712
cd $PBS_O_WORKDIR
module load intel
mpicc -O2 hello.c -o hello
mpiexec ./hello > hello_results
Submit this script on Pitzer using the command qsub pbs_job.txt; the job is scheduled successfully, as shown below:
[xwang@pitzer-login04 slurm]$ qsub pbs_job.txt
14177
You can use the jobscript command to check the job information:
[xwang@pitzer-login04 slurm]$ jobscript 14177
-------------------- BEGIN jobid=14177 --------------------
#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l nodes=2:ppn=40
#PBS -N hello
#PBS -A PZS0712
cd $PBS_O_WORKDIR
module load intel
mpicc -O2 hello.c -o hello
mpiexec ./hello > hello_results
-------------------- END jobid=14177 --------------------
Notice that #!/bin/bash has been added at the beginning of the job script in the output. This line is added by Slurm's qsub compatibility script because Slurm job scripts must have #!<SHELL> as their first line. You will see this message explicitly if you submit the script using the command sbatch pbs_job.txt:
[xwang@pitzer-login04 slurm]$ sbatch pbs_job.txt
sbatch: WARNING: Job script lacks first line beginning with #! shell. Injecting '#!/bin/bash' as first line of job script.
Submitted batch job 14180
An alternative approach is to convert the PBS job script (pbs_job.txt) to a Slurm script (slurm_job.txt) before submitting the job. The table below compares the two scripts (see this page for more information on the job submission options):
Explanations | Torque | Slurm |
---|---|---|
Line that specifies the shell | No need | #!/bin/bash |
Resource specification | #PBS -l walltime=1:00:00  #PBS -l nodes=2:ppn=40  #PBS -N hello  #PBS -A PZS0712 | #SBATCH --time=1:00:00  #SBATCH --nodes=2 --ntasks-per-node=40  #SBATCH --job-name=hello  #SBATCH --account=PZS0712 |
Variables, paths, and modules | cd $PBS_O_WORKDIR  module load intel | cd $SLURM_SUBMIT_DIR  module load intel |
Launch and run application | mpicc -O2 hello.c -o hello  mpiexec ./hello > hello_results | mpicc -O2 hello.c -o hello  srun ./hello > hello_results |
The cd $SLURM_SUBMIT_DIR line can be omitted in the Slurm script because a Slurm job always starts in your submission directory, unlike the Torque/Moab environment, where a job always starts in your home directory.

Once the script is ready, submit it using the command sbatch slurm_job.txt:
[xwang@pitzer-login04 slurm]$ sbatch slurm_job.txt
Submitted batch job 14215
This page documents the known issues for migrating jobs from Torque to Slurm.
Please be aware that $PBS_NODEFILE is a file, while $SLURM_JOB_NODELIST is a string variable. The Slurm analog of cat $PBS_NODEFILE is srun hostname | sort -n.
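If a tool still expects a PBS-style node file, one can be generated inside the job, for example (a sketch; the output file name is arbitrary):

# one hostname per allocated task, analogous to the contents of $PBS_NODEFILE
srun hostname | sort -n > $TMPDIR/nodefile
# by contrast, $SLURM_JOB_NODELIST is just a compact string such as p[0591-0593]
echo $SLURM_JOB_NODELIST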
Environment variables do not work in a Slurm directive inside a job script. For example, a job script job.txt that includes #SBATCH --output=$HOME/jobtest.out won't work in Slurm. Please use the following instead:
sbatch --output=$HOME/jobtest.out job.txt
Intel MPI (all versions through 2019.x) is configured to support the PMI and Hydra process managers. It is recommended to use srun as the MPI program launcher. This is a possible symptom of using mpiexec/mpirun:
as well as:
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
If you prefer using mpiexec/mpirun with Slurm, please add the following code to the batch script before running any MPI executable:
unset I_MPI_PMI_LIBRARY
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0   # the option -ppn only works if you set this before
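In context, a batch-script fragment that uses mpiexec with these settings might look like the following (a sketch; the executable and the -ppn value are placeholders):

module load intel
unset I_MPI_PMI_LIBRARY
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0
mpiexec -ppn 4 ./hello > hello_results    # -ppn is honored only with the settings above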
Stopping mpi4py Python processes during an interactive job session can be done only from a login node:
pbsdcp with the gather option sometimes does not work correctly. It is suggested to use sbcast to scatter and sgather to gather files instead of pbsdcp. Please be aware that there is no wildcard (*) option for sbcast/sgather, and there is no recursive option for sbcast. In addition, the destination file/directory must exist.
Here are some simple examples:
sbcast <src_file> <nodelocaldir>/<dest_file>
sgather <src_file> <shareddir>/<dest_file>
sgather -r --keep <src_dir> <shareddir>/<dest_dir>
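In a job script these are typically combined with $TMPDIR; a minimal sketch (the file names and program are placeholders):

sbcast input.dat $TMPDIR/input.dat                        # copy the input to every node's $TMPDIR
srun ./process_input                                      # hypothetical program writing $TMPDIR/result.dat on each node
sgather $TMPDIR/result.dat $SLURM_SUBMIT_DIR/result.dat   # gather one copy per node back to the submit directory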
The script below needs to use a wait command for the user-defined signal USR1 to be received by the process. The sleep process is backgrounded using &, and wait is used so that the bash shell can receive signals and execute the trap commands instead of ignoring the signals while the sleep process is running.
#!/bin/bash
#SBATCH --job-name=minimal_trap
#SBATCH --time=2:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=%x.%A.log
#SBATCH --signal=B:USR1@60

function my_handler() {
  echo "Catching signal"
  touch $SLURM_SUBMIT_DIR/job_${SLURM_JOB_ID}_caught_signal
  exit
}

trap my_handler USR1
trap my_handler TERM

sleep 3600 &
wait
reference: https://bugs.schedmd.com/show_bug.cgi?id=9715
The 'mail' command does not work in a batch job; use 'sendmail' instead, as in:
sendmail user@example.com <<EOF
subject: Output path from $SLURM_JOB_ID
from: user@example.com
...
EOF
When using sinteractive to request an interactive job, srun with no arguments allocates only a single task, even if you requested more than one task. Please pass the needed arguments to srun:
[xwang@owens-login04 ~]$ sinteractive -n 2 -A PZS0712
...
[xwang@o0019 ~]$ srun hostname
o0019.ten.osc.edu
[xwang@o0019 ~]$ srun -n 2 hostname
o0019.ten.osc.edu
o0019.ten.osc.edu
Unlike a PBS batch output file, which lived in a user-non-writable directory while the job was running, a Slurm batch output file resides under the user's home directory while the job is running. File operations such as editing and copying are permitted by the file system, but please be careful to avoid such operations while the job is running. In particular, this batch script idiom is no longer correct (e.g., for the default job output file named $SLURM_SUBMIT_DIR/slurm-<jobid>.out):
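For illustration, a hypothetical example of such a pattern (placeholder paths, not the original script) would be:

# Do not copy or edit the job's own output file from inside the job script;
# under Slurm the file is live and still being written while the job runs.
cp $SLURM_SUBMIT_DIR/slurm-${SLURM_JOB_ID}.out $HOME/saved_output/   # hypothetical path; would copy a partial file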