Slurm, which stands for Simple Linux Utility for Resource Management, is a widely used open-source HPC resource management and scheduling system that originated at Lawrence Livermore National Laboratory.
OSC will be implementing Slurm for job scheduling and resource management over the course of 2020, replacing the Torque resource manager and Moab scheduling system currently in use.
OSC will switch to Slurm on Pitzer with the deployment of the new Pitzer hardware in September 2020, and the Owens migration to Slurm will follow later this fall. By Jan 1, 2021, both the Pitzer and Owens clusters are expected to be using Slurm.
During the Slurm migration, OSC enables the PBS compatibility layer provided by Slurm in order to make the transition as smooth as possible. As a result, PBS batch scripts that worked in the previous Torque/Moab environment will mostly still work under Slurm. However, we encourage you to start converting your PBS batch scripts to Slurm scripts.
Please check the following pages on how to submit a Slurm job:
Using both the --ntasks and --ntasks-per-node options in a job script can cause unexpected resource allocation and placement due to a bug in Slurm 23. OSC users are strongly encouraged to review their job scripts for jobs that request both --ntasks and --ntasks-per-node. Jobs should request either --ntasks or --ntasks-per-node, not both.

As a first step, you can submit your PBS batch script as you did before to see whether it works. If it does not work, you can either follow this page for step-by-step instructions, or use the tables below to convert your PBS script to a Slurm script yourself. Once the job script is prepared, you can refer to this page to submit and manage your jobs.
Use | Torque/Moab | Slurm Equivalent |
---|---|---|
Script directive | #PBS | #SBATCH |
Job name | -N <name> | --job-name=<name> |
Project account | -A <account> | --account=<account> |
Queue or partition | -q queuename | --partition=queuename |
Wall time limit | -l walltime=hh:mm:ss | --time=hh:mm:ss |
Node count | -l nodes=N | --nodes=N |
Process count per node | -l ppn=M | --ntasks-per-node=M |
Memory limit | -l mem=Xgb | --mem=Xgb (the value is in MB if no unit is given) |
Request GPUs | -l nodes=N:ppn=M:gpus=G | --nodes=N --ntasks-per-node=M --gpus-per-node=G |
Request GPUs in default mode | -l nodes=N:ppn=M:gpus=G:default | |
Require pfsdir | -l nodes=N:ppn=M:pfsdir | --nodes=N --ntasks-per-node=M --gres=pfsdir |
Require 'vis' | -l nodes=N:ppn=M:gpus=G:vis | --nodes=N --ntasks-per-node=M --gpus-per-node=G --gres=vis |
Require special property | -l nodes=N:ppn=M:property | --nodes=N --ntasks-per-node=M --constraint=property |
Job array | -t <array indexes> | --array=<indexes> |
Standard output file | -o <file path> | --output=<file path>/<file name> (the path must exist, and the file name must be specified) |
Standard error file | -e <file path> | --error=<file path>/<file name> (the path must exist, and the file name must be specified) |
Job dependency | -W depend=after:jobID[:jobID...] | --dependency=after:jobID[:jobID...] |
Request event notification | -m <events> | --mail-type=<events> |
Email address | -M <email address> | --mail-user=<email address> |
Software flag | -l software=pkg1+1%pkg2+4 | --licenses=pkg1@osc:1,pkg2@osc:4 |
Require reservation | -l advres=rsvid | --reservation=rsvid |
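For example, a Torque request header converted using the table above might look like the following at the top of a Slurm script (a minimal sketch; the job name, resource values, and output path are placeholders to adapt to your own job):

#!/bin/bash
# job name, project account (replace PZS0712 with your own), and wall time
#SBATCH --job-name=myjob
#SBATCH --account=PZS0712
#SBATCH --time=01:00:00
# two whole nodes, 40 tasks per node, 8 GB of memory per node
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --mem=8gb
# the directory in the output path must already exist
#SBATCH --output=logs/myjob.out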
Info | Torque/Moab Environment Variable | Slurm Equivalent |
---|---|---|
Job ID | $PBS_JOBID | $SLURM_JOB_ID |
Job name | $PBS_JOBNAME | $SLURM_JOB_NAME |
Queue name | $PBS_QUEUE | $SLURM_JOB_PARTITION |
Submit directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
Node file | cat $PBS_NODEFILE | srun hostname \| sort -n |
Number of processes | $PBS_NP | $SLURM_NTASKS |
Number of nodes allocated | $PBS_NUM_NODES | $SLURM_JOB_NUM_NODES |
Number of processes per node | $PBS_NUM_PPN | $SLURM_TASKS_PER_NODE |
Walltime | $PBS_WALLTIME | $SLURM_TIME_LIMIT |
Job array ID | $PBS_ARRAYID | $SLURM_ARRAY_JOB_ID |
Job array index | $PBS_ARRAY_INDEX | $SLURM_ARRAY_TASK_ID |
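As an illustration, the script-body changes are usually a one-for-one swap of these variables; a minimal sketch using the Slurm equivalents above:

cd $SLURM_SUBMIT_DIR                               # was: cd $PBS_O_WORKDIR
echo "Job $SLURM_JOB_ID ($SLURM_JOB_NAME) running in partition $SLURM_JOB_PARTITION"
echo "Allocated $SLURM_JOB_NUM_NODES node(s) and $SLURM_NTASKS task(s)"
srun hostname | sort -n > nodes.$SLURM_JOB_ID      # was: cat $PBS_NODEFILE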
Environment variable | Description |
---|---|
$TMPDIR | Path to a node-specific temporary directory (/tmp) for a given job |
$PFSDIR | Path to the scratch storage; only present if the --gres request includes pfsdir |
$SLURM_GPUS_ON_NODE | Number of GPUs allocated to the job on each node (works with --exclusive jobs) |
$SLURM_JOB_GRES | The job's GRES request |
$SLURM_JOB_CONSTRAINT | The job's constraint request |
$SLURM_TIME_LIMIT | Job walltime in seconds |
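For example, $TMPDIR and $PFSDIR can be used for node-local and scratch I/O inside a job script; a minimal sketch (the program and file names are placeholders, and $PFSDIR is only set if the job requests --gres=pfsdir):

cp input.dat $TMPDIR/                  # stage input to the node-local /tmp directory
cd $TMPDIR
./my_program input.dat > output.dat    # hypothetical executable
cp output.dat $PFSDIR/                 # keep a copy on scratch storage
cp output.dat $SLURM_SUBMIT_DIR/       # copy results back to the submit directory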
Use | Torque/Moab Command | Slurm Equivalent |
---|---|---|
Launch a parallel program inside a job | mpiexec <args> | srun <args> |
Scatter a file to node-local file systems | pbsdcp <file> <nodelocaldir> | sbcast <file> <nodelocaldir>/<file> * |
Gather node-local files to a shared file system | pbsdcp -g <file> <shareddir> | sgather <file> <shareddir>/<file> |
* Note: sbcast does not have a recursive option, so it cannot scatter a directory in a single command; see the sbcast/sgather notes further below.
Use | Torque/Moab Command | Slurm Equivalent |
---|---|---|
Submit batch job | qsub <jobscript> | sbatch <jobscript> |
Submit interactive job | qsub -I [options] | sinteractive [options] or salloc [options] |
Users are encouraged to add the --mail-type=ALL option in their scripts to receive notifications about their jobs; please see the Slurm sbatch man page for more information. Add the --no-requeue option so that the job does not get resubmitted on node failure.
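For example, the relevant directives in a job script might look like this (a sketch; the email address is a placeholder):

#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@example.com
#SBATCH --no-requeue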
Submitting interactive jobs is a bit different in Slurm. When the job is ready, you are logged into the login node from which you submitted the job; from there, you can then log in to one of the reserved nodes.
You can use the custom tool sinteractive as:

[xwang@pitzer-login04 ~]$ sinteractive
salloc: Pending job allocation 14269
salloc: job 14269 queued and waiting for resources
salloc: job 14269 has been allocated resources
salloc: Granted job allocation 14269
salloc: Waiting for resource configuration
salloc: Nodes p0591 are ready for job
...
...
[xwang@p0593 ~]$ # can now start executing commands interactively
Or, you can use salloc as:

[user@pitzer-login04 ~]$ salloc -t 00:05:00 --ntasks-per-node=3
salloc: Pending job allocation 14209
salloc: job 14209 queued and waiting for resources
salloc: job 14209 has been allocated resources
salloc: Granted job allocation 14209
salloc: Waiting for resource configuration
salloc: Nodes p0593 are ready for job
# normal login display
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
14210 serial-48 bash usee R 0:06 1 p0593
[user@pitzer-login04 ~]$ srun --jobid=14210 --pty /bin/bash
# normal login display
[user@p0593 ~]$ # can now start executing commands interactively
Use | Torque/Moab Command | Slurm Equivalent |
---|---|---|
Delete a job* | qdel <jobid> | scancel <jobid> |
Hold a job | qhold <jobid> | scontrol hold <jobid> |
Release a job | qrls <jobid> | scontrol release <jobid> |
* Users can delete their own jobs. A PI/project admin can delete jobs submitted to a project they administer.
Use | Torque/Moab Command | Slurm Equivalent |
---|---|---|
Job list summary | qstat or showq | squeue |
Detailed job information | qstat -f <jobid> or checkjob <jobid> | sstat -a <jobid> or scontrol show job <jobid> |
Job information by a user | qstat -u <user> | squeue -u <user> |
View job script (system admin only) | js <jobid> | jobscript <jobid> |
Show expected start time | showstart <job ID> | |
There are different ways to submit interactive jobs.
qsub

The qsub command is patched locally to handle interactive jobs, so in most cases you can use the qsub command as before:

[xwang@pitzer-login04 ~]$ qsub -I -l nodes=1 -A PZS0712
salloc: Pending job allocation 15387
salloc: job 15387 queued and waiting for resources
salloc: job 15387 has been allocated resources
salloc: Granted job allocation 15387
salloc: Waiting for resource configuration
salloc: Nodes p0601 are ready for job
...
[xwang@p0601 ~]$ # can now start executing commands interactively
sinteractive

You can use the custom tool sinteractive as:

[xwang@pitzer-login04 ~]$ sinteractive
salloc: Pending job allocation 14269
salloc: job 14269 queued and waiting for resources
salloc: job 14269 has been allocated resources
salloc: Granted job allocation 14269
salloc: Waiting for resource configuration
salloc: Nodes p0591 are ready for job
...
...
[xwang@p0593 ~]$ # can now start executing commands interactively
salloc

It is a little complicated if you use salloc. Below is a simple example:

[user@pitzer-login04 ~]$ salloc -t 00:30:00 --ntasks-per-node=3 srun --pty /bin/bash
salloc: Pending job allocation 2337639
salloc: job 2337639 queued and waiting for resources
salloc: job 2337639 has been allocated resources
salloc: Granted job allocation 2337639
salloc: Waiting for resource configuration
salloc: Nodes p0002 are ready for job
# normal login display
[user@p0002 ~]$ # can now start executing commands interactively
Since we have the compatibility layer installed, your current PBS scripts may still work as they are, so you should test them by submitting them as you did before and checking whether they run successfully. Below is a simple PBS job script, pbs_job.txt, that calls for a parallel run:
#PBS -l walltime=1:00:00
#PBS -l nodes=2:ppn=40
#PBS -N hello
#PBS -A PZS0712
cd $PBS_O_WORKDIR
module load intel
mpicc -O2 hello.c -o hello
mpiexec ./hello > hello_results
Submit this script on Pitzer using the command qsub pbs_job.txt; the job is scheduled successfully, as shown below:
[xwang@pitzer-login04 slurm]$ qsub pbs_job.txt
14177
You can use the jobscript command to check the job information:
[xwang@pitzer-login04 slurm]$ jobscript 14177
-------------------- BEGIN jobid=14177 --------------------
#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l nodes=2:ppn=40
#PBS -N hello
#PBS -A PZS0712
cd $PBS_O_WORKDIR
module load intel
mpicc -O2 hello.c -o hello
mpiexec ./hello > hello_results
-------------------- END jobid=14177 --------------------
Notice that #!/bin/bash has been added at the beginning of the job script in the output. This line is added by Slurm's qsub compatibility script because Slurm job scripts must have #!<SHELL> as their first line. You will see this message explicitly if you submit the script using the command sbatch pbs_job.txt:
[xwang@pitzer-login04 slurm]$ sbatch pbs_job.txt
sbatch: WARNING: Job script lacks first line beginning with #! shell. Injecting '#!/bin/bash' as first line of job script.
Submitted batch job 14180
An alternative approach is to convert the PBS job script (pbs_job.txt) to a Slurm script (slurm_job.txt) before submitting the job. The table below compares the two scripts (see this page for more information on the job submission options):
Explanations | Torque | Slurm |
---|---|---|
Line that specifies the shell | No need | #!/bin/bash |
Resource specification | #PBS -l walltime=1:00:00  #PBS -l nodes=2:ppn=40  #PBS -N hello  #PBS -A PZS0712 | #SBATCH --time=1:00:00  #SBATCH --nodes=2 --ntasks-per-node=40  #SBATCH --job-name=hello  #SBATCH --account=PZS0712 |
Variables, paths, and modules | cd $PBS_O_WORKDIR  module load intel | cd $SLURM_SUBMIT_DIR  module load intel |
Launch and run application | mpicc -O2 hello.c -o hello  mpiexec ./hello > hello_results | mpicc -O2 hello.c -o hello  srun ./hello > hello_results |
The cd $SLURM_SUBMIT_DIR line can be omitted in the Slurm script because a Slurm job always starts in your submission directory, unlike the Torque/Moab environment, where a job always starts in your home directory.

Once the script is ready, submit it using the command sbatch slurm_job.txt:
[xwang@pitzer-login04 slurm]$ sbatch slurm_job.txt
Submitted batch job 14215
This page documents the known issues for migrating jobs from Torque to Slurm.
Please be aware that $PBS_NODEFILE is a file, while $SLURM_JOB_NODELIST is a string variable. The Slurm analog of cat $PBS_NODEFILE is srun hostname | sort -n.
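If a tool still expects a PBS-style node file, one can be generated inside the job, for example (a sketch; the output file name is arbitrary):

# one hostname per allocated task, analogous to the contents of $PBS_NODEFILE
srun hostname | sort -n > $TMPDIR/nodefile
# by contrast, $SLURM_JOB_NODELIST is just a compact string such as p[0591-0593]
echo $SLURM_JOB_NODELIST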
Environment variables do not work in a Slurm directive inside a job script. For example, a job script job.txt that includes #SBATCH --output=$HOME/jobtest.out won't work in Slurm. Please use the following instead:
sbatch --output=$HOME/jobtest.out job.txt
Intel MPI (all versions through 2019.x) is configured to support the PMI and Hydra process managers. It is recommended to use srun as the MPI program launcher. This is a possible symptom of using mpiexec/mpirun:
as well as:
MPI startup(): Warning: I_MPI_PMI_LIBRARY will be ignored since the hydra process manager was found
If you prefer using mpiexec/mpirun with Slurm, please add the following code to the batch script before running any MPI executable:
unset I_MPI_PMI_LIBRARY
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0   # the option -ppn only works if you set this before
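In context, a batch-script fragment that uses mpiexec with these settings might look like the following (a sketch; the executable and the -ppn value are placeholders):

module load intel
unset I_MPI_PMI_LIBRARY
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0
mpiexec -ppn 4 ./hello > hello_results    # -ppn is honored only with the settings above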
Stopping mpi4py Python processes during an interactive job session can be done only from a login node:
pbsdcp with the gather option sometimes does not work correctly. It is suggested to use sbcast to scatter and sgather to gather files instead of pbsdcp. Please be aware that there is no wildcard (*) option for sbcast/sgather, and there is no recursive option for sbcast. In addition, the destination file/directory must exist.
Here are some simple examples:
sbcast <src_file> <nodelocaldir>/<dest_file>
sgather <src_file> <shareddir>/<dest_file>
sgather -r --keep <src_dir> <shareddir>/<dest_dir>
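In a job script these are typically combined with $TMPDIR; a minimal sketch (the file names and program are placeholders):

sbcast input.dat $TMPDIR/input.dat                        # copy the input to every node's $TMPDIR
srun ./process_input                                      # hypothetical program writing $TMPDIR/result.dat on each node
sgather $TMPDIR/result.dat $SLURM_SUBMIT_DIR/result.dat   # gather one copy per node back to the submit directory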
The script below needs to use a wait command for the user-defined signal USR1 to be received by the process. The sleep process is backgrounded using &, and wait is used so that the bash shell can receive signals and execute the trap commands instead of ignoring the signals while the sleep process is running.
#!/bin/bash
#SBATCH --job-name=minimal_trap
#SBATCH --time=2:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=%x.%A.log
#SBATCH --signal=B:USR1@60

function my_handler() {
  echo "Catching signal"
  touch $SLURM_SUBMIT_DIR/job_${SLURM_JOB_ID}_caught_signal
  exit
}

trap my_handler USR1
trap my_handler TERM

sleep 3600 &
wait
reference: https://bugs.schedmd.com/show_bug.cgi?id=9715
The 'mail' command does not work in a batch job; use 'sendmail' instead, as in:
sendmail user@example.com <<EOF
subject: Output path from $SLURM_JOB_ID
from: user@example.com
...
EOF
When using sinteractive to request an interactive job, srun with no arguments allocates only a single task, even if you requested more than one task. Please pass the needed arguments to srun:
[xwang@owens-login04 ~]$ sinteractive -n 2 -A PZS0712
...
[xwang@o0019 ~]$ srun hostname
o0019.ten.osc.edu
[xwang@o0019 ~]$ srun -n 2 hostname
o0019.ten.osc.edu
o0019.ten.osc.edu
Unlike a PBS batch output file, which lived in a user-non-writable directory while the job was running, a Slurm batch output file resides under the user's home directory while the job is running. File operations such as editing and copying are permitted by the file system, but please be careful to avoid such operations while the job is running. In particular, this batch script idiom is no longer correct (e.g., for the default job output file named $SLURM_SUBMIT_DIR/slurm-<jobid>.out):
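For illustration, a hypothetical example of such a pattern (placeholder paths, not the original script) would be:

# Do not copy or edit the job's own output file from inside the job script;
# under Slurm the file is live and still being written while the job runs.
cp $SLURM_SUBMIT_DIR/slurm-${SLURM_JOB_ID}.out $HOME/saved_output/   # hypothetical path; would copy a partial file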