Combining the --ntasks and --ntasks-per-node options in a job script can cause unexpected resource allocations and placement due to a bug in Slurm 23. OSC users are strongly encouraged to review their job scripts for jobs that request both --ntasks and --ntasks-per-node. Jobs should request either --ntasks or --ntasks-per-node, not both.
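For example, a request like the following (the task counts here are illustrative only) should be rewritten to use just one of the two options:
# Avoid combining the two options:
#   #SBATCH --ntasks=80
#   #SBATCH --ntasks-per-node=40
# Request one or the other instead, for example:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40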
A job script is a text file containing job setup information for the batch system, followed by commands to be executed. It can be created using any text editor and may be given any name. Some people like to name their scripts something like myscript.job or myscript.sh, but myscript works just as well.
A job script is simply a shell script. It consists of Slurm directives, comments, and executable statements. The # character indicates a comment, but lines beginning with #SBATCH are interpreted as Slurm directives. Blank lines can be included for readability.
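As a minimal sketch (the account, job name, resource requests, and program name below are placeholders to adapt), a complete job script might look like this:
#!/bin/bash
#SBATCH --account=pas1234
#SBATCH --job-name=example_job
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

# Executable section: ordinary shell commands follow the directives
echo "Running on $(hostname)"
./my_program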
Contents
- SBATCH header lines
- Resource limits
- Executable section
- Considerations for parallel jobs
- Batch script examples
SBATCH header lines
A job script must start with a shabang #!
(#!/bin/bash
is commonly used but you can choose others) following by several lines starting with #SBATCH
. These are Slurm SBATCH directives or header lines. They provide job setup information used by Slurm, including resource requests, email options, and more. The header lines may appear in any order, but they must precede any executable lines in your script. Alternatively, you may provide these directives (without the #SBATCH
notation) on the command line with the sbatch
command.
$ sbatch --jobname=test_job myscript.sh
Resource limits
The options described below are used to request resources, including nodes, memory, time, and software licenses.
Walltime
The walltime limit is the maximum time your job will be allowed to run, given as an integer number of minutes or in hours:minutes:seconds format. This is elapsed time. If your job exceeds the requested time, the batch system will kill it. If your job ends early, you will be charged only for the time used.
The default value for walltime is 1:00:00 (one hour).
To request 20 hours of wall clock time:
#SBATCH --time=20:00:00
It is important to carefully estimate the time your job will take. An underestimate will lead to your job being killed. A large overestimate may prevent your job from being backfilled or fitting into an empty time slot.
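Slurm also accepts a days-hours:minutes:seconds format. For example, to request two and a half days:
#SBATCH --time=2-12:00:00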
Tasks, cores (cpu), nodes and GPUs
Resource limits specify not just the number of nodes but also the properties of those nodes. The properties differ between clusters but may include the number of cores per node, the number of GPUs per node (gpus), and the type of node.
Slurm uses the term task, which can be thought of as the number of processes started. Making sure that the number of tasks and the number of cores per task are set correctly is important when using an MPI launcher such as srun.
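As an illustration (the numbers are arbitrary), the cores allocated on a node are the product of the tasks per node and the cores per task:
# 2 tasks per node x 20 cores per task = 40 cores allocated on the node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=20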
Serial job
For example, on a node with 40 cores, one job might request 20 cores and another job might request all 40 cores of the node. Both are serial (single-node) jobs.
To request one CPU core (sequential job), do not add any SLURM directives. The default is one node, one core, and one task.
To request 6 CPU cores on one node, in a single process:
#SBATCH --cpus-per-task=6
Parallel job
To request 4 nodes and run one task on each node, where each task uses 40 cores:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40
To request 4 nodes with 10 tasks per node (the default is 1 core per task, unless you use --cpus-per-task to set it manually):
#SBATCH --nodes=4 --ntasks-per-node=10
Compute nodes on the Pitzer cluster have 40 or 48 cores per node. A job can be constrained to run only on 40-core (or 48-core) nodes by using --constraint:
#SBATCH --constraint=40core
GPU job
To request 2 nodes with 2 GPUs each (2-GPU nodes are only available on Pitzer):
#SBATCH --nodes=2
#SBATCH --gpus-per-node=2
To request one node with 12 cores and 2 GPUs:
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=6
#SBATCH --gpus-per-node=2
Memory
The memory limit is the total amount of memory needed across all nodes. There is no need to specify a memory limit unless you need a large-memory node or your memory requirements are disproportionate to the number of cores you are requesting. For parallel jobs you must multiply the memory needed per node by the number of nodes to get the correct limit; you should usually request whole nodes and omit the memory limit.
The default unit is megabytes; values are usually expressed as megabytes (--mem=4000MB) or gigabytes (--mem=4GB).
To request 4GB memory (see note below):
#SBATCH --mem=4gb
or
#SBATCH --mem=4000mb
To request 24GB memory:
#SBATCH --mem=24000mb
Note: The amount of memory available per node is slightly less than the nominal amount. If you want to request a fraction of the memory on a node, we recommend you give the amount in MB, not GB; 24000MB is less than 24GB, since 24GB is interpreted as 24 × 1024MB = 24576MB. (Powers of 2 vs. powers of 10 -- ask a computer science major.)
Software licenses
If you are using a software package with a limited number of licenses, you should include the license requirement in your script. See the OSC documentation for the specific software package for details.
Example requesting five abaqus licenses:
#SBATCH --licenses=abaqus@osc:5
Job name
You can optionally give your job a meaningful name. The default is the name of the batch script, or just "sbatch" if the script is read on sbatch's standard input. The job name is used as part of the name of the job log files; it also appears in lists of queued and running jobs. The name may be up to 15 characters in length, no spaces are allowed, and the first character must be alphabetic.
Example:
#SBATCH --job-name=my_first_job
Mail options
You may choose to receive email when your job begins, when it ends, and/or when it fails. The email will be sent to the address we have on record for you. You should use only one --mail-type=<type>
directive and include all the options you want.
To receive an email when your job begins, ends or fails:
#SBATCH --mail-type=BEGIN,END,FAIL
To receive an email for all types:
#SBATCH --mail-type=ALL
The default email recipient is the submitting user, but you can include other users or email addresses:
#SBATCH --mail-user=osu1234,osu4321,username@osu.edu
Job log files
By default, Slurm directs both standard output and standard error to one log file. For job 123456, the log file will be named slurm-123456.out. You can specify a different name for the log file:
#SBATCH --output=myjob.out.%j
where the %j
is replaced by the job ID.
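If you prefer to keep standard output and standard error in separate files, you can add an --error directive alongside --output (both are standard Slurm options):
#SBATCH --output=myjob.out.%j
#SBATCH --error=myjob.err.%j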
Identify Project
Job scripts are required to specify a project account.
Get a list of current projects by using the OSCfinger
command and looking in the SLURM accounts section:
OSCfinger userex
Login: userex
Name: User Example
Directory: /users/PAS1234/userex (CREATED)
Shell: /bin/bash
E-mail: user-ex@osc.edu
Contact Type: REGULAR
Primary Group: pas1234
Groups: pas1234,pas4321
Institution: Ohio Supercomputer Center
Password Changed: Dec 11 2020 21:05
Password Expires: Jan 12 2021 01:05 AM
Login Disabled: FALSE
Password Expired: FALSE
SLURM Enabled: TRUE
SLURM Clusters: owens,pitzer
SLURM Accounts: pas1234,pas4321 <<===== Look at me !!
SLURM Default Account: pas1234
Current Logins:
To specify an account use:
#SBATCH --account=PAS4321
For more details on errors you may see when submitting a job, see messages from sbatch.
Executable section
The executable section of your script comes after the header lines. The content of this section depends entirely on what you want your job to do. We mention just two commands that you might find useful in some circumstances. They should be placed at the top of the executable section if you use them.
Command logging
The set -x
command (set echo
in csh) is useful for debugging your script. It causes each command in the batch file to be printed to the log file as it is executed, with a +
in front of it. Without this command, only the actual display output appears in the log file.
To echo commands in bash or ksh:
set -x
To echo commands in tcsh or csh:
set echo on
Signal handling
Slurm sends signals to gracefully and then forcibly kill a job in various circumstances, for example when it runs out of walltime or is killed due to running out of memory. In both cases, the job may stop before all the commands in the job script can be executed.
The sbatch flag --signal can be used to have Slurm send a signal to the job a specified time before it is killed, so that the script can trap the signal and run cleanup commands.
Below is an example:
#!/bin/bash
#SBATCH --job-name=minimal_trap
#SBATCH --time=2:00
#SBATCH --nodes=1 --ntasks-per-node=1
#SBATCH --output=%x.%A.log
#SBATCH --signal=B:USR1@60
function my_handler() {
    echo "Catching signal"
    touch $SLURM_SUBMIT_DIR/job_${SLURM_JOB_ID}_caught_signal
    cd $SLURM_SUBMIT_DIR
    mkdir $SLURM_JOB_ID
    cp -R $TMPDIR/* $SLURM_JOB_ID
    exit
}

trap my_handler USR1
trap my_handler TERM

my_process &
wait
Signal handling is typically used to copy output files from a temporary directory to a home or project directory. The example above creates a directory in $SLURM_SUBMIT_DIR and copies everything from $TMPDIR into it. The handler executes only if the job terminates abnormally; in some cases, even with signal handling, the job may still be unable to execute the handler. Starting the process in the background with & and then calling wait is needed so that the user-defined signal can be received. See the signal handling in Slurm section of the Slurm migration issues page for details. For other details on retrieving files from unexpectedly terminated jobs, see this FAQ.
Considerations for parallel jobs
Each processor on our system is fast, but the real power of supercomputing comes from putting multiple processors to work on a task. This section addresses issues related to multithreading and parallel processing as they affect your batch script. For a more general discussion of parallel computing see another document.
Multithreading involves a single process, or program, that uses multiple threads to take advantage of multiple cores on a single node. The most common approach to multithreading on HPC systems is OpenMP. The threads of a process share a single memory space.
The more general form of parallel processing involves multiple processes, usually copies of the same program, which may run on a single node or on multiple nodes. These processes have separate memory spaces. When they need to communicate or share data, these processes typically use the Message-Passing Interface (MPI).
A program may use multiple levels of parallelism, employing MPI to communicate between nodes and OpenMP to utilize multiple processors on each node.
For more details on building and running MPI/OpenMP software, see the programming environment pages for the Pitzer cluster and the Owens cluster.
Script issues in parallel jobs
In a parallel job your script executes on just the first node assigned to the job, so it’s important to understand how to make your job execute properly in a parallel environment. These notes apply to jobs running on multiple nodes.
You can think of the commands (executable lines) in your script as falling into four categories.
- Commands that affect only the shell environment. These include such things as cd, module, and export (or setenv). You don't have to worry about these. The commands are executed on just the first node, but the batch system takes care of transferring the environment to the other nodes.
- Commands that you want to have execute on only one node. These might include date or echo. (Do you really want to see the date printed 20 times in a 20-node job?) They might also include cp if your parallel program expects files to be available only on the first node. You don't have to do anything special for these commands.
- Commands that have parallel execution, including knowledge of the batch system, built in. These include sbcast (parallel file copy) and some application software installed by OSC. You should consult the software documentation for correct parallel usage of application software.
- Any other command or program that you want to have execute in parallel must be run using srun. Otherwise, it will run on only one node, while the other nodes assigned to the job will remain idle. See the sketch and examples below.
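The sketch below (the program name a.out and the module name are placeholders) illustrates the four categories in a multi-node job script:
# 1. Shell environment only: runs on the first node, but Slurm propagates the environment
module load intel
export OMP_NUM_THREADS=1
cd $TMPDIR

# 2. Commands meant to run once: nothing special is needed
echo "Job $SLURM_JOB_ID starting on $(hostname)"

# 3. Batch-system-aware parallel commands
sbcast $SLURM_SUBMIT_DIR/a.out $TMPDIR/a.out

# 4. Anything else that should run in parallel must go through srun
srun ./a.out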
srun
The srun command runs a parallel job on a cluster managed by Slurm. It is highly recommended to use srun when you run a parallel job with the MPI libraries installed at OSC, including MVAPICH2, Intel MPI, and OpenMPI.
The srun command has the form:
srun [srun-options] progname [prog-args]
where srun-options
is a list of options to srun
, progname
is the program you want to run, and prog-args
is a list of arguments to the program. Note that if the program is not in your path or not in your current working directory, you must specify the path as part of the name.
By default, srun
runs as many copies of progname
as there are tasks assigned to the job. For example, if your job requested --ntasks-per-node=8
, the following command would run 8 a.out
processes (with one core per task by default):
srun a.out
The example above can be modified to pass arguments to a.out
. The following example shows two arguments:
srun a.out abc.dat 123
If the program is multithreaded, or if it uses a lot of memory, it may be desirable to run fewer processes per node. You can specify --ntasks-per-node and --cpus-per-task to do this. Modifying the above example to request --nodes=4, the following would run 8 copies of a.out, two on each node:
# start 2 tasks on each node, and each task is allocated 20 cores
srun --ntasks-per-node=2 --cpus-per-task=20 a.out abc.dat 123
System commands can also be run with srun
. The following commands create a directory named data
in the $TMPDIR
directory on each node:
cd $TMPDIR
srun -n $SLURM_JOB_NUM_NODES --ntasks-per-node=1 mkdir data
sbcast and sgather
If you use $TMPDIR
in a parallel job, you probably want to copy files to or from all the nodes. The sbcast
and sgather
commands are used for this task.
To copy one file into the directory $TMPDIR
on all nodes allocated to your job:
sbcast myprog $TMPDIR/myprog
To copy one file from the directory $TMPDIR
on all nodes allocated to your job:
sgather -k $TMPDIR/mydata all_data
where the option -k keeps the file on each node, and all_data is the name of the file to be created with the source node name appended; you will see files such as all_data.node1_name and all_data.node2_name in the current working directory.
To recursively copy a directory from all nodes to the directory where the job is submitted:
sgather -k -r $TMPDIR $SLURM_SUBMIT_DIR/mydata
where mydata is the name of the directory to be created, with the source node name appended.
Environment variables for MPI
If your program combines MPI and OpenMP (or another multithreading technique), you should disable processor affinity by setting the environment variable MV2_ENABLE_AFFINITY
to 0 in your script. If you don’t disable affinity, all your threads will run on the same core, negating any benefit from multithreading.
To set the environment variable in bash, include this line in your script:
export MV2_ENABLE_AFFINITY=0
To set the environment variable in csh, include this line in your script:
setenv MV2_ENABLE_AFFINITY 0
Environment variables for OpenMP
The number of threads used by an OpenMP program is typically controlled by the environment variable $OMP_NUM_THREADS
. If this variable isn't set, the number of threads defaults to the number of cores you requested per node, although it can be overridden by the program.
If your job runs just one process per node and is the only job running on the node, the default behavior is what you want. Otherwise, you should set $OMP_NUM_THREADS to a value that ensures the total number of threads for all your processes on the node does not exceed the number of cores your job requested on that node.
For example, to set the environment variable to a value of 40 in bash, include this line in your script:
export OMP_NUM_THREADS=40
For example, to set the environment variable to a value of 40 in csh, include this line in your script:
setenv OMP_NUM_THREADS 40
Note: Some programs ignore $OMP_NUM_THREADS
and determine the number of threads programmatically.
Batch script examples
Simple sequential job
The following is an example of a single-task sequential job that uses $TMPDIR
as its working area. It assumes that the program mysci
has already been built. The script copies the program and its input file from the submission directory into $TMPDIR
, runs the code in $TMPDIR
, and copies the output files back to the original directory.
#!/bin/bash
#SBATCH --account=pas1234
#SBATCH --job-name=myscience
#SBATCH --time=40:00:00

# Copy the program and its input file to fast local storage on the compute node
cp mysci mysci.in $TMPDIR
cd $TMPDIR
/usr/bin/time ./mysci > mysci.hist
# Copy the results back to the directory the job was submitted from
cp mysci.hist mysci.out $SLURM_SUBMIT_DIR
Serial job with OpenMP
The following example runs a multi-threaded program with 8 cores:
#!/bin/bash
#SBATCH --account=pas1234
#SBATCH --job-name=my_job
#SBATCH --time=1:00:00
#SBATCH --ntasks-per-node=8

cp a.out $TMPDIR
cd $TMPDIR
export OMP_NUM_THREADS=8
./a.out > my_results
cp my_results $SLURM_SUBMIT_DIR
Simple parallel job
Here is an example of a parallel job that uses 4 nodes, running one process per core. To illustrate the module command, this example assumes a.out
was built with the GNU compiler. The module swap
command is necessary when running MPI programs built with a compiler other than Intel.
#!/bin/bash
#SBATCH --account=pas1234
#SBATCH --job-name=my_job
#SBATCH --time=10:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=28

module swap intel gnu
sbcast a.out $TMPDIR/a.out
cd $TMPDIR
srun a.out
sgather -k -r $TMPDIR $SLURM_SUBMIT_DIR/my_mpi_output
--ntasks-per-node is set here based on a compute node in the Owens cluster with 28 cores. Make sure to refer to the core counts of other clusters and node types when adjusting this value. Cluster computing would be a good place to start.
Parallel job with MPI and OpenMP
This example is a hybrid (MPI + OpenMP) job. It runs MPI processes with X threads per process, where the number of processes per node times X must be less than or equal to the number of physical cores per node (see the note below). Here, two MPI processes run on each node with 14 threads each, filling a 28-core Owens node. The assumption is that the code was written to support multilevel parallelism. The executable is named hybrid-program.
#!/bin/bash
#SBATCH --account=pas1234
#SBATCH --job-name=my_job
#SBATCH --time=20:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=28
export OMP_NUM_THREADS=14
export MV2_CPU_BINDING_POLICY=hybrid
sbcast hybrid-program $TMPDIR/hybrid-program
cd $TMPDIR
srun --ntasks-per-node=2 --cpus-per-task=14 hybrid-program
sgather -k -r $TMPDIR $SLURM_SUBMIT_DIR/my_hybrid_output
Note that compute nodes on the Pitzer cluster have 40 or 48 cores per node and compute nodes on the Owens cluster have 28 cores per node. If you want X to be all physical cores per node, independent of cluster, use the environment variable SLURM_CPUS_ON_NODE:
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE