This section summarizes two groups of batch-related commands: commands that are run on the login nodes to manage your jobs and commands that are run only inside a batch script. Only the most common options are described here.
Many of these commands are discussed in more detail elsewhere in this document. All have online manual pages (example: man sbatch) unless otherwise noted.
In describing the usage of the commands we use square brackets [like this] to indicate optional arguments. The brackets are not part of the command.
Important note: The batch systems on Pitzer, Ruby, and Owens are entirely separate. Be sure to submit your jobs on a login node for the system you want them to run on. All monitoring while the job is queued or running must be done on that same system. Your job output, of course, will be visible from all of these systems.
Commands for managing your jobs
These commands are typically run from a login node to manage your batch jobs. The batch systems on Pitzer, Ruby, and Owens are completely separate, so the commands must be run on the system where the job is to be run.
sbatch
The sbatch command is used to submit a job to the batch system.
Usage | Description | Example |
---|---|---|
sbatch [ options ] script | Submit a script for a batch job. The options list is rarely used but can augment or override the directives in the header lines of the script. | sbatch sim.job |
sbatch -a array_request [ options ] script | Submit an array of jobs. | sbatch -a 1-100 sim.job |
sinteractive [ options ] | Submit an interactive batch job. | sinteractive -n 4 |
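For reference, a minimal job script might look something like the following sketch; the resource requests, file names, and the executable (sim.exe) are placeholders for illustration, not part of any particular example:
#!/bin/bash
# request 1 node with 4 tasks (cores) for up to 1 hour
#SBATCH --job-name=sim
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=1:00:00

cd $TMPDIR                                # run in node-local scratch space
cp $SLURM_SUBMIT_DIR/input.dat .          # hypothetical input file
$SLURM_SUBMIT_DIR/sim.exe input.dat       # hypothetical executable in the submission directory
cp output.dat $SLURM_SUBMIT_DIR           # copy results back to the submission directory
Saving this as sim.job, it would then be submitted with sbatch sim.job.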
squeue
The squeue command is used to display the status of batch jobs.
Usage | Description | Example |
---|---|---|
squeue | Display all jobs currently in the batch system. | squeue |
squeue -j jobid | Display information about job jobid. The -j flag uses an alternate format. | squeue -j 123456 |
squeue -j jobid -l | Display long status information about job jobid. | squeue -j 123456 -l |
squeue -u username [-l] | Display information about all the jobs belonging to user username. | squeue -u usr1234 |
scancel
The scancel command may be used to delete a queued or running job.
Usage | Description | Example |
---|---|---|
scancel jobid | Delete job jobid. | scancel 123456 |
scancel jobid | Delete all jobs in job array jobid. | scancel 123456 |
scancel jobid_jobnumber | Delete job jobnumber within job array jobid. | scancel 123456_14 |
slurm output file
By default, the stdout and stderr of a running job are written to a single output file, which can be viewed to check the job's output while it runs. The file is located in the directory where the job was submitted and is named slurm-<jobid>.out.
The output file can be renamed and saved in any valid directory by passing the --output=<filename pattern> option to the sbatch command at job submission. For example:
sbatch --output=$HOME/test_slurm.out <job-script>      works
#SBATCH --output=$HOME/test_slurm.out      does NOT work in a job script
See slurm migration issues for details.
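The filename pattern can also include replacement symbols; %j, for example, expands to the job ID. A sketch (the path and script name here are placeholders, and the directory must already exist):
sbatch --output=$HOME/logs/sim-%j.out sim.job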
scontrol
The scontrol command may be used to modify the attributes of a queued (not running) job. Not all attributes can be altered.
Usage | Description | Example |
---|---|---|
scontrol update jobid=<jobid> [ option ] | Alter one or more attributes of a queued job. The options you can modify are a subset of the directives that can be used when submitting a job. | |
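For example, to reduce the time limit of a queued job (the job ID and new limit below are only illustrative), a command along these lines should work:
scontrol update jobid=123456 timelimit=2:00:00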
scontrol hold/release
The scontrol hold command allows you to place a hold on a queued job. The job will be prevented from running until you release the hold with the scontrol release command.
Usage | Description | Example |
---|---|---|
scontrol hold jobid | Place a user hold on job jobid. | scontrol hold 123456 |
scontrol release jobid | Release a user hold previously placed on job jobid. | scontrol release 123456 |
scontrol show
The scontrol show command can be used to provide details about a job that is running.
scontrol show job=$SLURM_JOB_ID
Usage | Description | Example |
---|---|---|
scontrol show job=<jobid> | Check the details of a running job. | scontrol show job=123456 |
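As a small sketch, the same command can be placed inside a job script to record the job's allocated resources for later reference (the output filename here is arbitrary):
# save the running job's details alongside the results
scontrol show job=$SLURM_JOB_ID > job_details_$SLURM_JOB_ID.txt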
estimating start time
The squeue command can try to estimate when a queued job will start running. The estimate is extremely unreliable, often off by a large amount in either direction.
Usage | Description | Example |
---|---|---|
squeue -j jobid --Format=username,jobid,account,startTime | Display estimate of start time. | squeue -j 123456 --Format=username,jobid,account,startTime |
Commands used only inside a batch job
These commands can only be used inside a batch job.
srun
Generally used to launch an MPI program during a job. It accepts most of the options that are also available to the sbatch command.
Usage | Example |
---|---|
srun <prog> | srun --ntasks-per-node=4 a.out |
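A minimal sketch of srun inside a job script; the resource requests and a.out are placeholders for your own MPI executable:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=0:30:00

# launch 8 MPI tasks (2 nodes x 4 tasks per node)
srun ./a.out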
sbcast/sgather
Tool for copying files to/from all nodes allocated in a job.
Usage |
---|
sbcast <src_file> <nodelocaldir>/<dest_file> |
sgather <src_file> <shareddir>/<dest_file> |
sgather -r <src_dir> <shareddir>/<dest_dir> |
Note: sbcast does not have a recursive option, meaning you can't use sbcast -r to scatter multiple files in a directory. Instead, you may use a loop similar to this:
cd /path/to/source_directory        # directory containing the files to scatter
for FILE in *
do
    # -p preserves modification times and modes
    sbcast -p "$FILE" $TMPDIR/some_directory/"$FILE"
done
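Going the other direction, a sketch of gathering a per-node results directory back to shared storage (the paths are placeholders); sgather appends each node's hostname to the destination name so the copies do not collide:
sgather -r $TMPDIR/results $SLURM_SUBMIT_DIR/results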
mpiexec
Use the mpiexec
command to run a parallel program or to run multiple processes simultaneously within a job. It is a replacement program for the script mpirun
, which is part of the mpich package.
The OSC version of mpiexec
is customized to work with our batch environment. There are other mpiexec programs in existence, but it is imperative that you use the one provided with our system.
Usage | Description | Example |
---|---|---|
mpiexec progname [ args ] | Run the executable program progname in parallel, with as many processes as there are processors (cores) assigned to the job (nodes*ppn). | mpiexec myprog |
mpiexec -ppn 1 progname [ args ] | Run only one process per node. | mpiexec -ppn 1 myprog |
mpiexec -ppn num progname [ args ] | Run the specified number of processes on each node. | mpiexec -ppn 3 myprog |
mpiexec -tv [ options ] progname [ args ] | Run the program with the TotalView parallel debugger. | mpiexec -tv myprog |
mpiexec -np num progname [ args ] | Run only the specified number of processes. (-n and -np are equivalent.) Does not spread processes out evenly across nodes. | mpiexec -n 3 myprog |
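Putting it together, a minimal sketch of a job script that runs an MPI program with mpiexec; the module name, core counts, and myprog are assumptions for illustration:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH --time=2:00:00

module load intel       # assumed: load the compiler/MPI environment the program was built with
cd $SLURM_SUBMIT_DIR

# run one process per assigned core across both nodes
mpiexec ./myprog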
pbsdcp
The pbsdcp command is a distributed copy command for the Slurm environment. It copies files to or from each node of the cluster assigned to your job. This is needed when copying files to directories which are not shared between nodes, such as $TMPDIR.
Options are -r for recursive and -p to preserve modification times and modes.
Usage | Description | Example |
---|---|---|
pbsdcp [-s] [ options ] srcfiles target | "Scatter". Copy one or more files from shared storage to the target directory on each node (local storage). The -s flag is optional. | |
pbsdcp -g [ options ] srcfiles target | "Gather". Copy the source files from each node to the shared target directory. Wildcards must be enclosed in quotes. | pbsdcp -g '$TMPDIR/outfile*' $PBS_O_WORKDIR |
Note: In gather mode, if files on different nodes have the same name, they will overwrite each other. In the -g example above, the file names may have the form outfile001, outfile002, etc., with each node producing a different set of files.
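As a sketch, a job script might scatter an input file to $TMPDIR on every node at the start and gather per-node output at the end; the file names, core counts, and myprog are placeholders, and $SLURM_SUBMIT_DIR is used here as the shared submission directory:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=1:00:00

cd $SLURM_SUBMIT_DIR

# scatter: copy the input file to $TMPDIR on every node assigned to the job
pbsdcp -s input.dat $TMPDIR

cd $TMPDIR
srun $SLURM_SUBMIT_DIR/myprog input.dat      # hypothetical program that writes outfile* on each node

# gather: copy per-node output files back to the shared submission directory (wildcard quoted)
pbsdcp -g '$TMPDIR/outfile*' $SLURM_SUBMIT_DIR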