Job Scripts

A job script, or PBS batch script, is a text file containing job setup information for the batch system followed by commands to be executed. It can be created using any text editor and may be given any name. Some people like to name their scripts something like myscript.job or myscript.pbs, but myscript works just as well.

A job script is simply a shell script. It consists of PBS directives, comments, and executable statements. The # character indicates a comment; lines beginning with #PBS, however, are interpreted as PBS directives. Blank lines can be included for readability.
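
As a minimal sketch (contents illustrative), a complete job script can be as short as:

#PBS -N hello
#PBS -l walltime=0:05:00
# An ordinary comment; the two lines above are PBS directives
echo "Hello from job $PBS_JOBID"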

PBS header lines

At the top of a PBS script are several lines starting with #PBS. These are PBS directives, or header lines. They provide job setup information used by PBS, including resource requests, email options, and more. The header lines may appear in any order, but they must precede any executable lines in your script. Alternatively, you may provide these directives (without the #PBS notation) on the qsub command line.
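
For example, resource requests like those shown later in this section could be given on the qsub command line instead of in the script (script name illustrative):

qsub -l walltime=20:00:00 -l nodes=1:ppn=12 myscript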

Resource limits

The -l option is used to request resources, including nodes, memory, time, and software flags, as described below.

Wall clock time

The walltime limit is the maximum time your job will be allowed to run, given in seconds or in the form hours:minutes:seconds. This is elapsed (wall clock) time, not CPU time. If your job exceeds the requested walltime, the batch system will kill it. If your job ends early, you will be charged only for the time used.

The default value for walltime is 1:00:00 (one hour).

To request 20 hours of wall clock time:

#PBS -l walltime=20:00:00

It is to your advantage to come up with a good estimate of the time your job will take. An underestimate will lead to your job being killed; a large overestimate may prevent your job from being backfilled, that is, fitted into an otherwise idle slot in the schedule.

Nodes

The nodes resource limit specifies not just the number of nodes but also the properties of those nodes. The properties are different on different clusters but may include the number of processors per node (ppn), the number of GPUs per node (gpus), and the type of node.

You should always specify the number of nodes explicitly. The default is nodes=1:ppn=1, but relying on the default fails under some circumstances.

Note the differences between Oakley and Glenn in the following examples.

To request a single processor (sequential job):

#PBS -l nodes=1:ppn=1

To request one whole node on Oakley:

#PBS -l nodes=1:ppn=12

To request one node with 8 cores on Glenn:

#PBS -l nodes=1:ppn=8

To request 4 whole nodes on Oakley:

#PBS -l nodes=4:ppn=12

To request 10 nodes with 2 GPUs each on Oakley:

#PBS -l nodes=10:ppn=12:gpus=2

To request 1 node with use of 6 cores and 1 GPU on Oakley:

#PBS -l nodes=1:ppn=6:gpus=1

To request 1 GPU node on Glenn (must request whole node):

#PBS -l nodes=1:ppn=8:gpu

Note: We recommend that parallel jobs always request full nodes (ppn=12 on Oakley, ppn=8 on Glenn). On Oakley, parallel jobs that request less than the full number of processors per node may have other jobs scheduled on their nodes; on Glenn, parallel jobs are always given full nodes. You can easily use just part of each node even if you request the whole thing (see the -npernode option on mpiexec).

Memory

The memory limit is the total amount of memory needed across all nodes. There is no need to specify a memory limit unless your memory requirements are disproportionate to the number of cores you are requesting or you need a large-memory node. For parallel jobs you must multiply the memory needed per node by the number of nodes to get the correct limit; you should usually request whole nodes and omit the memory limit.
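
As an illustration of the multiplication (values hypothetical), a job needing 12GB on each of 4 nodes would request 4 x 12GB = 48GB in total:

#PBS -l nodes=4:ppn=12
#PBS -l mem=48GB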

Default units are bytes, but values are usually expressed in megabytes (mem=4000MB) or gigabytes (mem=4GB).

To request 4GB memory (see note below):

#PBS -l mem=4GB

or

#PBS -l mem=4000MB

To request 24GB memory, perhaps with nodes=1:ppn=6 on Oakley:

#PBS -l mem=24000MB

Note: The amount of memory available per node is slightly less than the nominal amount. If you want to request a fraction of the memory on a node, we recommend you give the amount in MB, not GB. PBS interprets these units as powers of 2, so 24GB means 24 x 1024MB = 24576MB, and 24000MB is therefore less than 24GB.

Software licenses

If you are using a software package with a limited number of licenses, you should include the license requirement in your script. See the OSC documentation for the specific software package for details.

Example requesting five abaqus licenses:

#PBS -l software=abaqus+5

Job name

You can optionally give your job a meaningful name. If you don’t supply a name, the script file name is used. The job name is used as part of the name of the job log files; it also appears in lists of queued and running jobs. The name may be up to 15 characters in length, no spaces are allowed, and the first character must be alphabetic.

Example:

#PBS -N my_first_job

Mail options

You may choose to receive email when your job begins (b), when it ends (e), and/or when it is aborted by the batch system (a). The email will be sent to the address we have on record for you. You should use only one -m directive and include all the options you want.

To get email when your job ends or aborts:

#PBS -m ae

To get email when your job begins, ends or aborts:

#PBS -m abe

Job log files

By default, PBS returns two log files, one for the standard output stream (stdout), the other for the standard error stream (stderr). You can optionally join both data streams into a single log file. You can also optionally specify names for the log files.

For job 123456 with name my_first_job, the output log will be named my_first_job.o123456 and the error log will be named my_first_job.e123456.

To join the output and error log files, giving only my_first_job.o123456:

#PBS -j oe
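
To supply your own names for the log files, use the -o and -e options (file names illustrative):

#PBS -o my_first_job.log
#PBS -e my_first_job.err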

File space

Each node has local scratch disk space available to your job as $TMPDIR. On Oakley all nodes have the same amount. On Glenn some nodes have larger temporary file space. The only time you should explicitly request file space is when you need nodes with large temporary file space on Glenn.

Example:

#PBS -l file=1000gb

Shell

There is rarely a need to specify a shell. Your script will be executed using your default (login) shell unless you request a different shell.

For example, to have your job executed under ksh:

#PBS -S /bin/ksh

Executable section

The executable section of your script comes after the header lines. The content of this section depends entirely on what you want your job to do. We mention just two commands that you might find useful in some circumstances. They should be placed at the top of the executable section if you use them.

The “set -x” command (“set echo” in csh) is useful for debugging your script. It causes each command in the batch file to be printed to the log file as it is executed, with a + in front of it. Without this command, only the actual display output appears in the log file.

To echo commands in bash or ksh:

set -x

To echo commands in tcsh or csh:

set echo

The trap command allows you to specify a command to run in case your job terminates abnormally, for example if it runs out of wall time. It is typically used to copy output files from a temporary directory to a home or project directory. The following example creates a directory in $PBS_O_WORKDIR and copies everything from $TMPDIR into it. This executes only if the job terminates abnormally.

trap "cd $PBS_O_WORKDIR;mkdir $PBS_JOBID;cp -R $TMPDIR/* $PBS_JOBID" TERM

Considerations for parallel jobs

Each processor on our system is fast, but the real power of supercomputing comes from putting multiple processors to work on a task. This section addresses issues related to multithreading and parallel processing as they affect your batch script. For a more general discussion of parallel computing, see the separate documentation on that topic.

Multithreading involves a single process, or program, that uses multiple threads to take advantage of multiple cores on a single node. The most common approach to multithreading on HPC systems is OpenMP. The threads of a process share a single memory space.

The more general form of parallel processing involves multiple processes, usually copies of the same program, which may run on a single node or on multiple nodes. These processes have separate memory spaces. If they communicate or share data, it is most commonly done through the Message-Passing Interface (MPI).

A program may use multiple levels of parallelism, employing MPI to communicate between nodes and OpenMP to utilize multiple processors on each node.

Note: While many executables will run on both Oakley and Glenn, MPI programs must be built on the system they will run on. Most scientific programs will run faster if they are built on the system where they’re going to run.

Script issues in parallel jobs

In a parallel job your script executes on just the first node assigned to the job, so it’s important to understand how to make your job execute properly in a parallel environment. These notes apply to jobs running on multiple nodes.

You can think of the commands (executable lines) in your script as falling into four categories, illustrated in the sketch after this list.

  • Commands that affect only the shell environment. These include such things as cd, module, and export (or setenv). You don’t have to worry about these. The commands are executed on just the first node, but the batch system takes care of transferring the environment to the other nodes.
  • Commands that you want to have execute on only one node. These might include date or echo. (Do you really want to see the date printed 20 times in a 20-node job?) They might also include cp if your parallel program expects files to be available only on the first node. You don’t have to do anything special for these commands.
  • Commands that have parallel execution, including knowledge of the batch system, built in. These include pbsdcp (parallel file copy) and some application software installed by OSC. You should consult the software documentation for correct parallel usage of application software.
  • Any other command or program that you want to have execute in parallel must be run using mpiexec. Otherwise it will run on only one node, while the other nodes assigned to the job will remain idle. See examples below.
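
As a sketch (program and file names illustrative), a multi-node script might mix all four categories:

cd $PBS_O_WORKDIR            # category 1: affects only the shell environment
date                         # category 2: executes once, on the first node
pbsdcp input.dat $TMPDIR     # category 3: parallel-aware, copies to every node
cd $TMPDIR
mpiexec ./mycode input.dat   # category 4: runs in parallel via mpiexec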

mpiexec

The mpiexec command is used to run multiple copies of an executable program, usually (but not always) on multiple nodes. It is a replacement for the mpirun script which is part of the mpich package. Message-passing (MPI) programs must always be started with mpiexec.

Very important note: The mpiexec command installed at OSC is customized to work with the OSC environment and with our batch system. Other versions will not work correctly on our systems.

The mpiexec command has the form:

mpiexec [mpiexec-options] progname [prog-args]

where mpiexec-options is a list of options to mpiexec, progname is the program you want to run, and prog-args is a list of arguments to the program. Note that if the program is not in your path or your current working directory, you must specify the path as part of the name.
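
For example, to run a program by giving an explicit path (path illustrative):

mpiexec $HOME/bin/a.out abc.dat 123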

By default, mpiexec runs as many copies of progname as there are processors (cores) assigned to the job (nodes x ppn). For example, if your job requested nodes=4:ppn=12, the following command will run 48 a.out processes:

mpiexec a.out

The example above can be modified to pass arguments to a.out. The following example shows two arguments:

mpiexec a.out abc.dat 123

If your program is multithreaded, or if it uses a lot of memory, it may be desirable to run just one process per node. The -pernode option does this. Modifying the above example again, the following example would run 4 copies of a.out, one on each node:

mpiexec -pernode a.out abc.dat 123

You can specify how many processes to run per node using the -npernode option. You cannot specify more processes per node than the number of cores your job requested per node (ppn value). To run 2 processes per node:

mpiexec -npernode 2 a.out abc.dat 123

It is also possible to specify the total number of processes to run using the -n or -np option. (These are the same thing.) This option is useful primarily for single-node jobs because it does not necessarily spread the processes out evenly over all the nodes. For example, if your job requested nodes=1:ppn=12, the following command will run 4 a.out processes:

mpiexec -n 4 a.out abc.dat 123

The -tv option on mpiexec runs your program with the TotalView parallel debugger. For example, assuming nodes=4:ppn=12, the following command lets you debug your program a.out with one process per node and the arguments given:

mpiexec -tv -pernode a.out abc.dat 123

System commands can also be run with mpiexec. The following commands create a directory named data in the $TMPDIR directory on each node:

cd $TMPDIR
mpiexec -pernode mkdir data

pbsdcp

If you use $TMPDIR in a parallel job, you will probably want to copy files to or from all the nodes in your job. The pbsdcp (“PBS Distributed Copy”) command is used for this task.

The following examples illustrate how to copy two files, a directory (recursively), and all files starting with “model” from your current directory to all the nodes assigned to your job.

pbsdcp file1 file2 $TMPDIR
pbsdcp -r dir1 $TMPDIR
pbsdcp model* $TMPDIR

The following example illustrates how to copy all files starting with “outfile” from all the nodes assigned to your job back to the directory you submitted your job from. The files from all the nodes will be placed into a single directory; you should name them differently to avoid name collisions. The quotes are necessary in gather mode (-g) if you use a wildcard (*) in your file name.

pbsdcp -g '$TMPDIR/outfile*' $PBS_O_WORKDIR

Environment variables for MPI

If your program combines MPI and OpenMP (or another multithreading technique), you should disable processor affinity by setting the environment variable $MV2_ENABLE_AFFINITY to 0 in your script. If you don’t disable affinity, all your threads will run on the same core, negating any benefit from multithreading.

This does not apply if you are using MPI-1, which is available only on Glenn.

To set the environment variable in bash, include this line in your script:

export MV2_ENABLE_AFFINITY=0

To set the environment variable in csh, include this line in your script:

setenv MV2_ENABLE_AFFINITY 0

Environment variables for OpenMP

The number of threads used by an OpenMP program is typically controlled by the environment variable $OMP_NUM_THREADS. If this variable isn't set, the number of threads defaults to the following, although it can be overridden by the program:

  • on Oakley: the number of cores you requested per node (ppn value)
  • on Glenn: the total number of cores in the node

If your job runs just one process per node and is the only job running on the node, the default behavior is what you want. Otherwise you should set $OMP_NUM_THREADS to a value that ensures that the total number of threads for all your processes on the node does not exceed the ppn value your job requested.
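
As a sketch (program name illustrative), a job that requested nodes=2:ppn=12 and runs two MPI processes per node should give each process six threads, so that 2 x 6 = 12 threads run per node:

export OMP_NUM_THREADS=6
mpiexec -npernode 2 ./mycode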

For example, to set the environment variable to a value of 12 in bash, include this line in your script:

export OMP_NUM_THREADS=12

To set the same value in csh, include this line in your script:

setenv OMP_NUM_THREADS 12

Note: Some programs ignore $OMP_NUM_THREADS and determine a number of threads programmatically.

Batch script examples

Simple sequential job

The following is an example of a single-processor sequential job that uses $TMPDIR as its working area. It assumes that the program mysci has already been built. The script copies the program and its input file from the directory the qsub command was called from into $TMPDIR, runs the code in $TMPDIR, and copies the output files back to the original directory.

#PBS -N myscience
#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -j oe
cd $PBS_O_WORKDIR
cp mysci mysci.in $TMPDIR    # copy both the program and its input file
cd $TMPDIR
/usr/bin/time ./mysci > mysci.hist
cp mysci.hist mysci.out $PBS_O_WORKDIR

Serial job with OpenMP multithreading

This example uses 1 node with 12 cores, which is suitable for Oakley. A similar job on Glenn would use 8 cores; the OMP_NUM_THREADS environment variable would also be set to 8. A program must be written to take advantage of multithreading for this to work.

#PBS -N my_job
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=12
#PBS -j oe
cd $PBS_O_WORKDIR
cp a.out $TMPDIR             # copy the executable to the local scratch area
cd $TMPDIR
export OMP_NUM_THREADS=12
./a.out > my_results
cp my_results $PBS_O_WORKDIR

Simple parallel job

Here is an example of an MPI job that uses 4 nodes with 12 cores each, running one process per core (48 processes total). This assumes a.out was built with the gnu compiler in order to illustrate the module command. The module swap is necessary on Oakley when running MPI programs built with a compiler other than Intel.

#PBS -N my_job
#PBS -l walltime=10:00:00
#PBS -l nodes=4:ppn=12
#PBS -j oe
module swap intel gnu
cd $PBS_O_WORKDIR
pbsdcp a.out $TMPDIR
cd $TMPDIR
mpiexec a.out
pbsdcp -g '$TMPDIR/results*' $PBS_O_WORKDIR

Parallel job with MPI and OpenMP

This example is a hybrid MPI/OpenMP job. It runs one MPI process per node with 12 threads per process. The assumption here is that the code was written to support multilevel parallelism. The executable is named hybridprogram.

#PBS -N my_job
#PBS -l walltime=20:00:00
#PBS -l nodes=4:ppn=12
#PBS -j oe
export OMP_NUM_THREADS=12
export MV2_ENABLE_AFFINITY=0
cd $PBS_O_WORKDIR
pbsdcp hybridprogram $TMPDIR
cd $TMPDIR
mpiexec -pernode hybridprogram
pbsdcp -g '$TMPDIR/results*' $PBS_O_WORKDIR