Batch Processing at OSC

The only access to significant resources on the HPC machines is through the batch process. This guide will provide an overview of OSC's computing environment, and provide some instruction for how to use the batch system to accomplish your computing goals. If you need additional assistance, please do not hesitate to contact OSC Help.

Batch System Concepts

Why use a batch system?

Access to the OSC clusters is through a system of login nodes. These nodes are reserved solely for the purpose of managing your files and submitting jobs to the batch system. Acceptable activities include editing/creating files, uploading and downloading files of moderate size, and managing your batch jobs. You may also compile and link small-to-moderate size programs on the login nodes.

CPU time and memory usage are severely limited on the login nodes. There are typically many users on the login nodes at one time. Extensive calculations would degrade the responsiveness of those nodes.

The batch system allows users to submit jobs requesting the resources (nodes, processors, memory, GPUs) that they need. The jobs are queued and then run as resources become available. The scheduling policies in place on the system are an attempt to balance the desire for short queue waits against the need for efficient system utilization.

Interactive vs. batch

When you type commands in a login shell and see a response displayed, you are working interactively. To run a batch job, you put the commands into a text file instead of typing them at the prompt. You submit this file to the batch system, which will run it as soon as resources become available. The output you would normally see on your display goes into a log file. You can check the status of your job interactively and/or receive emails when it begins and ends execution.

Terminology

The batch system used at OSC is PBS, “Portable Batch System”. It consists of a resource manager, Torque, and a scheduler, Moab. You’ll need to understand the terms cluster, node, and processor (core) in order to request resources for your job. See the Introduction to HPC if you need this background information.

The words “parallel” and “serial” as used by PBS can be a little misleading. From the point of view of the batch system a serial job is one that uses just one node, regardless of how many processors it uses on that node. Similarly, a parallel job is one that uses more than one node. More standard terminology considers a job to be parallel if it involves multiple processes.

Batch processing overview

Here is a very brief overview of how to use the batch system. See the rest of this document for more information.

Choose a cluster

Before you start preparing a job script you should decide which cluster you want your job to run on, Oakley or Glenn. This decision will probably be based on the resources available on each system. Remember which cluster you’re using because the batch systems are independent.

Prepare a job script

Your job script is a text file that includes PBS directives as well as the commands you want executed. The directives tell the batch system what resources you need, among other things. The commands can be anything you would type at the login prompt. You can prepare the script using any editor.

Submit the job

You submit your job to the batch system using the qsub command, with the name of the script file as the argument. The qsub command responds with the job ID that was given to your job, typically a 6- or 7-digit number.
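
As a rough illustration, the sketch below submits a script and keeps the job ID that qsub prints. The script name myscript.job and the helper job_number are hypothetical. The helper assumes that if the ID carries a suffix, such as a server name, the suffix follows the first dot; that may not hold on every system.

```shell
# job_number ID: print the part of a job ID before the first dot.
# Assumption: any suffix (for example a server name) is separated
# from the number by a dot.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Submit only where qsub exists (that is, on a login node).
# myscript.job is a placeholder for your own job script.
if command -v qsub >/dev/null 2>&1 && [ -f myscript.job ]; then
    jobid=$(qsub myscript.job)
    echo "Submitted job $(job_number "$jobid")"
fi
```

Saving the ID in a variable makes it easy to refer to the job later, for example when looking for its log file.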

Wait for the job to run

Your job may wait in the queue for minutes or days before it runs, depending on system load and the resources requested. It may then run for minutes or days. You can monitor your job’s progress or just wait for an email telling you it has finished.

Retrieve your output

The log file (screen output) from your job will be in the directory you submitted the job from by default. Any other output files will be wherever your script put them.

Batch Execution Environment

Shell and initialization

Your batch script executes in a shell on a compute node. The environment is identical to what you get when you connect to a login node except that you have access to all the resources requested by your job. By default, the script is executed using the same shell that you get when you log in (bash, tcsh, etc.). The appropriate “dot-files” (.login, .profile, .cshrc) will be executed, the same as when you log in. (For information on overriding the default shell, see the Job Scripts section.)

Execution begins in your home directory, regardless of what directory your script resides in or where you submitted the job from. You can use the cd command to change to a different directory. The environment variable $PBS_O_WORKDIR makes it easy to return to the directory from which you submitted the job:

cd $PBS_O_WORKDIR

Modules

There are dozens of software packages available on OSC’s systems, many of them with multiple versions. You control what software is available in your environment by loading the module for the software you need. Each module sets certain environment variables required by the software.

If you are running software that was installed by OSC, you should check the software documentation page to find out what modules to load.

The module systems on Oakley and Glenn are a little different, but the concepts are the same. Examples are given here for both systems.

Several modules are automatically loaded for you when you log in or start a batch script. These default modules differ somewhat between Oakley and Glenn, but they include

  • modules required by the batch system
  • a compiler suite (Intel compilers on Oakley, PGI compilers on Glenn)
  • an MPI package compatible with the default compiler (for parallel computing)

The module command has a number of subcommands. The most useful of these are documented here. For more details, type “module help”.

Certain modules are incompatible with each other and should never be loaded at the same time. Examples are different versions of the same software or multiple installations of a library built with different compilers. Oakley does a pretty good job of checking compatibility; unfortunately Glenn does not.

Note to those who build or install their own software: Be sure to load the same modules when you run your software that you had loaded when you built it, including the compiler module. This is particularly important on Oakley, but it is good practice on Glenn also.

Module system on Oakley

Each module on Oakley has both a name and a software version number. When more than one version is available for the same name, one of them is designated as the default. For example, the following modules are available for the Intel compilers on Oakley:

  • intel/12.1.0 (default)
  • intel/12.1.4.319

If you specify just the name, it refers to the default version or the currently loaded version, depending on the context. If you want a different version, you must give the entire string. Examples are given below.

On Oakley you can have only one compiler module loaded at a time, either intel, pgi, or gnu. The intel module is loaded initially; to change to pgi or gnu, do a module swap (see example below).

Some software libraries have multiple installations built for use with different compilers. The module system will load the one compatible with the compiler you have loaded. If you swap compilers, all the compiler-dependent modules will also be swapped.

Special note to gnu compiler users: While the gnu compilers are always in your path, you should load the gnu compiler module to ensure you are linking to the correct library versions.

To list the modules you have loaded:

module list

To see all modules that are compatible with your currently loaded modules:

module avail

To see compatible modules whose names start with fftw:

module avail fftw

To see all possible modules:

module spider

To see all possible modules whose names start with fftw:

module spider fftw

To load the fftw3 module that is compatible with your current compiler:

module load fftw3

To unload the fftw3 module:

module unload fftw3

To load the default version of the abaqus module (not compiler-dependent):

module load abaqus

To load a different version of the abaqus module:

module load abaqus/6.8-4

To unload whatever abaqus module you have loaded:

module unload abaqus

To swap the intel compilers for the pgi compilers (unloads intel, loads pgi):

module swap intel pgi

To swap the default version of the intel compilers for a different version:

module swap intel intel/12.1.4.319

To display help information for the mkl module:

module help mkl

To display the commands run by the mkl module:

module show mkl

To use a locally installed module on Oakley, first import the module directory:

module use [/path/to/modulefiles]

And then load the module:

module load localmodule

Module system on Glenn

The modules on Glenn have a software version number built into the name. Some modules have a shorter alternate name. For example, the following modules are available for the Intel compilers on Glenn:

  • intel-compilers-10.0.023
  • intel-compilers-10.0 (same as intel-compilers-10.0.023)
  • intel-compilers-11.1.056
  • intel-compilers-11.1 (same as intel-compilers-11.1.056)
  • intel-compilers-9.1

On Glenn you will typically have modules loaded for all the compiler suites (PGI, Intel, gnu).

Some software libraries have multiple installations built for use with different compilers. You must make certain that you have the correct modules loaded for compatibility with the compiler you are using.

To list the modules you have loaded:

module list

To see all modules that are available:

module avail

Same as above but restricted to modules whose names start with fftw:

module avail fftw

To load the fftw3 module compatible with the PGI compilers:

module load fftw3

To load the fftw3 module compatible with the gnu compilers:

module load fftw3-gnu

To load the default version of abaqus (not compiler-dependent):

module load abaqus

To swap the default mpi module, which works with the PGI compilers, for the mvapich2 module that works with the intel compilers:

module unload mpi
module load mvapich2-1.6-intel

Note: There is a “module swap” on Glenn, but it doesn’t always work correctly. It’s safer to unload one module and load the other one.

To display help information for the acml module:

module help acml

To display the commands run by the acml module:

module show acml

PBS environment variables

Your batch execution environment has all the environment variables that your login environment has plus several that are set by the batch system. This section gives examples for using some of them. For more information see “man qsub”.

Directories

Several directories may be useful in your job.

The absolute path of the directory your job was submitted from is $PBS_O_WORKDIR. Recall that your job always starts in your home directory. To get back to your submission directory:

cd $PBS_O_WORKDIR

Each job has a temporary directory, $TMPDIR, on the local disk of each node assigned to it. Access to this directory is much faster than access to your home or project directory. The files in this directory are not visible from all the nodes in a parallel job; each node has its own directory. The batch system creates this directory when your job starts and deletes it when your job ends. To copy file input.dat to $TMPDIR on your job’s first node:

cp input.dat $TMPDIR

To copy file input.dat to $TMPDIR on all your job’s nodes:

pbsdcp input.dat $TMPDIR

Each job has a temporary directory, $PFSDIR, on the parallel file system. This is a single directory shared by all the nodes a job is running on. Access is faster than access to your home or project directory but not as fast as $TMPDIR. The batch system creates this directory when your job starts and deletes it when your job ends. To copy the file output.dat from this directory to the directory you submitted your job from:

cp $PFSDIR/output.dat $PBS_O_WORKDIR

The $HOME environment variable refers to your home directory. It is not set by the batch system but is useful in some job scripts. It is better to use $HOME than to hardcode the path to your home directory. To access a file in your home directory:

cat $HOME/myfile

Informational variables

Several environment variables provide information about your job that may be useful.

A list of the nodes and cores assigned to your job is in the file $PBS_NODEFILE. To display this file:

cat $PBS_NODEFILE
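
The node file can also be used to work out how many nodes and cores a job received. The sketch below assumes the usual Torque layout of one line per assigned core, each line holding the host name of the node; the function names are illustrative.

```shell
# count_cores FILE: number of lines, assumed to be one per core.
count_cores() {
    wc -l < "$1" | tr -d ' '
}

# count_nodes FILE: number of distinct host names in the file.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Inside a batch job, report what was assigned.
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
    echo "nodes: $(count_nodes "$PBS_NODEFILE")  cores: $(count_cores "$PBS_NODEFILE")"
fi
```

Outside a batch job the variable is unset and the sketch prints nothing.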

For GPU jobs on Oakley, a list of the GPUs assigned to your job is in the file $PBS_GPUFILE. To display this file:

cat $PBS_GPUFILE

If you use a job array, each job in the array gets its identifier within the array in the variable $PBS_ARRAYID. To pass a file name parameterized by the array ID into your application:

./a.out input${PBS_ARRAYID}.dat

To display the numeric job identifier assigned by the batch system:

echo $PBS_JOBID

To display the job name:

echo $PBS_JOBNAME

Use fast storage

If your job does a lot of file-based input and output, your choice of file system can make a huge difference in the performance of the job.

Shared file systems

Your home and project directories are located on shared file systems, providing long-term storage that is accessible from all OSC systems. Shared file systems are relatively slow. They cannot handle heavy loads such as those generated by large parallel jobs or many simultaneous serial jobs. You should minimize the I/O your jobs do on the shared file systems. It is usually best to copy your input data to fast temporary storage, run your program there, and copy your results back to your home or project directory.
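
The copy-in, run, copy-out pattern described above might be sketched as a small shell function. The function name stage_and_run and the file names are hypothetical, and the sketch assumes a program that reads its input on stdin and writes its results to stdout.

```shell
# stage_and_run SUBMIT_DIR SCRATCH_DIR INPUT OUTPUT COMMAND [ARGS...]
# Copy INPUT from SUBMIT_DIR to SCRATCH_DIR, run COMMAND in SCRATCH_DIR
# reading INPUT on stdin and writing stdout to OUTPUT, then copy OUTPUT back.
stage_and_run() {
    sr_submit=$1; sr_scratch=$2; sr_in=$3; sr_out=$4
    shift 4
    cp "$sr_submit/$sr_in" "$sr_scratch/" || return 1
    ( cd "$sr_scratch" && "$@" < "$sr_in" > "$sr_out" ) || return 1
    cp "$sr_scratch/$sr_out" "$sr_submit/"
}

# In a job script this might be called as follows (my_program is a placeholder):
#   stage_and_run "$PBS_O_WORKDIR" "$TMPDIR" input.dat output.dat "$PBS_O_WORKDIR/my_program"
```

Only the two copies touch the shared file system; all I/O during the run goes to the scratch directory.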

Batch-managed directories

Batch-managed directories are temporary directories that exist only for the duration of a job. They exist on two types of storage: disks local to the compute nodes and a parallel file system.

A big advantage of batch-managed directories is that the batch system deletes them when a job ends, preventing clutter on the disk.

A disadvantage of batch-managed directories is that you can’t access them after your job ends. Be sure to include commands in your script to copy any files you need to long-term storage. To avoid losing your files if your job ends abnormally, for example by hitting its walltime limit, include a trap command in your script. The following example creates a subdirectory in $PBS_O_WORKDIR and copies everything from $TMPDIR into it in case of abnormal termination.

trap "cd $PBS_O_WORKDIR;mkdir $PBS_JOBID;cp -R $TMPDIR/* $PBS_JOBID" TERM

If a node your job is running on crashes, the trap command may not be executed. It may be possible to recover your batch-managed directories in this case. Contact OSC Help for assistance.
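
A somewhat more defensive variant of the trap shown above can be written with a handler function. This is only a sketch: the function name save_scratch is illustrative, and it assumes the script runs under bash or another POSIX-style shell.

```shell
# save_scratch: copy the contents of $TMPDIR into a new subdirectory of
# the submission directory, named after the job ID.
save_scratch() {
    [ -n "${TMPDIR:-}" ] && [ -n "${PBS_O_WORKDIR:-}" ] || return 1
    ss_dest=$PBS_O_WORKDIR/${PBS_JOBID:-unknown_job}
    mkdir -p "$ss_dest" && cp -R "$TMPDIR"/. "$ss_dest"/
}

# Run the handler if the batch system sends TERM, then stop the script.
trap 'save_scratch; exit 1' TERM
```

The handler refuses to run if either directory variable is unset, which avoids copying from or into an unintended location.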

Local disk space

The fastest storage is on a disk local to the node your job is running on, accessed through the environment variable $TMPDIR. The main drawback to local storage is that each node of a parallel job has its own directory and cannot access the files on other nodes. See also “Considerations for Parallel Jobs”.

Local disk space should be used only through the batch-managed directory created for your job. Please do not use /tmp directly because your files won’t be cleaned up properly.

Parallel file system

The parallel file system is faster than the shared file systems for large-scale I/O and can handle a much higher load. You should use it when your files must be accessible by all the nodes in your job and also when your files are too large for the local disk.

The parallel file system is efficient for reading and writing data in large blocks. It should not be used for I/O involving many small accesses.

The parallel file system is typically used through the batch-managed directory created for your job. The path for this directory is in the environment variable $PFSDIR.

You may also create a directory for yourself in /fs/lustre and use it the way you would use any other directory. You should name the directory with either your user name or your project ID. This directory will not be backed up; files are subject to deletion after some number of months (see policies for details).

Note: You should not copy your executable files to $PFSDIR. They should be run from your home or project directories or from $TMPDIR.

Job Scripts

A job script, or PBS batch script, is a text file containing job setup information for the batch system followed by commands to be executed. It can be created using any text editor and may be given any name. Some people like to name their scripts something like myscript.job or myscript.pbs, but myscript works just as well.

A job script is simply a shell script. It consists of PBS directives, comments, and executable statements. The # character indicates a comment, although lines beginning with #PBS are interpreted as PBS directives. Blank lines can be included for readability.

PBS header lines

At the top of a PBS script are several lines starting with #PBS. These are PBS directives or header lines. They provide job setup information used by PBS, including resource requests, email options, and more. The header lines may appear in any order, but they must precede any executable lines in your script. Alternatively you may provide these directives (without the #PBS notation) on the qsub command line.
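
To make the layout concrete, here is a minimal sketch of a script with its header lines first and its executable lines after. The resource values are arbitrary, and the fallback to the current directory is there only so that the sketch also runs outside a batch job.

```shell
#PBS -N my_first_job
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=1
#PBS -j oe

# Executable lines start here, after all the #PBS lines.
# Outside a batch job PBS_O_WORKDIR is unset, so fall back to the
# current directory.
start_dir=${PBS_O_WORKDIR:-$PWD}
cd "$start_dir"
echo "Running in $start_dir"
```

The same requests could instead be given on the command line, for example qsub -N my_first_job -l walltime=1:00:00 -l nodes=1:ppn=1 -j oe myscript, where myscript is a placeholder file name.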

Resource limits

The -l option is used to request resources, including nodes, memory, time, and software flags, as described below.

Wall clock time

The walltime limit is the maximum time your job will be allowed to run, given in seconds or hours:minutes:seconds. This is elapsed time. If your job exceeds the requested time, the batch system will kill it. If your job ends early, you will be charged only for the time used.

The default value for walltime is 1:00:00 (one hour).

To request 20 hours of wall clock time:

#PBS -l walltime=20:00:00

It is to your advantage to come up with a good estimate of the time your job will take. An underestimate will lead to your job being killed. A large overestimate may prevent your job from being backfilled, that is, fitted into an otherwise empty time slot.
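
Because the limit may be given either in seconds or as hours:minutes:seconds, it can help to convert between the two. The following is a small sketch of the conversion; the function name is illustrative.

```shell
# walltime_to_seconds H:M:S: print the equivalent number of seconds.
# awk reads each field as a decimal number, so "08" means eight.
walltime_to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

walltime_to_seconds 20:00:00
```

This prints 72000, the number of seconds in the 20-hour request shown above.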

Nodes

The nodes resource limit specifies not just the number of nodes but also the properties of those nodes. The properties are different on different clusters but may include the number of processors per node (ppn), the number of GPUs per node (gpus), and the type of node.

You should always specify a number of nodes. The default is nodes=1:ppn=1, but this fails under some circumstances.

Note the differences between Oakley and Glenn in the following examples.

To request a single processor (sequential job):

#PBS -l nodes=1:ppn=1

To request one whole node on Oakley:

#PBS -l nodes=1:ppn=12

To request one node with 8 cores on Glenn:

#PBS -l nodes=1:ppn=8

To request 4 whole nodes on Oakley:

#PBS -l nodes=4:ppn=12

To request 10 nodes with 2 GPUs each on Oakley:

#PBS -l nodes=10:ppn=12:gpus=2

To request 1 node with use of 6 cores and 1 GPU on Oakley:

#PBS -l nodes=1:ppn=6:gpus=1

To request 1 GPU node on Glenn (must request whole node):

#PBS -l nodes=1:ppn=8:gpu

Note: We recommend that parallel jobs always request full nodes (ppn=12 on Oakley, ppn=8 on Glenn). Parallel jobs that request less than the full number of processors per node may have other jobs scheduled on their nodes on Oakley. On Glenn, parallel jobs are always given full nodes. You can easily use just part of each node even if you request the whole thing (see the -npernode option on mpiexec).

Memory

The memory limit is the total amount of memory needed across all nodes. There is no need to specify a memory limit unless your memory requirements are disproportionate to the number of cores you are requesting or you need a large-memory node. For parallel jobs you must multiply the memory needed per node by the number of nodes to get the correct limit; you should usually request whole nodes and omit the memory limit.

Default units are bytes, but values are usually expressed in megabytes (mem=4000MB) or gigabytes (mem=4GB).

To request 4GB memory (see note below):

#PBS -l mem=4GB

or

#PBS -l mem=4000MB

To request 24GB memory, perhaps with nodes=1:ppn=6 on Oakley:

#PBS -l mem=24000MB

Note: The amount of memory available per node is slightly less than the nominal amount. If you want to request a fraction of the memory on a node, we recommend you give the amount in MB, not GB; 24000MB is less than 24GB. (Powers of 2 vs. powers of 10 -- ask a computer science major.)
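
The arithmetic behind this note can be checked in the shell. The sketch assumes, as the note implies, that the batch system counts 1 GB as 1024 MB; the function name is illustrative.

```shell
# gb_to_mb N: print N gigabytes expressed in megabytes,
# assuming 1 GB = 1024 MB (powers of 2).
gb_to_mb() {
    echo $(( $1 * 1024 ))
}

gb_to_mb 24
```

This prints 24576, so a request for 24000MB is 576 MB smaller than a request for 24GB.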

Software licenses

If you are using a software package with a limited number of licenses, you should include the license requirement in your script. See the OSC documentation for the specific software package for details.

Example requesting five abaqus licenses:

#PBS -l software=abaqus+5

Job name

You can optionally give your job a meaningful name. If you don’t supply a name, the script file name is used. The job name is used as part of the name of the job log files; it also appears in lists of queued and running jobs. The name may be up to 15 characters in length, no spaces are allowed, and the first character must be alphabetic.

Example:

#PBS -N my_first_job
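
The naming rules can be checked before you submit. The following is a sketch of such a check; the function valid_job_name is hypothetical and is not part of PBS.

```shell
# valid_job_name NAME: succeed if NAME is 1 to 15 characters long,
# starts with a letter, and contains no whitespace.
valid_job_name() {
    case $1 in
        '' | [!A-Za-z]* | *[[:space:]]*) return 1 ;;
    esac
    [ "${#1}" -le 15 ]
}

if valid_job_name my_first_job; then
    echo "my_first_job is an acceptable job name"
fi
```

A name such as 1st_job fails the check because its first character is not alphabetic.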

Mail options

You may choose to receive email when your job begins (b), when it ends (e), and/or when it is aborted by the batch system (a). The email will be sent to the address we have on record for you. You should use only one -m directive and include all the options you want.

To get email when your job ends or aborts:

#PBS -m ae

To get email when your job begins, ends or aborts:

#PBS -m abe

Job log files

By default, PBS returns two log files, one for the standard output stream (stdout), the other for the standard error stream (stderr). You can optionally join both data streams into a single log file. You can also optionally specify names for the log files.

For job 123456 with name my_first_job, the output log will be named my_first_job.o123456 and the error log will be named my_first_job.e123456.

To join the output and error log files, giving only my_first_job.o123456:

#PBS -j oe

File space

Each node has local scratch disk space available to your job as $TMPDIR. On Oakley all nodes have the same amount. On Glenn some nodes have larger temporary file space. The only time you should explicitly request file space is when you need nodes with large temporary file space on Glenn.

Example:

#PBS -l file=1000gb

Shell

There is rarely a need to specify a shell. Your script will be executed using your default (login) shell unless you request a different shell.

For example, to have your job executed under ksh:

#PBS -S /bin/ksh

Executable section

The executable section of your script comes after the header lines. The content of this section depends entirely on what you want your job to do. We mention just two commands that you might find useful in some circumstances. They should be placed at the top of the executable section if you use them.

The “set -x” command (“set echo” in csh) is useful for debugging your script. It causes each command in the batch file to be printed to the log file as it is executed, with a + in front of it. Without this command, only the actual display output appears in the log file.

To echo commands in bash or ksh:

set -x

To echo commands in tcsh or csh:

set echo on

The trap command allows you to specify a command to run in case your job terminates abnormally, for example if it runs out of wall time. It is typically used to copy output files from a temporary directory to a home or project directory. The following example creates a directory in $PBS_O_WORKDIR and copies everything from $TMPDIR into it. This executes only if the job terminates abnormally.

trap "cd $PBS_O_WORKDIR;mkdir $PBS_JOBID;cp -R $TMPDIR/* $PBS_JOBID" TERM

Considerations for parallel jobs

Each processor on our system is fast, but the real power of supercomputing comes from putting multiple processors to work on a task. This section addresses issues related to multithreading and parallel processing as they affect your batch script. For a more general discussion of parallel computing see another document.

Multithreading involves a single process, or program, that uses multiple threads to take advantage of multiple cores on a single node. The most common approach to multithreading on HPC systems is OpenMP. The threads of a process share a single memory space.

The more general form of parallel processing involves multiple processes, usually copies of the same program, which may run on a single node or on multiple nodes. These processes have separate memory spaces. If they communicate or share data, it is most commonly done through the Message-Passing Interface (MPI).

A program may use multiple levels of parallelism, employing MPI to communicate between nodes and OpenMP to utilize multiple processors on each node.

Note: While many executables will run on both Oakley and Glenn, MPI programs must be built on the system they will run on. Most scientific programs will run faster if they are built on the system where they’re going to run.

Script issues in parallel jobs

In a parallel job your script executes on just the first node assigned to the job, so it’s important to understand how to make your job execute properly in a parallel environment. These notes apply to jobs running on multiple nodes.

You can think of the commands (executable lines) in your script as falling into four categories.

  • Commands that affect only the shell environment. These include such things as cd, module, and export (or setenv). You don’t have to worry about these. The commands are executed on just the first node, but the batch system takes care of transferring the environment to the other nodes.
  • Commands that you want to have execute on only one node. These might include date or echo. (Do you really want to see the date printed 20 times in a 20-node job?) They might also include cp if your parallel program expects files to be available only on the first node. You don’t have to do anything special for these commands.
  • Commands that have parallel execution, including knowledge of the batch system, built in. These include pbsdcp (parallel file copy) and some application software installed by OSC. You should consult the software documentation for correct parallel usage of application software.
  • Any other command or program that you want to have execute in parallel must be run using mpiexec. Otherwise it will run on only one node, while the other nodes assigned to the job will remain idle. See examples below.

mpiexec

The mpiexec command is used to run multiple copies of an executable program, usually (but not always) on multiple nodes. It is a replacement for the mpirun script which is part of the mpich package. Message-passing (MPI) programs must always be started with mpiexec.

Very important note: The mpiexec command installed at OSC is customized to work with the OSC environment and with our batch system. Other versions will not work correctly on our systems.

The mpiexec command has the form:

mpiexec [mpiexec-options] progname [prog-args]

where mpiexec-options is a list of options to mpiexec, progname is the program you want to run, and prog-args is a list of arguments to the program. Note that if the program is not in your path or your current working directory, you must specify the path as part of the name.

By default, mpiexec runs as many copies of progname as there are processors (cores) assigned to the job (nodes x ppn). For example, if your job requested nodes=4:ppn=12, the following command will run 48 a.out processes:

mpiexec a.out

The example above can be modified to pass arguments to a.out. The following example shows two arguments:

mpiexec a.out abc.dat 123

If your program is multithreaded, or if it uses a lot of memory, it may be desirable to run just one process per node. The -pernode option does this. Modifying the above example again, the following example would run 4 copies of a.out, one on each node:

mpiexec -pernode a.out abc.dat 123

You can specify how many processes to run per node using the -npernode option. You cannot specify more processes per node than the number of cores your job requested per node (ppn value). To run 2 processes per node:

mpiexec -npernode 2 a.out abc.dat 123
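
The resulting process count is simply the number of nodes times the -npernode value. As a small sketch of that arithmetic, including the ppn restriction, consider the following; the function name is illustrative and the function does not call mpiexec.

```shell
# total_procs NODES PPN NPERNODE: print the number of processes that
# "mpiexec -npernode NPERNODE" would start, or fail if NPERNODE exceeds PPN.
total_procs() {
    if [ "$3" -gt "$2" ]; then
        echo "npernode ($3) exceeds ppn ($2)" >&2
        return 1
    fi
    echo $(( $1 * $3 ))
}

total_procs 4 12 2
```

For a job that requested nodes=4:ppn=12, this prints 8.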

It is also possible to specify the total number of processes to run using the -n or -np option. (These are the same thing.) This option is useful primarily for single-node jobs because it does not necessarily spread the processes out evenly over all the nodes. For example, if your job requested nodes=1:ppn=12, the following command will run 4 a.out processes:

mpiexec -n 4 a.out abc.dat 123

The -tv option on mpiexec runs your program with the TotalView parallel debugger. For example, assuming nodes=4:ppn=12, the following command lets you debug your program a.out with one process per node and the arguments given:

mpiexec -tv -pernode a.out abc.dat 123

System commands can also be run with mpiexec. The following commands create a directory named data in the $TMPDIR directory on each node:

cd $TMPDIR
mpiexec -pernode mkdir data

pbsdcp

If you use $TMPDIR in a parallel job, you will probably want to copy files to or from all the nodes in your job. The pbsdcp (“PBS Distributed Copy”) command is used for this task.

The following examples illustrate how to copy two files, a directory (recursively), and all files starting with “model” from your current directory to all the nodes assigned to your job.

pbsdcp file1 file2 $TMPDIR
pbsdcp -r dir1 $TMPDIR
pbsdcp model* $TMPDIR

The following example illustrates how to copy all files starting with “outfile” from all the nodes assigned to your job back to the directory you submitted your job from. The files from all the nodes will be placed into a single directory; you should name them differently to avoid name collisions. The quotes are necessary in gather mode (-g) if you use a wildcard (*) in your file name.

pbsdcp -g '$TMPDIR/outfile*' $PBS_O_WORKDIR

Environment variables for MPI

If your program combines MPI and OpenMP (or another multithreading technique), you should disable processor affinity by setting the environment variable $MV2_ENABLE_AFFINITY to 0 in your script. If you don’t disable affinity, all your threads will run on the same core, negating any benefit from multithreading.

This does not apply if you are using MPI-1, which is available only on Glenn.

To set the environment variable in bash, include this line in your script:

export MV2_ENABLE_AFFINITY=0

To set the environment variable in csh, include this line in your script:

setenv MV2_ENABLE_AFFINITY 0

Environment variables for OpenMP

The number of threads used by an OpenMP program is typically controlled by the environment variable $OMP_NUM_THREADS. If this variable isn't set, the number of threads defaults to the following, although it can be overridden by the program:

  • on Oakley: the number of cores you requested per node (ppn value)
  • on Glenn: the total number of cores in the node

If your job runs just one process per node and is the only job running on the node, the default behavior is what you want. Otherwise you should set $OMP_NUM_THREADS to a value that ensures that the total number of threads for all your processes on the node does not exceed the ppn value your job requested.

For example, to set the environment variable to a value of 12 in bash, include this line in your script:

export OMP_NUM_THREADS=12

For example, to set the environment variable to a value of 12 in csh, include this line in your script:

setenv OMP_NUM_THREADS 12

Note: Some programs ignore $OMP_NUM_THREADS and determine a number of threads programmatically.
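As a rough sketch of the rule above, the thread count can be derived from the ppn value and the number of processes you start on each node. The variable names ppn and procs_per_node are illustrative only; the batch system does not set them, so they must be kept in step with what your job actually requests.

```shell
#!/bin/bash
# Assumed values for illustration: match these to your own job request.
ppn=12             # cores requested per node (#PBS -l nodes=...:ppn=12)
procs_per_node=2   # processes you start on each node

# Integer division keeps (processes x threads) at or below ppn.
threads=$(( ppn / procs_per_node ))
if [ "$threads" -lt 1 ]; then
    threads=1      # never ask for fewer than one thread
fi

export OMP_NUM_THREADS=$threads
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

With the values shown, each of the two processes on a node gets 6 threads, for 12 threads in total.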

Batch script examples

Simple sequential job

The following is an example of a single-processor sequential job that uses $TMPDIR as its working area. It assumes that the program mysci has already been built. The script copies the executable and its input file from the directory the qsub command was called from into $TMPDIR, runs the code in $TMPDIR, and copies the output files back to the original directory.

#PBS -N myscience
#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -j oe
# Copy the executable and input file to the node-local work area
cd $PBS_O_WORKDIR
cp mysci mysci.in $TMPDIR
cd $TMPDIR
# Run the program; timing statistics from /usr/bin/time go to the job log
/usr/bin/time ./mysci > mysci.hist
# Copy the results back to the submission directory
cp mysci.hist mysci.out $PBS_O_WORKDIR

Serial job with OpenMP multithreading

This example uses 1 node with 12 cores, which is suitable for Oakley. A similar job on Glenn would use 8 cores; the OMP_NUM_THREADS environment variable would also be set to 8. A program must be written to take advantage of multithreading for this to work.

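#PBS -N my_job
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=12
#PBS -j oe
# Copy the executable to the node-local work area and run there
cd $PBS_O_WORKDIR
cp a.out $TMPDIR
cd $TMPDIR
# One thread per requested core
export OMP_NUM_THREADS=12
./a.out > my_results
# Copy the results back to the submission directory
cp my_results $PBS_O_WORKDIR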

Simple parallel job

Here is an example of an MPI job that uses 4 nodes with 12 cores each, running one process per core (48 processes total). This assumes a.out was built with the GNU compiler in order to illustrate the module command. The module swap is necessary on Oakley when running MPI programs built with a compiler other than Intel.

#PBS -N my_job
#PBS -l walltime=10:00:00
#PBS -l nodes=4:ppn=12
#PBS -j oe
module swap intel gnu
cd $PBS_O_WORKDIR
pbsdcp a.out $TMPDIR
cd $TMPDIR
mpiexec a.out
pbsdcp -g 'results*' $PBS_O_WORKDIR

Parallel job with MPI and OpenMP

This example is a hybrid MPI/OpenMP job. It runs one MPI process per node with 12 threads per process. The assumption here is that the code was written to support multilevel parallelism. The executable is named hybridprogram.

#PBS -N my_job
#PBS -l walltime=20:00:00
#PBS -l nodes=4:ppn=12
#PBS -j oe
export OMP_NUM_THREADS=12
export MV2_ENABLE_AFFINITY=0
cd $PBS_O_WORKDIR
pbsdcp hybridprogram $TMPDIR
cd $TMPDIR
mpiexec -pernode hybridprogram
pbsdcp -g 'results*' $PBS_O_WORKDIR

Job Submission

Job scripts are submitted to the batch system using the qsub command.  Be sure to submit your job on the system you want your job to run on.

The batch systems on the two clusters are entirely separate; neither knows anything about the jobs queued or running on the other system.  You may edit your batch scripts anywhere, but you must submit and monitor your jobs from a login node on the system where you want to run.

Standard batch job

Most jobs on our system are submitted as scripts with no command-line options. If your script is in a file named “myscript”:

qsub myscript

In response to this command you’ll see a line with your job ID:

123456.oak-batch.osc.edu

You’ll use this job ID (numeric part only) in monitoring your job. You can find it again using the “qstat -u userid” command described elsewhere.
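If you want to capture the numeric part in a script, shell parameter expansion is one way to do it. This is a minimal sketch; the sample string is the job ID shown above, and the variable names full_id and jobid are illustrative.

```shell
#!/bin/bash
# Sample value copied from the text above; in a real session this string
# would come from the submission itself, e.g. full_id=$(qsub myscript)
full_id="123456.oak-batch.osc.edu"

# Remove everything from the first dot onward, leaving the numeric part.
jobid="${full_id%%.*}"
echo "$jobid"
```

The value in jobid can then be passed to the monitoring commands described later in this guide.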

When you submit a job, the script is copied by the batch system. Any changes you make subsequently to the script file will not affect the job. Your input files and executables, on the other hand, are not picked up until the job starts running.

Interactive batch

The batch system supports an interactive batch mode. This mode is useful for debugging parallel programs or running a GUI program that’s too large for the login node. The resource limits (memory, CPU) for an interactive batch job are the same as the standard batch limits.

Interactive batch jobs are generally invoked without a script file, for example:

qsub -I -X -l nodes=2:ppn=12 -l walltime=1:00:00

The -I flag indicates that the job is interactive. The -X flag enables X11 forwarding, which is necessary with a GUI. You will need to have an X11 server running on your computer to use X11 forwarding. The remaining flags in this example are resource requests with the same meaning as the corresponding header lines in a batch file.

After you enter this line, you’ll see something like the following:

qsub: waiting for job 123456.oak-batch.osc.edu to start

Your job will be queued just like any job. When the job runs, you’ll see the following line:

qsub: job 123456.oak-batch.osc.edu ready

At this point, you have an interactive login shell on one of the compute nodes, which you can treat like any other login shell.

It is important to remember that OSC systems are optimized for batch processing, not interactive computing. If the system load is high, your job may wait for hours in the queue, making interactive batch impractical. Requesting a walltime limit of one hour or less is recommended because your job can run on nodes reserved for debugging.

Job arrays

If you submit many similar jobs at the same time, you should consider using a job array. With a single qsub command, you can submit multiple jobs that will use the same script. Each job has a unique identifier, $PBS_ARRAYID, which can be used to parameterize its behavior.

Individual jobs in a job array are scheduled independently, but some job management tasks can be performed on the entire array.

To submit an array of jobs numbered from 1 to 100, all using the script sim.job:

qsub -t 1-100 sim.job

The script would use the environment variable $PBS_ARRAYID, possibly as an input argument to an application or as part of a file name.
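A sketch of how a script such as sim.job might use the array index is shown below. The file naming scheme (sim_N.in and sim_N.out) and the helper function names are hypothetical; substitute whatever your application expects.

```shell
#!/bin/bash
# Build per-task file names from an array index.
# The sim_<index>.in / sim_<index>.out naming scheme is hypothetical.
input_name()  { printf 'sim_%s.in\n'  "$1"; }
output_name() { printf 'sim_%s.out\n' "$1"; }

# Inside an array job the batch system sets PBS_ARRAYID; default to 1 so
# the fragment can also be tried outside the batch system.
task="${PBS_ARRAYID:-1}"
echo "task $task reads $(input_name "$task") and writes $(output_name "$task")"
```

Each member of the array then works on its own pair of files without any change to the script.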

Job dependencies

It is possible to set conditions on when a job can start. The most common of these is a dependency relationship between jobs.

For example, to ensure that the job being submitted (with script sim.job) does not start until after job 123456 has finished:

qsub -W depend=afterany:123456 sim.job
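When both jobs are submitted from the same script, the job ID printed by the first qsub can be fed into the second. The sketch below assumes a hypothetical first-stage script named prep.job and a helper function depend_after introduced here for illustration; it falls back to printing the argument when qsub is not available.

```shell
#!/bin/bash
# Turn the job ID printed by qsub into a dependency argument.
depend_after() { printf 'depend=afterany:%s\n' "${1%%.*}"; }

if command -v qsub >/dev/null 2>&1; then
    # prep.job is a hypothetical first-stage script.
    first=$(qsub prep.job)
    qsub -W "$(depend_after "$first")" sim.job
else
    # No batch system here: just show the argument that would be used.
    depend_after "123456.oak-batch.osc.edu"
fi
```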

Many other options are available, some quite complicated; for more information, see the qsub online manual by using the command:

man qsub

Monitoring and Managing Your Job

There are several commands available that allow you to check the status of your job, monitor execution of a running job, and collect performance statistics for your job. You can also delete a job if necessary.

Status of queued jobs

You can monitor the batch queues and check the status of your job using the commands qstat and showq. There is also a command to get an extremely unreliable estimate of the time your job will start. This section also addresses the question of why a job may have a long queue wait and explains a little about how job scheduling works.

qstat

Use the qstat command to check the status of your jobs. You can see whether your job is queued or running, along with information about requested resources. If the job is running you can see elapsed time and resources used.

Here are some examples for user usr1234 and job 123456.

By itself, qstat lists all jobs in the system:

qstat

To list all the jobs belonging to a particular user:

qstat -u usr1234

To list the status of a particular job, in standard or alternate (more useful!) format:

qstat 123456
qstat -a 123456

To get all the details about a particular job (full status):

qstat -f 123456

showq

The showq command lists job information from the point of view of the scheduler.  Jobs are grouped according to their state: running, idle, or blocked.

To list all jobs in the system:

showq

To list all jobs belonging to a particular user (-u flag may be combined with others):

showq -u usr1234

Idle jobs are those that are eligible to run; they are listed in priority order. Note that the priority order may change over time. Note also that jobs may be run out of order if resources are not immediately available to run the highest priority job (“backfill”). This is done in such a way that it does not delay the start of the highest priority job.

To list details about idle jobs:

showq -i
showq -i -u usr1234

Blocked jobs are those that are not currently eligible to run. There are several reasons a job may be blocked.

  • If a user or group has reached the limit on the number of jobs or cores allowed, the rest of their jobs will be blocked. The jobs will be released as the running jobs complete.
  • If a user sets up dependencies among jobs or conditions that have to be met before a job can run, the jobs will be blocked until the dependencies or conditions are met.
  • You can place a hold on your own job using qhold jobid.
  • In rare cases, an error in the batch system will cause a job to be blocked with state “BatchHold”. If you see one of your jobs in this state, contact OSC Help for assistance.

To list blocked jobs:

showq -b
showq -b -u usr1234

showstart

The showstart command gives an estimate for the start time of a job. Unfortunately, these estimates are not at all accurate except for the highest priority job in the queue. If the time shown is exactly midnight two or three days in the future, it is meaningless. Otherwise the estimate may be off by a large amount in either direction.

Example:

showstart 123456

Why isn’t my job running?

There are many reasons that your job may have to wait in the queue longer than you would like. Here are some of them.

  • System load is high. It’s frustrating for everyone!
  • A system downtime has been scheduled and jobs are being held. Check the message of the day, which is displayed every time you log in, or the system notices posted on ARMSTRONG.
  • You or your group have used a lot of resources in the last few days, causing your job priority to be lowered (“fairness policy”).
  • You or your group are at the maximum processor count or running job count and your job is being held.
  • Your project has a large negative RU (resource unit) balance.
  • Your job is requesting specialized resources, such as large memory or certain software licenses, that are in high demand.
  • Your job is requesting a lot of resources. It takes time for the resources to become available.
  • Your job is requesting incompatible or nonexistent resources and can never run.
  • Your resource requests unnecessarily restrict the nodes where the job can run, for example by requesting mem=25GB on a system where most of the nodes have 24GB.
  • Your job is stuck in batch hold because of system problems (very rare!).

Priority, backfill, and debug reservations

Priority is a complicated function of many factors, including the processor count and walltime requested, the length of time the job has been waiting, and how much other computing has been done by the user and their group over the last several days.

During each scheduling iteration, the scheduler will identify the highest priority job that cannot currently be run and find a time in the future to reserve for it. Once that is done, the scheduler will then try to backfill as many lower priority jobs as it can without affecting the highest priority job's start time. This keeps the overall utilization of the system high while still allowing reasonable turnaround time for high priority jobs. Short jobs and jobs requesting few resources are the easiest to backfill.
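The backfill idea can be illustrated with a deliberately simplified test. The sketch below assumes that the reserved job will need every node, so a lower priority job may start now only if enough nodes are free and its walltime limit ends before the reservation begins. The function name and its arguments are hypothetical, and the real scheduler weighs more factors than this.

```shell
#!/bin/bash
# can_backfill FREE_NODES NEED_NODES WALLTIME_MIN MIN_UNTIL_RESERVATION
# Succeeds (exit status 0) when the job fits in the free nodes and would
# finish before the highest-priority job's reserved start time.
can_backfill() {
    local free_nodes=$1 need_nodes=$2 walltime_min=$3 gap_min=$4
    [ "$need_nodes" -le "$free_nodes" ] && [ "$walltime_min" -le "$gap_min" ]
}

# Scenario: 10 nodes are free for the next 120 minutes.
if can_backfill 10 2 60 120; then
    echo "2 nodes for 1 hour: can backfill"
fi
if ! can_backfill 10 2 600 120; then
    echo "2 nodes for 10 hours: must wait"
fi
```

This is why an accurate, short walltime request tends to shorten your queue wait: it makes the job fit into more gaps.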

A small number of nodes are set aside during the day for jobs with a walltime limit of 1 hour or less, primarily for debugging purposes.

Observing a running job

You can monitor a running batch job almost as easily as you can monitor a program running interactively. The qpeek command allows you to see the output that would normally appear on your display. The pdsh (on Oakley) or all (on Glenn) command allows you to monitor your job’s CPU and memory usage, among other things. These commands are run from the login node.

qpeek

A job’s stdout and stderr data streams, which normally show up on the screen, are written to log files. These log files are stored on a server until the job ends, so you can’t look at them directly. The qpeek command allows you to peek at their contents. If you used the PBS header line to join the stdout and stderr streams (#PBS -j oe), the two streams are combined in the output log.

Here are a few examples for job 123456.  You can use the -e flag with any of them to get the error log instead of the output log.  (This is not applicable if you used “#PBS -j oe”.)

To display the current contents of the output log (stdout) for job 123456:

qpeek 123456

To display the current contents of the error log (stderr) for job 123456:

qpeek -e 123456

To display just the beginning (“head”) of the output log for job 123456:

qpeek -h 123456

To display just the end (“tail”) of the output log for job 123456:

qpeek -t 123456

To display the end of the output log and keep listening (“tail -f”) – terminate with Ctrl-C:

qpeek -f 123456

pdsh or all

If you’re in the habit of monitoring your programs using top or ps or something similar, you may find the pdsh or all command helpful. pdsh stands for “Parallel Distributed Shell”. It lets you run a command in parallel on all the nodes assigned to your job, with the results displayed on your screen. It is primarily used with parallel jobs. pdsh is used primarily on Oakley; all is available only on Glenn.

Caution: The commands that you run should be quick and simple to avoid interfering with the job. This is especially true if your job is sharing a node with other jobs.

Two useful commands often used with pdsh are uptime, which displays system load, and free, which gives memory usage; see also the man pages for these commands. There are also options for top that make it usable with pdsh.

Since this is a parallel command, the output for the various nodes will appear in an unpredictable order.

Examples for job 123456 on Oakley:

pdsh -j 123456 uptime
pdsh -j 123456 free -mo
pdsh -j 123456 top -b -n 1 -u usr1234

Examples for job 987654 on Glenn:

all -j 987654 uptime
all -j 987654 free -mo
all -j 987654 top -b -n 1 -u usr1234

qstat

The qstat command provides information about CPU, memory, and walltime usage for running jobs. With the -a flag, it shows elapsed time (wall time) in hours and minutes. With no flag, it shows “Time Used”, an accounting metric, in hours, minutes, and seconds. With the -f flag, it shows resources used, with information aggregated across all the nodes the job is running on.

Examples:

qstat -a 123456
qstat -f 123456

Managing your jobs

Deleting a job

Situations may arise in which you want to delete one of your jobs from the PBS queue. Perhaps you set the resource limits incorrectly, neglected to copy an input file, or had incorrect or missing commands in the batch file. Or maybe the program is taking too long to run (infinite loop).

The PBS command to delete a batch job is qdel. It applies to both queued and running jobs.

Example:

qdel 123456

If you are unable to delete one of your jobs, it may be because of a hardware problem or system software crash. In this case you should contact OSC Help.

Altering a queued job

You can alter certain attributes of your job while it’s in the queue using the qalter command. This can be useful if you want to make a change without losing your place in the queue. You cannot make any alterations to the executable portion of the script, nor can you make any changes after the job starts running.

The syntax is:

qalter [options ...] jobid

The options argument consists of one or more PBS directives in the form of command-line options.

For example, to change the walltime limit on job 123456 to 5 hours and have email sent when the job ends (only):

qalter -l walltime=5:00:00 -m e 123456

Placing a hold on a queued job

If you want to prevent a job from running but leave it in the queue, you can place a hold on it using the qhold command. The job will remain blocked until you release it with the qrls command. A hold can be useful if you need to modify the input file for a job, for example, but you don’t want to lose your place in the queue.

Examples:

qhold 123456
qrls 123456

Job statistics

There are commands you can include in your batch script to collect job statistics or performance information.

date

The date command prints the current date and time. It can be informative to include it at the beginning and end of the executable portion of your script as a rough measure of time spent in the job.
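A minimal sketch of this pattern follows. It also records the time in seconds so the script can report the difference itself; the sleep command is only a stand-in for the real work of the job.

```shell
#!/bin/bash
# Record the time before and after the work and report the difference.
echo "Job started:  $(date)"
start=$(date +%s)      # seconds since the epoch

sleep 1                # stand-in for the real commands of the job

end=$(date +%s)
echo "Job finished: $(date)"
elapsed=$(( end - start ))
echo "Elapsed time: ${elapsed} seconds"
```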

time

The time utility is used to measure the performance of a single command. It can be used for serial or parallel processes. Add /usr/bin/time to the beginning of a command in the batch script:

/usr/bin/time myprog arg1 arg2

The result is provided in the following format:

  1. user time (CPU time spent running your program)
  2. system time (CPU time spent by your program in system calls)
  3. elapsed time (wallclock)
  4. % CPU used
  5. memory, pagefault and swap statistics
  6. I/O statistics

These results are appended to the job's error log file. Note: Use the full path “/usr/bin/time” to get all the information shown.

ja

The job accounting utility ja prints job accounting information inside a PBS job, including CPU time, memory, virtual memory, and walltime used. This information is also included in the email sent when a job ends (if email is requested). While the job is running, the same information is available with the qstat -f command.

Scheduling Policies and Limits

The batch scheduler is configured with a number of scheduling policies to keep in mind. The policies attempt to balance the competing objectives of reasonable queue wait times and efficient system utilization. The details of these policies differ slightly on Oakley and Glenn. Exceptions to the limits can be made under certain circumstances; contact oschelp@osc.edu for details.

Hardware limits

Oakley and Glenn differ in the number of processors (cores) and the amount of memory and disk they have per node. We commonly find jobs waiting in the queue that cannot be run on the system where they were submitted because their resource requests exceed the limits of the available hardware. Jobs never migrate between systems, so please pay attention to these limits.

Notice in particular the large number of standard nodes and the small number of large-memory nodes. Your jobs are likely to wait in the queue much longer for a large-memory node than for a standard node. Users often inadvertently request slightly more memory than is available on a standard node and end up waiting for one of the scarce large-memory nodes, so check your requests carefully.

This is a brief summary. Details about the available hardware can be found elsewhere in the documentation.

 

System | # of nodes | # of cores per node (ppn) | Memory in GB (*approximate) | Temporary file space (GB)
Oakley (standard) | 685 | 12 | 48 | 812
Oakley (bigmem) | 8 | 12 | 192 | 812
Glenn | 635 | 8 | 24 | 392

On Oakley, 64 of the standard nodes have 2 GPUs each. On Glenn, 32 of the available newdual nodes have 2 GPUs each.

* The actual amount of memory you can request in GB is slightly less than the nominal amount shown. It is safest to request memory in MB, for example, 8000MB instead of 8GB. (1GB is interpreted as 1024MB.)
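The arithmetic behind this advice can be sketched as follows, using the 8GB example from the note. The variable names are illustrative.

```shell
#!/bin/bash
# Compare a request written in GB with the round-number MB request.
gb=8
as_gb_request=$(( gb * 1024 ))   # 8GB is interpreted as 8192MB
as_mb_request=$(( gb * 1000 ))   # 8000MB, the safer form from the text

echo "mem=${gb}GB asks for ${as_gb_request}MB"
echo "mem=${as_mb_request}MB asks for ${as_mb_request}MB"
echo "difference: $(( as_gb_request - as_mb_request ))MB"
```

The gap grows with the size of the request, which is how a request written in GB can end up slightly above what a standard node can supply.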

Walltime limits per job

Serial jobs (that is, jobs which request only one node) can run for up to 168 hours, while parallel jobs may run for up to 96 hours.

Users who can demonstrate a need for longer serial job time may request access to the longserial queue, which allows single-node jobs of up to 336 hours. Longserial access is not automatic. Factors that will be considered include how efficiently the jobs use OSC resources and whether they can be broken into smaller tasks that can be run separately.
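Before submitting, a request can be checked against these limits with a few lines of shell. This is a sketch only: the function names are hypothetical, the numbers are the ones quoted above, and the longserial queue is not modeled.

```shell
#!/bin/bash
# max_walltime_hours NODES
# Prints the limit quoted above: 168 hours for single-node (serial) jobs,
# 96 hours for jobs that request more than one node.
max_walltime_hours() {
    if [ "$1" -eq 1 ]; then
        echo 168
    else
        echo 96
    fi
}

# check_request NODES HOURS: succeeds when the request is within the limit.
check_request() {
    [ "$2" -le "$(max_walltime_hours "$1")" ]
}

if check_request 4 120; then
    echo "4 nodes, 120 hours: within the limit"
else
    echo "4 nodes, 120 hours: exceeds the limit"
fi
```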

Limits per user and group

These limits are applied separately on Oakley and Glenn.

An individual user can have up to 128 concurrently running jobs and/or up to 2048 (2040 for Oakley) processor cores in use. All the users in a particular group/project can among them have up to 192 concurrently running jobs and/or up to 2048 (2040 for Oakley) processor cores in use. Jobs submitted in excess of these limits are queued but blocked by the scheduler until other jobs exit and free up resources.

A user may have no more than 1000 jobs submitted to the batch system at once. Jobs submitted in excess of this limit will be rejected.

Fair-share limits

To keep any one user or group/project from monopolizing the system when others need the same resources, the scheduler imposes what are known as fair-share limits. If a user or group/project uses large amounts of computing resources over a period of a few days, any new jobs they submit during that period will have reduced priority.

Projects with large negative RU balances

Projects that have used all their allocated resources may have further restrictions placed on their ability to run jobs. The project PI will receive multiple emails prior to having restrictions put in place. Restrictions may be lifted by submitting a proposal for additional HPC resources.

Priority

The priority of a job is influenced by a large number of factors, including the processor count requested, the length of time the job has been waiting, and how much other computing has been done by the user and their group over the last several days. However, having the highest priority does not necessarily mean that a job will run immediately, as there must also be enough processors and memory available to run it.

Short jobs for debugging

A small number of nodes are set aside during the day for jobs with a walltime limit of 1 hour or less. Please remember to exit debug jobs when you are done using the resources, to free them up for other users.

GPU Jobs

All GPU nodes are reserved for jobs that request gpus. Short non-GPU jobs are allowed to backfill on these nodes to allow for better utilization of cluster resources.

PBS Directives Summary

PBS directives may appear as header lines in a batch script or as options on the qsub command line. They specify the resource requirements of your job and various other attributes. Many of the directives are discussed in more detail elsewhere in this document. The online manual page for qsub (man qsub) describes many of them.

PBS header lines must come before any executable lines in your script. Their syntax is:

#PBS [option]

where option is one of the options in the table below. For example, to request 4 nodes with 12 processors per node:

#PBS -l nodes=4:ppn=12

The syntax for including an option on the qsub command line is:

qsub [option]

For example, the following line submits the script myscript.job but adds the -l nodes directive:

qsub -l nodes=4:ppn=12 myscript.job

Notes: The “-l” flag is an “el” (for “limit”), not a “one”. There are no spaces around the “=” and “:” signs.

Option | Description | Example
-l walltime=hh:mm:ss | Requests the amount of time needed for the job. Default is one hour. | #PBS -l walltime=10:00:00
-l nodes=n:ppn=p | Requests number of nodes and processors per node. The range of ppn values depends on the hardware you are running on. Default is one processor on one node. | #PBS -l nodes=2:ppn=12 or #PBS -l nodes=3:ppn=8
-l nodes=n:ppn=p:gpus=g | Requests number of gpus per node in addition to the node and ppn requests. This syntax applies to Oakley only. | #PBS -l nodes=1:ppn=12:gpus=2
-l nodes=n:ppn=12:gpus=2:vis | Requests a visualization job on Oakley. The batch system will start an X server with access to the GPUs. | #PBS -l nodes=1:ppn=12:gpus=2:vis
-l nodes=n:ppn=8:gpu | Requests a gpu node. Must request ppn=8. This syntax applies to Glenn only. | #PBS -l nodes=1:ppn=8:gpu
-l nodes=n:ppn=p:feature | Requests a node with a particular feature. | #PBS -l nodes=1:ppn=8:newdual
-l mem=amount | Requests the total amount of memory needed across all nodes. Default units are bytes; can also be expressed in megabytes (mem=4000MB) or gigabytes (mem=4GB). | #PBS -l mem=24000MB
-l software=package[+N] | Requests use of N licenses for package. If omitted, N=1. Only required for jobs using specific software packages with limited numbers of licenses; see software documentation for details. | #PBS -l software=abaqus+5
-N jobname | Sets the job name, which appears in status listings and is used as the prefix in the job’s output and error log files. The job name must not contain spaces. | #PBS -N Test_job2
-j oe | By default, PBS returns two log files, one for the standard output stream (stdout), the other for the standard error stream (stderr). This option joins both into a single log file. | #PBS -j oe
-m [a][b][e] or -m n | Use any combination of the letters a, b, and e; do not include the brackets. Requests a status email when the job begins (b), ends (e), or aborts (a). The n option requests no email, but you'll still get email if the job aborts. | #PBS -m abe (for lots of email) or #PBS -m n (for minimal email)
-o filename | Renames the output log file. | #PBS -o test.out
-I | Requests an interactive batch job. | qsub -I
-X | Enables X11 forwarding. Useful primarily in interactive batch jobs. | qsub -I -X
-l file=amount | Requests nodes with temporary file space of amount specified. This should be used only when requesting special nodes on Glenn. | #PBS -l file=1000gb
-S /bin/shell | Sets the Linux shell to be used in executing your batch script. You should normally leave this out; it defaults to your normal login shell. The most common shells are /bin/csh, /bin/bash, /bin/ksh, /bin/tcsh. | #PBS -S /bin/bash
-a [YYYY][MM][DD]hhmm | Delays executing the batch job until a particular date and time. Hours and minutes are required; year, month and day are optional. No spaces are allowed in the numeric string. | #PBS -a 1700 (run the job after 5pm)
-W depend=afterany:jobid | This job may be scheduled for execution only after job jobid has terminated. | #PBS -W depend=afterany:123456

 

Batch Environment Variable Summary

The batch system provides several environment variables that you may want to use in your job script. This section is a summary of the most useful of these variables. Many of them are discussed in more detail elsewhere in this document. The ones beginning with PBS_ are described in the online manual page for qsub (“man qsub”).

Environment Variable | Description | Comments
$TMPDIR | The absolute path and name of the temporary directory created for this job on the local file system of each node | Access to local files is much faster than access to shared file systems. You should copy your files to $TMPDIR and work locally whenever possible. This directory is deleted when your job ends.
$PFSDIR | The absolute path and name of the temporary directory created for this job on the parallel file system | Use the parallel file system rather than your home or project directory for heavy I/O if you can’t use $TMPDIR. This directory is deleted when your job ends.
$PBS_O_WORKDIR | The absolute path of the directory from which the batch script was started | Your batch job begins execution in your home directory even if you submit the job from another directory.
$PBS_NODEFILE | The absolute path and name of the file containing the list of nodes and processors assigned to the job | Sometimes used in a batch script to determine the number of nodes and/or processors requested.
$PBS_GPUFILE | The absolute path and name of the file containing the list of gpus assigned to the job | Oakley only. Sometimes used to determine which GPU should be used if only one was requested.
$PBS_ARRAYID | Unique identifier assigned to each member of a job array | Used with qsub -t. See the discussion of job arrays elsewhere in this document.
$PBS_JOBID | The job identifier assigned to the job by the batch system | For example, 123456. May be used as part of a directory name, for example.
$PBS_JOBNAME | The job name supplied by the user | The job name may be assigned in the script using the -N header.
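As an illustration of the $PBS_NODEFILE use mentioned in the table, the sketch below counts processors and distinct nodes. It assumes the file holds one host name per line, repeated once for each processor assigned on that node; the function names and the stand-in host names are made up for the example.

```shell
#!/bin/bash
# Count processors (lines) and distinct nodes in a node file.
# Assumes one host name per line, repeated once per assigned processor.
count_procs() { awk 'END { print NR }' "$1"; }
count_nodes() { sort -u "$1" | awk 'END { print NR }'; }

nodefile="${PBS_NODEFILE:-}"
standin=""
if [ -z "$nodefile" ]; then
    # Outside a batch job: build a small stand-in file with made-up names.
    standin=$(mktemp)
    printf '%s\n' n0001 n0001 n0002 n0002 > "$standin"
    nodefile=$standin
fi

echo "processors: $(count_procs "$nodefile")"
echo "nodes:      $(count_nodes "$nodefile")"

# Remove the stand-in file if one was created.
if [ -n "$standin" ]; then
    rm -f "$standin"
fi
```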

 

The following environment variables are often used in batch scripts but are not directly related to the batch system.

 

Environment Variable | Description | Comments
$OMP_NUM_THREADS | The number of threads to be used in an OpenMP program | See the discussion of OpenMP elsewhere in this document. Set in your script. Not all OpenMP programs use this value.
$MV2_ENABLE_AFFINITY | Thread affinity option for MVAPICH2. | Set this variable to 0 in your script if your program uses both MPI and multithreading. Not needed with MPI-1.
$HOME | The absolute path of your home directory. | Use this variable to avoid hard-coding your home directory path in your script.

 

Batch-Related Command Summary

This section summarizes two groups of batch-related commands: commands that are run on the login nodes to manage your jobs and commands that are run only inside a batch script. Only the most common options are described here.

Many of these commands are discussed in more detail elsewhere in this document. All have online manual pages (example: man qsub) unless otherwise noted.

In describing the usage of the commands we use square brackets [like this] to indicate optional arguments. The brackets are not part of the command.

Important note: The batch systems on Oakley and Glenn are entirely separate. Be sure to submit your jobs on a login node for the system you want them to run on. All monitoring while the job is queued or running must be done on the same system also. Your job output, of course, will be visible from both systems.

Commands for managing your jobs

These commands are typically run from a login node to manage your batch jobs. The batch systems on Oakley and Glenn are completely separate, so the commands must be run on the system where the job is to be run.

qsub

The qsub command is used to submit a job to the batch system.

Usage | Description | Example
qsub [options] script | Submit a script for a batch job. The options list is rarely used but can augment or override the directives in the header lines of the script. | qsub sim.job
qsub -t array_request [options] script | Submit an array of jobs | qsub -t 1-100 sim.job
qsub -I [options] | Submit an interactive batch job | qsub -I -l nodes=1:ppn=



qstat

The qstat command is used to display the status of batch jobs.

Usage | Description | Example
qstat | Display all jobs currently in the batch system. | qstat
qstat [-a] jobid | Display information about job jobid. The -a flag uses an alternate format. | qstat -a 123456
qstat -f jobid | Display full status information about job jobid. | qstat -f 123456
qstat -u username [-f] | Display information about all the jobs belonging to user username. | qstat -u usr1234

qdel

The qdel command may be used to delete a queued or running job.

Usage | Description | Example
qdel jobid | Delete job jobid. | qdel 123456

qpeek

The qpeek command may be used to look at the output log file (stdout) or error log file (stderr) of a running job.

Usage | Description | Example
qpeek jobid | Display the current contents of the output log file (stdout) for job jobid. | qpeek 1234567
qpeek -e jobid | Display the current contents of the error log file (stderr) for job jobid. | qpeek -e 1234567
qpeek -h [-e] jobid | Display just the beginning (“head”) of the file. | qpeek -h 123456
qpeek -t [-e] jobid | Display just the end (“tail”) of the file. | qpeek -t 123456
qpeek -f [-e] jobid | Display the end of the file and keep listening (“tail -f”). | qpeek -f 123456

qalter

The qalter command may be used to modify the attributes of a queued (not running) job. Not all attributes can be altered.

Usage | Description | Example
qalter [option] jobid | Alter one or more attributes of a queued job. The options you can modify are a subset of the directives that can be used when submitting a job. | qalter -l mem=47gb 123456

qhold, qrls

The qhold command allows you to place a hold on a queued job. The job will be prevented from running until you release the hold with the qrls command.

Usage | Description | Example
qhold jobid | Place a user hold on job jobid. | qhold 123456
qrls jobid | Release a user hold previously placed on job jobid. | qrls 123456

showstart

The showstart command tries to estimate when a queued job will start running. It is extremely unreliable, often making large errors in either direction.

Usage | Description | Example
showstart jobid | Display estimate of start time. | showstart 123456

showq

The showq command lists jobs from the point of view of the Moab scheduler.

Usage | Description | Example
showq | List all jobs currently in the batch system. | showq
showq -i | List idle jobs that are eligible to run. | showq -i
showq -r | List running jobs. | showq -r
showq -b | List blocked jobs. | showq -b
showq -u username | List all jobs belonging to user username. | showq -u usr1234

pdsh or all

The pdsh (on Oakley) or all (on Glenn) command can be used to monitor a running job by executing a command on all the nodes assigned to the job and returning the results. It is primarily used with parallel jobs. The commands that are run should be quick and simple to avoid interfering with the job. Two useful commands used with pdsh or all are uptime, which displays system load, and free, which gives memory usage; see also the man pages for these commands.

Usage | Description | Example
pdsh -j jobid cmd | Run cmd on all the nodes on which jobid is running. Oakley only. | pdsh -j 123456 uptime
 | | pdsh -j 123456 free -m
all -j jobid cmd | Run cmd on all the nodes on which jobid is running. Glenn only. | all -j 123456 uptime
 | | all -j 123456 free -m

Commands used only inside a batch job

These commands can only be used inside a batch job.

mpiexec

Use the mpiexec command to run a parallel program or to run multiple processes simultaneously within a job. It is a replacement program for the script mpirun, which is part of the mpich package.

The OSC version of mpiexec is customized to work with our batch environment. There are other mpiexec programs in existence, but it is imperative that you use the one provided with our system.

Usage | Description | Example
mpiexec progname [args] | Run the executable program progname in parallel, with as many processes as there are processors (cores) assigned to the job (nodes*ppn). | mpiexec myprog
 | | mpiexec yourprog abc.dat 123
mpiexec -pernode progname [args] | Run only one process per node. | mpiexec -pernode myprog
mpiexec -npernode num progname [args] | Run the specified number of processes on each node. | mpiexec -npernode 3 myprog
mpiexec -tv [options] progname [args] | Run the program with the TotalView parallel debugger. | mpiexec -tv myprog
mpiexec -n num progname [args] or mpiexec -np num progname [args] | Run only the specified number of processes. (-n and -np are equivalent.) Does not spread processes out evenly across nodes. | mpiexec -n 3 myprog
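
To see how many processes a plain mpiexec call would start (nodes*ppn), the job's node list can be inspected from inside the job. The sketch below assumes the standard PBS node file, $PBS_NODEFILE, which lists each assigned node once per requested core. The helper names count_cores and count_nodes are made up for this example and are not OSC-provided commands.

```shell
#!/bin/sh
# Illustrative helpers (names are hypothetical, not OSC-provided commands).
# Assumes a PBS-style node file: one line per core, so a node requested
# with ppn=12 appears 12 times.

# Total number of cores assigned to the job (nodes*ppn).
count_cores() {
    wc -l < "$1" | tr -d ' '
}

# Number of distinct nodes assigned to the job.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Only report when running inside a batch job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "cores: $(count_cores "$PBS_NODEFILE")"
    echo "nodes: $(count_nodes "$PBS_NODEFILE")"
fi
```

The core count corresponds to the number of processes started by mpiexec myprog, and the node count to the number started by mpiexec -pernode myprog.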

pbsdcp

The pbsdcp command is a distributed copy command for the PBS environment. It copies files to or from each node of the cluster assigned to your job. This is needed when copying files to directories which are not shared between nodes, such as $TMPDIR.

Options are -r for recursive and -p to preserve modification times and modes.

Usage | Description | Example
pbsdcp [-s] [options] srcfiles target | “Scatter”. Copy one or more files from shared storage to the target directory on each node (local storage). The -s flag is optional. | pbsdcp -s infile1 infile2 $TMPDIR
 | | pbsdcp model.* $TMPDIR
pbsdcp -g [options] srcfiles target | “Gather”. Copy the source files from each node to the shared target directory. Wildcards must be enclosed in quotes. | pbsdcp -g '$TMPDIR/outfile*' $PBS_O_WORKDIR

Note: In gather mode, if files on different nodes have the same name, they will overwrite each other. In the -g example above, the file names may have the form outfile001, outfile002, etc., with each node producing a different set of files.
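
One way to avoid such collisions is to include the node name in each output file name. The sketch below is only an illustration: the helper node_outfile and the outfile prefix are hypothetical, and it relies on uname -n to obtain the node's host name.

```shell
#!/bin/sh
# Hypothetical helper: build an output file name that is unique per node
# by appending the node's host name (from uname -n) to a common prefix.
node_outfile() {
    printf '%s/outfile.%s\n' "$1" "$(uname -n)"
}

# Example: a process writing to node-local scratch would use this path.
# Falls back to /tmp when TMPDIR is not set (for instance outside a job).
out=$(node_outfile "${TMPDIR:-/tmp}")
echo "writing results to $out"
```

This is sufficient only when a single process per node writes output; with several writers per node, a further distinguishing suffix would be needed.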

ja

The ja command prints job accounting information from inside a PBS job. This includes aggregate CPU time, memory, virtual memory, and walltime. Note: The same information is available from qstat -f while the job is running.

Usage | Description | Example
ja | Print job accounting information inside a PBS job. | ja

Troubleshooting Batch Problems

License problems

If you get a license error when you try to run a third-party software application, it means either the licenses are all in use or you’re not on the access list for the license. Very rarely there could be a problem with the license server. You should read the software page for the application you’re trying to use and make sure you’ve complied with all the procedures and are correctly requesting the license. Contact OSC Help with any questions.

My job is running slower than it should

Here are a few of the reasons your job may be running slowly:

  • Your job has exceeded available physical memory and is swapping to disk. This is always a bad thing in an HPC environment as it can slow down your job dramatically. Either cut down on memory usage, request more memory, or spread a parallel job out over more nodes.
  • Your job isn’t using all the nodes and/or cores you intended it to use. This is usually a problem with your batch script.
  • Your job is spawning more threads than the number of cores you requested. Context switching involves enough overhead to slow your job.
  • You are doing too much I/O to the network file servers (home and project directories), or you are doing an excessive number of small I/O operations to the parallel file server. An I/O-bound program will suffer severe slowdowns with improperly configured I/O.
  • You didn’t optimize your program sufficiently.
  • You got unlucky and are being hurt by someone else’s misbehaving job. As much as we try to isolate jobs from each other, sometimes a job can cause system-level problems. If you have run your job before and know that it usually runs faster, OSC staff can check for problems.

Someone deleted my job!

If your job is misbehaving, it may be necessary for OSC staff to delete it. Common problems are using up all the virtual memory on a node or performing excessive I/O to a network file server. If this happens you will be contacted by OSC Help with an explanation of the problem and suggestions for fixing it. We appreciate your cooperation in this situation because, much as we try to prevent it, one user’s jobs can interfere with the operation of the system.

Occasionally a problem not caused by your job will cause an unrecoverable situation and your job will have to be deleted. You will be contacted if this happens.

Why can’t I delete my job?

If you can’t delete your job, it usually means a node your job was running on has crashed and the job is no longer running. OSC staff will delete the job.

My job is stuck.

There are multiple reasons that your job may appear to be stuck. If a node that your job is running on crashes, your job may remain in the running job queue long after it should have finished. In this case you will be contacted by OSC and will probably have to resubmit your job.

If you conclude that your job is stuck based on what you see in qpeek, it’s possible that the problem is an illusion. This comment applies primarily to code you develop yourself. If you print progress information, for example, “Input complete” and “Setup complete”, the output may be buffered for efficiency, meaning it’s not written to disk immediately, so it won’t show up in qpeek. To have it written immediately, you’ll have to flush the buffer; most programming languages provide a way to do this.

My job crashed. Can I recover my data?

If your job failed due to a hardware failure or system problem, it may be possible to recover your data from $TMPDIR. If the failure was due to hitting the walltime limit, the data in $TMPDIR would have been deleted immediately. Contact OSC Help for more information.

The trap command can be used in your script to save your data in case your job terminates abnormally.
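
A minimal sketch of this technique follows. It assumes that the batch system sends SIGTERM to the job script before killing it, and that results are written to files named outfile* in $TMPDIR. The function name save_results and the file pattern are illustrative, not part of any OSC tool.

```shell
#!/bin/sh
# Sketch only: save_results and the outfile* pattern are illustrative.
# Inside a batch job PBS_O_WORKDIR and TMPDIR are already set; the
# fallbacks below just let the script run outside a job.
workdir="${PBS_O_WORKDIR:-$PWD}"
scratch="${TMPDIR:-/tmp}"

# Copy result files from node-local scratch back to shared storage.
save_results() {
    cp -p "$scratch"/outfile* "$workdir"/
}

# If the job receives SIGTERM, save what exists so far, then exit.
trap 'save_results; exit 1' TERM

# ... the job's normal commands would follow here ...
```

For a multi-node job, the cp line could be replaced by a pbsdcp -g call so that files on every node are gathered.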

Contacting OSC Help

If you are having a problem with the batch system on any of OSC's machines, you should send email to oschelp@osc.edu. Including the following information will assist HPC Client Services staff in diagnosing your problem quickly:

  1. Name
  2. OSC User ID (username)
  3. Name of the system you are using (Oakley or Glenn)
  4. Job ID
  5. Job script
  6. Job output and/or error messages (preferably in context)