Batch System Concepts

The only access to significant resources on the HPC machines is through the batch process.

Why use a batch system?

Access to the OSC clusters is through a system of login nodes. These nodes are reserved solely for the purpose of managing your files and submitting jobs to the batch system. Acceptable activities include editing/creating files, uploading and downloading files of moderate size, and managing your batch jobs. You may also compile and link small-to-moderate size programs on the login nodes.

CPU time and memory usage are severely limited on the login nodes. There are typically many users on the login nodes at one time. Extensive calculations would degrade the responsiveness of those nodes.

If a process is started on the login nodes that is using too much cpu or memory, then it may be killed without warning.

The batch system allows users to submit jobs requesting the resources (nodes, processors, memory, GPUs) that they need. The jobs are queued and then run as resources become available. The scheduling policies in place on the system are an attempt to balance the desire for short queue waits against the need for efficient system utilization.

Interactive vs. batch

When you type commands in a login shell and see a response displayed, you are working interactively. To run a batch job, you put the commands into a text file instead of typing them at the prompt. You submit this file to the batch system, which will run it as soon as resources become available. The output you would normally see on your display goes into a log file. You can check the status of your job interactively and/or receive emails when it begins and ends execution.

Terminology

The batch system used at OSC is SLURM. A central manager slurmctld, monitors resources and work. You’ll need to understand the terms cluster, node,  and processor (core) in order to request resources for your job. See HPC basics if you need this background information.

The words “parallel” and “serial” as used by SLURM can be a little misleading. From the point of view of the batch system a serial job is one that uses just one node, regardless of how many processors it uses on that node. Similarly, a parallel job is one that uses more than one node. More standard terminology considers a job to be parallel if it involves multiple processes.

Batch processing overview

Here is a very brief overview of how to use the batch system.

Choose a cluster

Before you start preparing a job script you should decide which cluster you want your job to run on, Owens or Pitzer. This decision will probably be based on the resources available on each system. Remember which cluster you’re using because the batch systems are independent.

Prepare a job script

Your job script is a text file that includes SLURM directives as well as the commands you want executed. The directives tell the batch system what resources you need, among other things. The commands can be anything you would type at the login prompt. You can prepare the script using any editor.

Submit the job

You submit your job to the batch system using the sbatch command, with the name of the script file as the argument. The sbatch command responds with the job ID that was given to your job, typically a 6- or 7-digit number.

Wait for the job to run

Your job may wait in the queue for minutes or days before it runs, depending on system load and the resources requested. It may then run for minutes or days. You can monitor your job’s progress or just wait for an email telling you it has finished.

Retrieve your output

The log file (screen output) from your job will be in the directory you submitted the job from by default. Any other output files will be wherever your script put them.

Supercomputer: