Scheduling Policies and Limits

The batch scheduler is configured with a number of scheduling policies to keep in mind. The policies attempt to balance the competing objectives of reasonable queue wait times and efficient system utilization. The details of these policies differ slightly on each system. Exceptions to the limits can be made under certain circumstances; contact oschelp@osc.edu for details.

Hardware limits

Each system differs in the number of processors (cores) and the amount of memory and disk they have per node. We commonly find jobs waiting in the queue that cannot be run on the system where they were submitted because their resource requests exceed the limits of the available hardware. Jobs never migrate between systems, so please pay attention to these limits.

Notice in particular the large number of standard nodes and the small number of large-memory nodes. Your jobs are likely to wait in the queue much longer for a large-memory node than for a standard node. Users often inadvertently request slightly more memory than is available on a standard node and end up waiting for one of the scarce large-memory nodes, so check your requests carefully.

See cluster computing for details on the number of nodes for each type.

Walltime limits per job

Serial jobs (that is, jobs which request only one node) can run for up to 168 hours, while parallel jobs may run for up to 96 hours.

Users who can demonstrate a need for longer serial job time may request access to the longserial queue, which allows single-node jobs of up to 336 hours. Longserial access is not automatic. Factors that will be considered include how efficiently the jobs use OSC resources and whether they can be broken into smaller tasks that can be run separately.

Limits per user and group

An individual user can only have a certain number of concurrently running jobs as well as a limited number of cores and GPU's that are being used simultaneously. These limits reduce the number of resources that a user can use, beyond the limit on the number of resources per job. These limits also apply to a group, though they are increased as it is anticipated that an entire group may need to utilize more resources.

To find the limits of the specific system that you are using, you can look up the Batch Limit Rules of your system. All jobs submitted in excess of these limits will be queued but will not be scheduled until other jobs have exited and freed the resources for the user or group.

A user may have no more than 1000 jobs submitted to both the parallel and serial job queue separately. Jobs submitted in excess of this limit will be rejected.

Priority

The priority of a job is influenced by a large number of factors, including the processor count requested, the length of time the job has been waiting, and how much other computing has been done by the user and their group over the last several days. However, having the highest priority does not necessarily mean that a job will run immediately, as there must also be enough processors and memory available to run it.

GPU Jobs

All GPU nodes are reserved for jobs that request gpus. Short non-GPU jobs are allowed to backfill on these nodes to allow for better utilization of cluster resources.

Supercomputer:

Ascend

Cardinal

Pitzer