Search Documentation

Overview

Estimating GPU memory (VRAM) usage for training or running inference with large deep learning models is critical to both (1) requesting the appropriate resources for your computation and (2) optimizing your job once it is set up. Out-of-memory (OOM) errors can be avoided by requesting appropriate resources and by understanding memory usage during the job with the memory profiling tools described here.
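As a rough illustration of the kind of estimate described above, the sketch below computes a lower bound on VRAM from the parameter count alone. It assumes a common rule of thumb that is not taken from this page: training with an Adam-style optimizer stores weights, gradients, and two optimizer states per parameter, and activations, framework overhead, and memory fragmentation are ignored. The function names are hypothetical.

```python
GIB = 2**30  # bytes per GiB


def estimate_inference_vram_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Lower bound for inference: weights only (default assumes 16-bit weights)."""
    return num_params * bytes_per_param / GIB


def estimate_training_vram_gib(
    num_params: int,
    bytes_per_param: int = 4,
    optimizer_states_per_param: int = 2,
) -> float:
    """Lower bound for training: weights + gradients + optimizer states.

    Assumes every tensor uses the same precision and that the optimizer keeps
    `optimizer_states_per_param` extra values per parameter (2 for Adam).
    Activations are NOT included and often dominate at large batch sizes.
    """
    weights = num_params * bytes_per_param
    gradients = num_params * bytes_per_param
    optimizer = num_params * bytes_per_param * optimizer_states_per_param
    return (weights + gradients + optimizer) / GIB


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7-billion-parameter model
    print(f"inference (16-bit): {estimate_inference_vram_gib(params):.1f} GiB")
    print(f"training (32-bit, Adam): {estimate_training_vram_gib(params):.1f} GiB")
```

Treat the result as a floor when requesting GPUs; actual usage should still be confirmed with a memory profiler during the job.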

 

Ascend, Cardinal, Owens, Pitzer

This page outlines how to use the Jupyter interactive app on OnDemand.

Launching Jupyter App

 

Log on to https://ondemand.osc.edu/ with your OSC credentials. Choose Jupyter under the Interactive Apps option.

Cardinal, Pitzer

Rust is a general-purpose programming language with an emphasis on performance, type safety, and concurrency. It enforces memory safety without a traditional garbage collector, preventing data races and memory safety errors via the "borrow checker". The Rust module provides rustc and cargo.

Availability and Restrictions

Versions

The following versions of Rust are available on OSC clusters:

Cardinal, Owens, Pitzer

Rosetta

Cardinal

Hardware Specification

Below is a summary of the hardware information:

Cardinal

The Cardinal cluster is now running on Red Hat Enterprise Linux (RHEL) 9, introducing several software-related changes compared to the RHEL 7 environment used on the Owens and Pitzer clusters. These updates provide access to modern tools and libraries but may also require adjustments to your workflows. Key software changes and available software are outlined in the following sections.

Cardinal

Overview of the High Bandwidth Memory on Cardinal's Dense compute nodes

Cardinal

Compilers

The Cardinal cluster supports C, C++, and Fortran programming languages. The available compiler suites include Intel, oneAPI, and GCC. By default, the Intel development toolchain is loaded. The table below lists the compiler commands and recommended options for compiling serial programs. For more details and best practices, please refer to our compilation guide.

Cardinal

These are the public key fingerprints for Cardinal:

cardinal: ssh_host_rsa_key.pub = 73:f2:07:6c:76:b4:68:49:86:ed:ef:a3:55:90:58:1b
cardinal: ssh_host_ed25519_key.pub = 93:76:68:f0:be:f1:4a:89:30:e2:86:27:1e:64:9c:09
cardinal: ssh_host_ecdsa_key.pub = e0:83:14:8f:d4:c3:c5:6c:c6:b6:0a:f7:df:bc:e9:2e
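The fingerprints above are 16 colon-separated bytes, which matches the legacy MD5 fingerprint format of OpenSSH. A minimal sketch of how such a fingerprint can be derived from a public key line, assuming the usual `type base64-blob comment` layout; the function name is hypothetical and the key in the usage example is a placeholder, not a real host key.

```python
import base64
import hashlib


def md5_fingerprint(public_key_line: str) -> str:
    """Return the colon-separated MD5 fingerprint of an OpenSSH public key line.

    Assumes the line looks like "<key-type> <base64-blob> [comment]".
    """
    fields = public_key_line.split()
    blob = base64.b64decode(fields[1])  # raw key blob that the hash covers
    digest = hashlib.md5(blob).hexdigest()
    # Split the 32 hex characters into 16 byte pairs joined by colons.
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


if __name__ == "__main__":
    # Placeholder blob for illustration only.
    fake_blob = base64.b64encode(b"not-a-real-key").decode()
    print(md5_fingerprint(f"ssh-ed25519 {fake_blob} example@host"))
```

Recent OpenSSH clients show SHA256 fingerprints by default; `ssh-keygen -l -E md5 -f <keyfile>` prints the MD5 form for comparison with the values listed here.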

PyTorch Fully Sharded Data Parallel (FSDP) speeds up model training by parallelizing over the training data and by sharding model parameters, optimizer states, and gradients across multiple PyTorch instances.

 
