Batch Tutorial

Description

This tutorial guides you through creating a batch script and submitting it on one of our compute clusters. It is a linux tutorial that uses batch scripting as an example: the primary goal is not to teach you batch scripting, but to make you familiar with certain linux commands. Other pages on the OSC web site go into the details of submitting a job with a batch script.

Prerequisites

  • Familiarity with a text editor (emacs, nano, vim)
  • Basic understanding of the unix command line

Goals

  • Create subdirectories to organize information
  • Create a batch script with a text editor
  • Submit a job
  • Check on the progress of the job
  • Change the permissions of the output files
  • Get familiar with some common unix commands

Step 1 - Organize your directories

When you first log in to Glenn or Oakley, you are in your home directory. For the purposes of this illustration, we will pretend you are user osu0001 and your nfs file server number is 01, but when you try out commands you must use your own username and nfs file server number.

$ pwd
/nfs/01/osu0001
 
Note: you will see your user name and a different number after the /nfs.
 
It's a good idea to organize your work into separate directories. If you have used Windows or the Mac operating system, you may think of these as folders. Each folder may contain files and subfolders. The subfolders may contain other files and subfolders of their own. In linux we use the term "directory" instead of "folder." Use directories to organize your work.
 
Type the following six lines and take note of the output after each one:
 
$ touch foo1
$ touch foo2
$ ls
$ ls -l
$ ls -lt
$ ls -ltr
 
The "touch" command just creates an empty file with the name you give it.
You probably already know that the ls command shows the contents of the current working directory; that is, the directory you see when you type pwd. But what is the point of "-l", "-lt" or "-ltr"? You probably noticed the difference in output between "ls" and "ls -l".
Most unix commands have options that change the way the command works. An option is specified with a "-" (minus sign) followed by a single letter, and several single-letter options can be combined after one "-". "ls -ltr" actually specifies three options to the ls command:
  • l: show the output in long format, one file per line with some interesting information about each file
  • t: sort the files by when they were last modified, most-recently modified first
  • r: reverse the order of display (combined with -t, this lists the most-recently modified file last; it should be foo2 in this case)
 
I like using "ls -ltr" because I find it convenient to see the most recently modified file at the end of the list.
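If you want to convince yourself that "ls -ltr" is just the three options -l, -t and -r written together, the sketch below compares the two forms. It works in a throwaway directory made with mktemp -d (an addition, not part of the tutorial) so your home directory is not touched, and the sleep makes sure foo2 really is the newer file.

```shell
#!/bin/bash
# Sketch: "ls -ltr" is the same as giving -l, -t and -r separately.
# Work in a throwaway directory so your home directory is untouched.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch foo1
sleep 1                 # give foo2 a later modification time than foo1
touch foo2
combined=$(ls -ltr)     # options written together
separate=$(ls -l -t -r) # the same options written one at a time
echo "$combined"
[ "$combined" = "$separate" ] && echo "same output"
```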
 
Now try this:
$ mkdir BatchTutorial
$ ls -ltr
 
The "mkdir" command makes a new directory with the name you give it. The new directory is created inside the current working directory, which is your current position in the hierarchy of directories. The 'pwd' command shows that you are still in your home directory:
 
$ pwd
/nfs/01/osu0001
 
Now try this:
$ cd BatchTutorial
$ pwd
 
What is the output of 'pwd' now? "cd" is short for "change directory" -- think of it as moving you into a different place in the hierarchy of directories. Now do
$ cd ..
$ pwd
Where are you now?
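The round trip above can be retraced in a small script. This is only a sketch: a temporary directory from mktemp -d stands in for your home directory, and the variable names start, inside and back are made up for illustration.

```shell
#!/bin/bash
# Sketch: mkdir, cd into the new directory, then cd .. back to the parent.
start=$(mktemp -d)      # stands in for your home directory
cd "$start" || exit 1
mkdir BatchTutorial
cd BatchTutorial || exit 1
inside=$(pwd)           # .../BatchTutorial
cd ..
back=$(pwd)             # ".." is the parent, so we are back where we started
echo "inside: $inside"
echo "back:   $back"
```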

Step 2 -- Get familiar with some more unix commands

Try the following:

$ echo where am I?
$ echo I am in `pwd`
$ echo my home directory is $HOME
$ echo HOME
$ echo this directory contains `ls -l`

These examples show what the echo command does and how to do some interesting things with it. Text between backquotes, such as `pwd`, is replaced by the output of that command. HOME is an example of an environment variable: a name that stands for some other string. HOME is defined when you log in to a unix system, and $HOME means the string that the variable HOME stands for. Notice that "echo HOME" does not do the substitution. Also notice that the last example is not formatted the way you might like: because `ls -l` is not inside double quotes, the line breaks in its output are turned into spaces.
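A short sketch of the two points above, assuming bash: $(...) is the newer spelling of backquotes and gives the same result, and double quotes decide whether line breaks in a command's output survive. The printf command here just fakes a three-line output.

```shell
#!/bin/bash
# Backquotes and $(...) both perform command substitution.
here1=`pwd`
here2=$(pwd)
echo "backquotes:   $here1"
echo "dollar-paren: $here2"

# A fake three-line command output.
lines=$(printf 'one\ntwo\nthree')
echo $lines      # unquoted: line breaks become spaces -> one two three
echo "$lines"    # quoted: the three lines are kept
```

So `echo "this directory contains $(ls -l)"` keeps one file per line.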

Some more commands to try:

$ cal
$ cal > foo3
$ cat foo3
$ whoami
$ date

Using ">" after a command puts the output of the command into a file with the name you specify. If that file already exists, its old contents are replaced. The "cat" command prints the contents of a file to the screen.
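A minimal sketch of the redirection rule, with echo standing in for cal so the file contents are predictable. It also shows ">>", which appends to a file instead of replacing it.

```shell
#!/bin/bash
# Work in a throwaway directory.
cd "$(mktemp -d)" || exit 1
echo first > foo3     # ">" creates foo3
echo second > foo3    # ">" again: foo3 now holds only "second"
cat foo3
echo third >> foo3    # ">>" appends instead of overwriting
cat foo3              # second, then third
```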

Two very important UNIX commands are the cp and mv commands. Assume you have a file called foo3 in your current directory created by the "cal > foo3" command. Suppose you want to make a copy of foo3 called foo4. You would do this with the following command:

$ cp foo3 foo4
$ ls -ltr

Now suppose you want to rename the file 'foo4' to 'foo5'. You do this with:

$ mv foo4 foo5
$ ls -ltr

'mv' is short for 'move' and it is used for renaming files. It can also be used to move a file to a different directory.

$ mkdir CalDir
$ mv foo5 CalDir
$ ls
$ ls CalDir

Notice that if you give a directory name to the "ls" command, it shows you what is in that directory rather than in the current working directory.

Now try the following:

$ ls CalDir
$ cd CalDir
$ ls
$ cd ..
$ cp foo3 CalDir
$ ls CalDir

Notice that you can use the "cp" command to copy a file to a different directory -- the copy will have the same name as the original file. What if you forget to do the mkdir first?

$ cp foo3 FooDir

Now what happens when you do the following:

$ ls FooDir
$ cd FooDir
$ cat CalDir
$ cat FooDir
$ ls -ltr

CalDir is a directory, but FooDir is a regular file. You can tell this from the "d" at the start of the permissions string (the first column) in the "ls -ltr" output. That is what happens when you cp or mv a file to a directory that doesn't exist: a regular file with the target name is created (with mv, the original file is simply renamed). Imagine running a program and copying the resulting files to a directory called Output, but forgetting to create the directory first. This is a fairly common mistake.
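Two habits that guard against this mistake are sketched below in a throwaway directory: create the target directory first with "mkdir -p" (which does nothing if the directory already exists), and end the target name with a slash. With GNU cp, the version found on typical linux systems, a trailing slash makes cp report an error rather than create a file when the directory is missing. The names Output and NoSuchDir are just examples.

```shell
#!/bin/bash
cd "$(mktemp -d)" || exit 1
echo results > foo3

# Habit 1: create the directory first (-p: no error if it already exists).
mkdir -p Output
cp foo3 Output          # copies to Output/foo3

# Habit 2: a trailing slash says "this must be a directory",
# so a missing directory gives an error instead of a new file.
if ! cp foo3 NoSuchDir/ 2>/dev/null; then
  echo "cp refused: NoSuchDir is not a directory"
fi
ls -l
```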

Step 3 -- Environment Variables

Before we move on to creating a batch script, you need to know more about environment variables. An environment variable is a word that stands for some other text. We have already seen an example of this with the variable HOME. Try this:

$ MY_ENV_VAR="something I would rather not type over and over"
$ echo MY_ENV_VAR
$ echo $MY_ENV_VAR
$ echo "MY_ENV_VAR stands for $MY_ENV_VAR"

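You define a variable by assigning some text to it with the equals sign (with no spaces around the "="). That's what the first line above does. When you use '$' followed by the name of the variable on a command line, the shell makes the substitution. If you forget the '$', the substitution is not made. Strictly speaking, a variable defined this way is a shell variable: it becomes part of the environment, and is visible to programs and scripts you start from this shell, only after you run "export MY_ENV_VAR".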
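The difference between a plain shell variable and an exported one shows up as soon as you start another program, such as a script. In this sketch "bash -c" starts a child shell; it assumes MY_ENV_VAR was not already in your environment before you set it.

```shell
#!/bin/bash
MY_ENV_VAR="something I would rather not type over and over"
# A child shell does not see a plain (unexported) shell variable...
bash -c 'echo "child sees: [$MY_ENV_VAR]"'
# ...but it does once the variable has been exported.
export MY_ENV_VAR
bash -c 'echo "child sees: [$MY_ENV_VAR]"'
```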

There are some environment variables that come pre-defined when you log in to Glenn or Oakley. Try using 'echo' to see the values of the following variables: HOME, HOSTNAME, SHELL, TERM, PATH.
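Instead of typing one echo per variable, a bash loop can print them all. This sketch uses bash's indirect expansion ${!name}, which gives the value of the variable whose name is stored in name; some of these variables (TERM, for example) may be empty in non-interactive sessions.

```shell
#!/bin/bash
# Print each predefined variable as NAME=value.
for name in HOME HOSTNAME SHELL TERM PATH; do
  echo "$name=${!name}"   # ${!name}: value of the variable named by $name
done
```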

Now you are ready to use some of this unix knowledge to create and run a script.

Step 4 -- Create and run a script

Before we create a batch script and submit it to a compute node, we will do something a bit simpler. We will create a regular script file that will be run on the login node. A script is just a file that consists of unix commands that will run when you execute the script file. It is a way of gathering together a bunch of commands that you want to execute all at once. You can do some very powerful things with scripting to automate tasks that are tedious to do by hand, but we are just going to create a script that contains a few commands we could easily type in. This is to help you understand what is happening when you submit a batch script to run on a compute node.

Use a text editor to create a file named "tutorial.sh" which contains the following text (note that with emacs or nano you can use the mouse to select text and then paste it into the editor with the middle mouse button):

#!/bin/bash
echo ----
echo Job started at `date`
echo ----
echo This job is working on node `hostname`

SH_WORKDIR=`pwd`
echo working directory is $SH_WORKDIR
echo ----
echo The contents of $SH_WORKDIR
ls -ltr
echo
echo ----
echo
echo creating a file in SH_WORKDIR
whoami > whoami-sh-workdir

SH_TMPDIR=${SH_WORKDIR}/sh-temp
mkdir -p $SH_TMPDIR
cd $SH_TMPDIR
echo ----
echo TMPDIR IS `pwd`
echo ----
echo wait for 12 seconds
sleep 12
echo ----
echo creating a file in SH_TMPDIR
whoami > whoami-sh-tmpdir

# create the output subdirectory if needed, then copy the file back to it
mkdir -p ${SH_WORKDIR}/output
cp ${SH_TMPDIR}/whoami-sh-tmpdir ${SH_WORKDIR}/output

cd $SH_WORKDIR

echo ----
echo Job ended at `date`

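To run it, first give yourself (the file's owner) permission to execute it. In "chmod u+x", u stands for user (the owner), + means add, and x means execute permission; "o+x" would instead grant execute permission to others and would not let you run the script yourself. The ./ tells the shell to look for tutorial.sh in the current directory.

$ chmod u+x tutorial.sh
$ ./tutorial.sh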
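To see what chmod changed, compare the "ls -l" listing before and after. This sketch builds a tiny stand-in tutorial.sh in a throwaway directory; the exact permission letters you see depend on your default settings (your umask), but the owner's x should appear only after chmod u+x.

```shell
#!/bin/bash
cd "$(mktemp -d)" || exit 1
# A tiny stand-in for tutorial.sh.
printf '#!/bin/bash\necho hello from the script\n' > tutorial.sh
ls -l tutorial.sh                        # no x in the owner's rwx yet
[ -x tutorial.sh ] || echo "not executable yet"
chmod u+x tutorial.sh                    # u = owner, + = add, x = execute
ls -l tutorial.sh                        # owner's x is now set
./tutorial.sh
```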

Step 5 -- Create and run a batch job

Use your favorite text editor to create a file called tutorial.pbs in the BatchTutorial directory which has the following contents (remember, you can use the mouse to cut and paste text):

#PBS -l walltime=00:02:00
#PBS -l nodes=1:ppn=1
#PBS -N foobar
#PBS -j oe
#PBS -r n

echo ----
echo Job started at `date`
echo ----
echo This job is working on compute node `cat $PBS_NODEFILE`

cd $PBS_O_WORKDIR
echo show what PBS_O_WORKDIR is
echo PBS_O_WORKDIR IS `pwd`
echo ----
echo The contents of PBS_O_WORKDIR:
ls -ltr
echo
echo ----
echo
echo creating a file in PBS_O_WORKDIR
whoami > whoami-pbs-o-workdir

cd $TMPDIR
echo ----
echo TMPDIR IS `pwd`
echo ----
echo wait for 42 seconds
sleep 42
echo ----
echo creating a file in TMPDIR
whoami > whoami-tmpdir

# create the output subdirectory if needed, then copy the file back to it
mkdir -p $PBS_O_WORKDIR/output
pbsdcp -g $TMPDIR/whoami-tmpdir $PBS_O_WORKDIR/output

echo ----
echo Job ended at `date`
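Before submitting, you can ask bash to read the script and report syntax errors without running any of it, for example "bash -n tutorial.pbs". The #PBS lines are ordinary comments to bash, so the check ignores them, and it cannot tell whether commands such as pbsdcp exist. This sketch assumes your job script is run by bash and uses two made-up files, good.pbs and bad.pbs (the latter is missing its closing fi).

```shell
#!/bin/bash
cd "$(mktemp -d)" || exit 1
cat > good.pbs <<'EOF'
#PBS -l walltime=00:02:00
echo Job started at `date`
EOF
cat > bad.pbs <<'EOF'
#PBS -l walltime=00:02:00
if [ -d output ]; then
  echo output exists
EOF
# -n: parse only, do not execute
bash -n good.pbs && echo "good.pbs: no syntax errors"
bash -n bad.pbs || echo "bad.pbs: syntax error found"
```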
 
To submit the batch script, type

$ qsub tutorial.pbs

Use qstat -u [username] to check on the progress of your job. If you see something like this

$ qstat -u osu0001

                                                                             Req'd  Req'd   Elap
Job ID             Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
------------------ ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
458842.oak-batch   osu0001     serial   foobar              --      1      1    --  00:02 Q   --
 
this means the job is in the queue -- it hasn't started yet. That is what the "Q" under the S column means.
 
If you see something like this:
                                                                             Req'd  Req'd   Elap
Job ID             Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
------------------ ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
458842.oak-batch   osu0001     serial   foobar            26276     1      1    --  00:02 R   --

this means the job is running and has job id 458842. That is what the "R" under the S column means.
 
When the output of the qstat command is empty, the job is done.
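If you want a script to decide whether a job is queued or running, the state is the letter in the S column. This sketch pulls it out of the sample line shown above with awk; it assumes the column layout of that listing (the state is the 10th whitespace-separated field), which may differ on other systems or qstat versions.

```shell
#!/bin/bash
# Sample line copied from the listing above; on the cluster you could use
#   line=$(qstat -u osu0001 | grep foobar)
line='458842.oak-batch   osu0001     serial   foobar            26276     1      1    --  00:02 R   --'
state=$(echo "$line" | awk '{print $10}')   # 10th field is the S column
case $state in
  Q) echo "queued: not started yet" ;;
  R) echo "running" ;;
  *) echo "state: $state" ;;
esac
```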
 
After it is done, there should be a file called "foobar.o458842" in the directory. Your file will end with a different number, namely the job id number assigned to your job. Check this with

$ ls -ltr
$ cat foobar.oNNNNNN

where NNNNNN is your job id.

The name of this file is determined by two things:
  1. The name you give the job in the script file with the header line #PBS -N foobar
  2. The job id number assigned to the job.

The name of the script file (tutorial.pbs) has nothing to do with the name of the output file.
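Because the output file name is just the job name, ".o", and the job number, a script can work it out. This sketch assumes that qsub prints the job id (which includes the server name after a dot, as in the qstat listing) and uses that sample id instead of actually submitting; ${JOBID%%.*} is parameter expansion that removes everything from the first dot onward.

```shell
#!/bin/bash
# On the cluster you could capture the id when submitting:
#   JOBID=$(qsub tutorial.pbs)
# Here the sample id from the qstat listing stands in for qsub's output.
JOBID="458842.oak-batch"
JOBNAME="foobar"             # must match the "#PBS -N foobar" line
JOBNUM=${JOBID%%.*}          # drop everything from the first "." onward
OUTFILE="${JOBNAME}.o${JOBNUM}"
echo "expected output file: $OUTFILE"
```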

Examine the contents of the output file foobar.oNNNNNN carefully. You should be able to see the results of some of the commands you put in tutorial.pbs. It also shows you the compute node listed in the file $PBS_NODEFILE and the directories that PBS_O_WORKDIR and TMPDIR stand for. These variables are defined only inside your job while it is running. Try

$ echo $PBS_O_WORKDIR

and you will see that it is not defined in your login session. $PBS_NODEFILE holds the name of a file which contains a list of all the nodes your job is running on. Because this script has the line

#PBS -l nodes=1:ppn=1

the file named by $PBS_NODEFILE contains the name of a single compute node.
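For jobs that request more processors, the same file can tell you how many you got. The sketch assumes the Torque/PBS convention that each node name appears once per processor requested on it (so nodes=2:ppn=2 would give four lines). Outside a job PBS_NODEFILE is not set, so the script falls back to a made-up sample file; the node names n0101 and n0102 are hypothetical.

```shell
#!/bin/bash
# Outside a job, fall back to a hypothetical sample: 2 nodes x 2 processors.
if [ -z "${PBS_NODEFILE:-}" ]; then
  PBS_NODEFILE=$(mktemp)
  printf '%s\n' n0101 n0101 n0102 n0102 > "$PBS_NODEFILE"
fi
echo "node list:"
cat "$PBS_NODEFILE"
NSLOTS=$(wc -l < "$PBS_NODEFILE")            # one line per processor slot
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l)    # duplicates removed: one per node
echo "processor slots: $NSLOTS, distinct nodes: $NNODES"
```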

Notice that $TMPDIR is /tmp/pbstmp.NNNNNN (again, NNNNNN is the id number for this job.) Try

$ ls /tmp/pbstmp.NNNNNN

Why doesn't this directory exist? Because it is a directory on the compute node, not on the login node. Each machine in the cluster has its own /tmp directory, and they do not contain the same files and subdirectories. All machines see the same /nfs directories (your home directory /nfs/01/osu0001, for example) but not the same /tmp. This holds for all the login and compute nodes on both Glenn and Oakley. To repeat: the /nfs directories are shared by all the nodes (login or compute), but each node has its own /tmp directory (as well as other unshared directories).