System Email

Occasionally, jobs that experience problems may generate emails from staff or automated systems at the center with some information about the nature of the problem. These pages provide additional information about the various emails sent and the steps you can take to address the problem.

Batch job aborted

Purpose

Notify you when your job terminates abnormally.

Sample subject line

PBS JOB 944666.oak-batch.osc.edu

Apparent sender

  • root <adm@oak-batch.osc.edu> (Oakley)
  • root <pbs-opt@hpc.osc.edu> (Glenn)

Sample contents

PBS Job Id: 935619.oak-batch.osc.edu
Job Name:   mailtest.job
Exec host:  n0587/5
Aborted by PBS Server
Job exceeded some resource limit (walltime, mem, etc.). Job was aborted See Administrator for help

Sent under these circumstances

These are fully automated emails sent by the batch system.

Some reasons a job might terminate abnormally:

  • The job exceeded its allotted walltime, memory, virtual memory, or other limited resource. More information is available in your job log file, e.g., jobname.o123456.
  • An unexpected system problem caused your job to fail.

To turn off the emails

There is no way to turn them off at this time.

To prevent these problems

For advice on monitoring and controlling resource usage, see Monitoring and Managing Your Job.

There’s not much you can do about system failures, which fortunately are rare.

Notes

Under some circumstances you can retrieve your job output log if your job aborts due to a system failure. Contact oschelp@osc.edu for assistance.

For assistance

Contact OSC Help. See our Support Services page for more contact information.

Batch job begin or end

Purpose

Notify you when your job begins or ends.

Sample subject line

PBS JOB 944666.oak-batch.osc.edu

Apparent sender

  • root <adm@oak-batch.osc.edu> (Oakley)
  • root <pbs-opt@hpc.osc.edu> (Glenn)

Sample contents

PBS Job Id: 944666.oak-batch.osc.edu
Job Name:   mailtest.job
Exec host:  n0587/1
Begun execution
 
PBS Job Id: 944666.oak-batch.osc.edu
Job Name:   mailtest.job
Exec host:  n0587/1
Execution terminated
Exit_status=0
resources_used.cput=00:00:00
resources_used.mem=2228kb
resources_used.vmem=211324kb
resources_used.walltime=00:01:00

Sent under these circumstances

These are fully automated emails sent by the batch system. You control them through the headers in your job script. The following line requests emails at the beginning, ending, and abnormal termination of your job.

#PBS -m abe

To turn off the emails

Remove the -m option from your script and/or command line or use -m n. See PBS Directives Summary.

Notes

You can add the following command at the end of your script to have resource information written to your job output log:

ja
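The resources_used fields in the end-of-job email can also be compared against what the job requested. The following is a minimal sketch, not an OSC-provided tool: the function name hms_to_seconds and the 10-minute request are assumptions made for illustration, while the 00:01:00 value is taken from the sample email above.

```shell
# Convert an HH:MM:SS duration, as shown in resources_used.walltime,
# into seconds. The function name is hypothetical.
hms_to_seconds() {
    # awk treats each field as a decimal number, so leading zeros are safe.
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

used=$(hms_to_seconds "00:01:00")        # walltime from the sample email
requested=$(hms_to_seconds "00:10:00")   # assumed request, for illustration
echo "Used ${used}s of ${requested}s requested"
# prints: Used 60s of 600s requested
```

A job that regularly uses only a small fraction of its requested walltime can usually request less.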

For more information

See PBS Directives Summary.

For assistance

Contact OSC Help. See our Support Services page for more contact information.

Batch job deleted by an administrator

Purpose

Notify you when your job is deleted by an administrator.

Sample subject line

PBS JOB 9657213.opt-batch.osc.edu

Apparent sender

  • root <adm@oak-batch.osc.edu> (Oakley)
  • root <pbs-opt@hpc.osc.edu> (Glenn)

Sample contents

PBS Job Id: 9657213.opt-batch.osc.edu
Job Name:   mailtest.job
job deleted
Job deleted at request of staff@opt-login04.osc.edu Job using too much memory. Contact oschelp@osc.edu.

Sent under these circumstances

These emails are sent automatically, but the administrator can add a note with the reason.

Some reasons a running job might be deleted:

  • The job is using so much memory that it threatens to crash the node it is running on.
  • The job is using more resources than it requested and is interfering with other jobs running on the same node.
  • The job is causing excessive load on some part of the system, typically a network file server.
  • The job is still running at the start of a scheduled downtime.

Some reasons a queued job might be deleted:

  • The job requests non-existent resources.
  • A job apparently intended for Oakley (ppn=12) was submitted on Glenn.
  • The job can never run because it requests combinations of resources that are disallowed by policy.
  • The user’s credentials are blocked on the system the job was submitted on.

To turn off the emails

There is no way to turn them off at this time.

To prevent these problems

See the Supercomputing FAQ for suggestions on dealing with specific problems.

For assistance

We will work with you to get your jobs running within the constraints of the system. Contact OSC Help for assistance. See our Support Services page for more contact information.

Emails exceeded the expected volume

Purpose

Notify you that we have placed a hold on emails sent to you from the HPC system.

Sample subject line

Emails sent to email address student@buckeyemail.osu.edu in the last hour exceeded the expected volume

Apparent sender

OSC Help <OSCHelp@osc.edu>

Explanation

When a job fails or is deleted by an administrator, the system sends you an email. If this happens with a large number of jobs, it generates a volume of email that may be viewed as spam by your email provider. To avoid having OSC blacklisted, and to avoid overloading your email account, we hold your emails from OSC.

Please note that these held emails will eventually be deleted if you do not contact us.

Sent under these circumstances

These emails are sent automatically when the volume of email from OSC to your address exceeds the expected level and further emails are placed on hold.

To turn off the emails

Turn off emails related to your batch jobs to reduce your overall email volume from OSC. See the -m option on the PBS Directives Summary page.
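If you submit many jobs at once, one way to apply this is to pass -m n on each qsub command line. The following is a minimal sketch under stated assumptions: the function name and the job script names are hypothetical, and the sketch only prints the commands so that you can review them before running anything.

```shell
# Print (rather than run) one qsub command per job script, with mail
# disabled through -m n. The function name is hypothetical.
build_qsub_commands() {
    for script in "$@"; do
        echo "qsub -m n $script"
    done
}

# Hypothetical job script names, for illustration only.
build_qsub_commands job1.pbs job2.pbs
# prints:
#   qsub -m n job1.pbs
#   qsub -m n job2.pbs
```

A -m option given on the command line applies to that submission, so the job scripts themselves do not need to be edited.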

Notes

To re-enable email you must contact OSC Help.

For assistance

Contact OSC Help. See our Support Services page for more contact information.

File system load problem

Purpose

Notify you that one or more of your jobs caused excessive load on one of the network file system directory servers.

Sample subject line

Your jobs on Oakley are causing excessive load on fs14

Apparent sender

OSC Help <OSCHelp@osc.edu> or an individual staff member

Explanation

Your jobs are causing problems with one of the network file servers. This is usually caused by submitting a large number of jobs that start at the same time and execute in lockstep.

Sent under these circumstances

These emails are sent by a staff member when the high load is traced to your jobs. Often the jobs have to be stopped or deleted.

To turn off the emails

You cannot turn off these emails. Please don’t ignore them because they report a problem that you must correct.

To prevent these problems

See the Knowledge Base article (coming soon) for suggestions on dealing with file system load problems.

For information on the different file systems available at OSC, see Available File Systems.
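Because the usual cause is many jobs starting together and running in lockstep, one possible mitigation is to stagger the start of file activity in each job. This is an illustrative assumption, not a documented OSC recommendation. The sketch below derives a delay from the numeric part of the job ID; the function name stagger_delay and the 60-second bound are hypothetical choices.

```shell
# Derive a start delay (in seconds) from the numeric part of a job ID.
stagger_delay() {
    # $1: job ID such as 944666.oak-batch.osc.edu
    # $2: upper bound for the delay, in seconds
    job_number=$(echo "${1%%.*}" | tr -cd '0-9')
    echo $(( ${job_number:-0} % $2 ))
}

# PBS_JOBID is set by the batch system inside a running job; default to 0
# so that the sketch also runs outside of a job.
delay=$(stagger_delay "${PBS_JOBID:-0}" 60)
echo "Sleeping ${delay}s before starting file I/O"
sleep "$delay"
```

Jobs submitted together usually receive consecutive job IDs, so they get different delays. The sleep counts against the job's walltime, so keep the bound small relative to the walltime you request.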

Notes

If you continue to submit jobs that cause these problems, your HPC account may be blocked.

For assistance

We will work with you to get your jobs running within the constraints of the system. Contact OSC Help for assistance. See our Support Services page for more contact information.

Job failure due to a system hardware problem

Purpose

Notify you that one or more of your jobs was running on a compute node that crashed due to a hardware problem.

Sample subject line

Failure of job(s) 919137 due to a hardware problem at OSC

Apparent sender

OSC Help <OSCHelp@osc.edu>

Explanation

Your job failed through no fault of its own. You should resubmit the job.

Sent under these circumstances

These emails are sent by a systems administrator after a node crashes.

To turn off the emails

We don’t have a mechanism to turn off these emails. If they really bother you, contact OSC Help and we’ll try to accommodate you.

To prevent these problems

Hardware crashes are quite rare and in most cases there’s nothing you can do to prevent them. Certain types of bus errors on Glenn correlate strongly with certain applications (suggesting that they’re not really hardware errors). If you encounter this type of error you may be advised to use Oakley rather than Glenn.

For assistance

Contact OSC Help. See our Support Services page for more contact information.

Job failure due to a system software problem

Purpose

Notify you that one or more of your jobs was running on a compute node that crashed due to a system software problem.

Sample subject line

Failure of job(s) 919137 due to a system software problem at OSC

Apparent sender

OSC Help <OSCHelp@osc.edu>

Explanation

Your job failed through no fault of its own. You should resubmit the job. Usually the problems are caused by another job running on the node.

Sent under these circumstances

These emails are sent by a systems administrator as part of the node cleanup process.

To turn off the emails

We don’t have a mechanism to turn off these emails. If they really bother you, contact OSC Help and we’ll try to accommodate you.

To prevent these problems

If you request a whole node (nodes=1:ppn=12 on Oakley or nodes=1:ppn=8 on Glenn) your jobs will be less susceptible to problems caused by other jobs. Other than that, be assured that we work hard to keep jobs from interfering with each other.
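The whole-node requests above differ by cluster, and a request written for one cluster can be rejected or deleted on the other. The following is a minimal sketch that picks the request string by cluster name; the function name whole_node_request and the script name myjob.pbs are hypothetical, while the ppn values are the ones given above.

```shell
# Print the -l resource string that requests one whole node on the
# named cluster. The function name is hypothetical.
whole_node_request() {
    case "$1" in
        oakley) echo "nodes=1:ppn=12" ;;
        glenn)  echo "nodes=1:ppn=8" ;;
        *)      echo "unknown cluster: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the qsub command for a hypothetical job script.
echo "qsub -l $(whole_node_request oakley) myjob.pbs"
# prints: qsub -l nodes=1:ppn=12 myjob.pbs
```

The same request can instead be placed in the job script as a #PBS -l line.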

For assistance

Contact OSC Help. See our Support Services page for more contact information.

Job failure due to exhaustion of physical memory

Purpose

Notify you that one or more of your jobs caused compute nodes to crash with an out-of-memory error.

Sample subject line

Failure of job(s) 933014,933174 at OSC due to exhaustion of physical memory

Apparent sender

OSC Help <oschelp@osc.edu>

Explanation

Your job(s) exhausted both physical memory and swap space during job execution. This failure caused the compute node(s) used by the job(s) to crash, requiring a reboot.

Sent under these circumstances

These emails are sent by a systems administrator as part of the node cleanup process.

To turn off the emails

You cannot turn off these emails. Please don’t ignore them because they report a problem that you must correct.

To prevent these problems

See the Knowledge Base article "Out-of-Memory (OOM) or Excessive Memory Usage" for suggestions on dealing with out-of-memory problems.

For information on the memory available on the various systems, see our Supercomputing page.

Notes

If you continue to submit jobs that cause these problems, your HPC account may be blocked.

For assistance

We will work with you to get your jobs running within the constraints of the system. Contact OSC Help for assistance. See our Support Services page for more contact information.