Troubleshooting Batch Problems

License problems

If you get a license error when you try to run a third-party software application, it means either the licenses are all in use or you’re not on the access list for the license. Very rarely there could be a problem with the license server. You should read the software page for the application you’re trying to use and make sure you’ve complied with all the procedures and are correctly requesting the license. Contact OSC Help with any questions.

My job is running slower than it should

Here are a few of the reasons your job may be running slowly:

  • Your job has exceeded available physical memory and is swapping to disk. This is always a bad thing in an HPC environment as it can slow down your job dramatically. Either cut down on memory usage, request more memory, or spread a parallel job out over more nodes.
  • Your job isn’t using all the nodes and/or cores you intended it to use. This is usually a problem with your batch script.
  • Your job is spawning more threads than the number of cores you requested. Context switching involves enough overhead to slow your job.
  • You are doing too much I/O to the network file servers (home and project directories), or you are doing an excessive number of small I/O operations to the parallel file server. An I/O-bound program will suffer severe slowdowns with improperly configured I/O.
  • You didn’t optimize your program sufficiently.
  • You got unlucky and are being hurt by someone else’s misbehaving job. As much as we try to isolate jobs from each other, sometimes a job can cause system-level problems. If you have run your job before and know that it usually runs faster, OSC staff can check for problems.

Someone deleted my job!

If your job is misbehaving, it may be necessary for OSC staff to delete it. Common problems are using up all the virtual memory on a node or performing excessive I/O to a network file server. If this happens you will be contacted by OSC Help with an explanation of the problem and suggestions for fixing it. We appreciate your cooperation in this situation because, much as we try to prevent it, one user’s jobs can interfere with the operation of the system.

Occasionally a problem not caused by your job will cause an unrecoverable situation and your job will have to be deleted. You will be contacted if this happens.

Why can’t I delete my job?

If you can’t delete your job, it usually means a node your job was running on has crashed and the job is no longer running. OSC staff will delete the job.

My job is stuck.

There are multiple reasons that your job may appear to be stuck. If a node that your job is running on crashes, your job may remain in the running job queue long after it should have finished. In this case you will be contacted by OSC and will probably have to resubmit your job.

If you conclude that your job is stuck based on what you see in qpeek, it’s possible that the problem is an illusion. This comment applies primarily to code you develop yourself. If you print progress information, for example, “Input complete” and “Setup complete”, the output may be buffered for efficiency, meaning it’s not written to disk immediately, so it won’t show up in qpeek. To have it written immediate, you’ll have to flush the buffer; most programming languages provide a way to do this.

My job crashed. Can I recover my data?

If your job failed due to a hardware failure or system problem, it may be possible to recover your data from $TMPDIR. If the failure was due to hitting the walltime limit, the data in $TMPDIR would have been deleted immediately. Contact OSC Help for more information.

The trap command can be used in your script to save your data in case your job terminates abnormally.

Contacting OSC Help

If you are having a problem with the batch system on any of OSC's machines, you should send email to oschelp@osc.edu. Including the following information will assist HPC Client Services staff in diagnosing your problem quickly:

  1. Name
  2. OSC User ID (username)
  3. Name of the system you are using (Oakley or Glenn)
  4. Job ID
  5. Job script
  6. Job output and/or error messages (preferably in context)