If you get a license error when you try to run a third-party software application, it means either the licenses are all in use or you’re not on the access list for the license. Very rarely there could be a problem with the license server. You should read the software page for the application you’re trying to use and make sure you’ve complied with all the procedures and are correctly requesting the license. Contact OSC Help with any questions.
Here are a few of the reasons your job may be running slowly:
If your job is misbehaving, it may be necessary for OSC staff to delete it. Common problems are using up all the virtual memory on a node or performing excessive I/O to a network file server. If this happens you will be contacted by OSC Help with an explanation of the problem and suggestions for fixing it. We appreciate your cooperation in this situation because, much as we try to prevent it, one user’s jobs can interfere with the operation of the system.
Occasionally a problem not caused by your job will cause an unrecoverable situation and your job will have to be deleted. You will be contacted if this happens.
If you can’t delete your job, it usually means a node your job was running on has crashed and the job is no longer running. OSC staff will delete the job.
There are multiple reasons that your job may appear to be stuck. If a node that your job is running on crashes, your job may remain in the running job queue long after it should have finished. In this case you will be contacted by OSC and will probably have to resubmit your job.
If you conclude that your job is stuck based on what you see in the slurm output file, it’s possible that the problem is an illusion. This comment applies primarily to code you develop yourself. If you print progress information, for example, “Input complete” and “Setup complete”, the output may be buffered for efficiency, meaning it’s not written to disk immediately, so it won’t show up. To have it written immediate, you’ll have to flush the buffer; most programming languages provide a way to do this.
If your job failed due to a hardware failure or system problem, it may be possible to recover your data from $TMPDIR
. If the failure was due to hitting the walltime limit, the data in $TMPDIR
would have been deleted immediately. Contact OSC Help for more information.
The trap
command can be used in your script to save your data in case your job terminates abnormally.
If you are having a problem with the batch system on any of OSC's machines, you should send email to oschelp@osc.edu. Including the following information will assist HPC Client Services staff in diagnosing your problem quickly:
Or use the support request page.