Operations

Systemic problem with the cluster computing service

4:20PM 6/23/2017 Update: All HPC systems are back in production. The outage may have caused some user jobs to fail. We'll update the community as more is known.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

3:40PM 6/23/2017 Update: All HPC systems are back in production except for the scratch service (/fs/scratch). This outage may cause some user jobs to fail. We'll update the community as more is known.

All HPC systems are available

8/24/16 3:57PM: All HPC systems are available, including:

  • Oakley cluster for general access
  • Ruby cluster for restricted access
  • Owens cluster for early users
  • Home directory and scratch file systems
  • OnDemand and other web portals
  • Project file system (/fs/project)

All jobs held before the downtime have been released by the batch scheduler. If your jobs are still held or you have any questions, please contact oschelp@osc.edu.
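As a quick self-check before contacting support, here is a minimal sketch of listing your still-held jobs. It assumes the Torque/Moab qstat command in use on these clusters at the time; the exact column layout of qstat -u output varies by site, so the assumption that the state column is the second-to-last field may need adjusting.

    # Minimal sketch: list this user's held jobs by parsing `qstat -u $USER`
    # (Torque/Moab). Assumes the job state is the second-to-last column,
    # which matches common Torque output but is not guaranteed everywhere.
    import getpass
    import subprocess

    def held_jobs(user):
        """Return IDs of jobs whose state column reads 'H' (held)."""
        out = subprocess.run(
            ["qstat", "-u", user],
            capture_output=True, text=True, check=True,
        ).stdout
        held = []
        for line in out.splitlines():
            fields = line.split()
            # Data rows begin with a numeric job ID such as "1234567.oak-batch".
            if len(fields) >= 2 and fields[0][:1].isdigit() and fields[-2] == "H":
                held.append(fields[0])
        return held

    if __name__ == "__main__":
        jobs = held_jobs(getpass.getuser())
        print("Held jobs:", ", ".join(jobs) if jobs else "none found")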

Some issues remain after downtime

15 July 2016, 5:00PM update: Some additional issues we are facing:

  • We are experiencing periodic hangs of the GPFS client file system software used with the new storage environment. We have an open support case with the vendor, but no solution at this time. This may affect access to the /fs/project and /fs/scratch file systems, and users have reported transfer failures to these file systems through scp.osc.edu and sftp.osc.edu. (A sketch for checking whether these mounts are responding follows this list.)
  • Symlinks transferred from /nfs/gpfs to /fs/project are lost (fixed)
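For users who want to tell a hung GPFS client apart from an ordinary error, here is a minimal sketch. It assumes that a hung client makes stat() block rather than return an error; the paths and the 10-second timeout are illustrative.

    # Minimal sketch: probe whether a mount answers stat() within a timeout.
    # A hung GPFS client typically blocks rather than returning an error,
    # so a prompt answer of any kind counts as "responsive".
    import os
    import threading

    def mount_responds(path, timeout=10.0):
        """Return True if os.stat(path) completes within `timeout` seconds."""
        done = threading.Event()

        def probe():
            try:
                os.stat(path)
            except OSError:
                pass  # even an error is a prompt answer
            done.set()

        # Daemon thread, so a stat() stuck in the kernel cannot keep the
        # process alive after we give up waiting.
        threading.Thread(target=probe, daemon=True).start()
        return done.wait(timeout)

    if __name__ == "__main__":
        for fs in ("/fs/project", "/fs/scratch"):
            status = "responsive" if mount_responds(fs) else "hung or unavailable"
            print(fs + ": " + status)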

2/26 Downtime Difficulties

All systems should be functioning normally. Please report any remaining issues to OSC Help.

----
A number of systems are still experiencing problems after yesterday's downtime. Currently, the following systems are still offline:

  • Oakley (returned to service)
  • ARMSTRONG (returned to service)
  • license server (returned to service)
  • proj11 (returned to service)
  • proj12 (returned to service)