Major network switch outage

Resolution: Resolved

01:20 PM 11/14/2018 Update:

All user-facing issues have been resolved and services have been restored. Jobs that were running may recover, but please check your job output to verify correctness. Some jobs failed and will need to be resubmitted. We will be reviewing job logs to identify the jobs that failed.

We've opened a ticket with the vendor to address ongoing issues and perform a root cause analysis. We are also evaluating the impact on clients and will update the user community once we know more.

We apologize for the inconvenience this may have caused you. Please contact OSC Help if you have any questions.

09:40 AM 11/14/2018 Update:

Login nodes on all clusters are now available. Home directories are accessible for most users. However, users whose home directories were impacted during the outage are still unable to connect to the systems. Reboots of all clusters, including both login and compute nodes, might be needed to bring the systems back to a healthy state. We are also evaluating the impact on clients and will update the user community once we know more.

We apologize for the inconvenience this may have caused you. Please contact OSC Help if you have any questions.

Original Post: 

At about 1:50 AM on November 14th, OSC experienced a major switch failure that disrupted the home directory service. The switches have been recovered, but the home directories are still offline. As a result, logins to all clusters are currently failing. There are likely a large number of job failures. OSC engineers are working on the issue, and we will update this text and send an email to the user community when we have a fuller picture of the extent of the impact and the recovery time.

We apologize for the disruption and ask for your patience as we respond to your inquiries.