Rolling reboot of Oakley and Ruby clusters, starting from 8:30AM October 9, 2017
We will perform rolling reboots of the Oakley and Ruby clusters starting at 8:30AM on Monday, October 9, 2017.
4:56PM 3/28/2017 Update: The rolling reboots of all systems are completed.
We upgraded both the Oakley and Ruby clusters to RHEL 6.8 during the October 12th downtime. Unfortunately, we have since noticed an NFS problem that causes rsh or ssh sessions to hang on Oakley and Ruby. To resolve this issue, we have downgraded the kernel to a version that does not exhibit the NFS regression and have started rebooting compute nodes on Oakley and Ruby. The reboots will not affect any running jobs, but users may experience longer queue wait times on Oakley and Ruby.
Update: Downtime completed at 6:30PM, June 7th.
The June 7th downtime is now slated to be completed at 6:30PM. Previous estimate was 5PM.
All systems and services will continue to be unavailable until that time.
Thank you for your cooperation.
Over the past two weeks we have experienced Oakley login node crashes potentially caused by a Lustre bug. The bug (or other underlying issue) appears to be triggered when a user performs operations on a Lustre directory that contains an excessive number of files (10,000 or more).
We have reached out to our support contacts and are working with them to resolve this issue. Updates will be posted here.
One of the Oakley login nodes is down. We are currently working on bringing it back online. SSH connections to oakley.osc.edu may time out. A workaround is to connect directly to oakley01.osc.edu.
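The workaround above can be sketched as a small script that tries the general login address first and falls back to the specific node. This is only an illustrative sketch: the helper name `first_reachable`, the use of port 22, and the 5-second timeout are assumptions, not part of any OSC tooling. It checks only that a TCP connection can be opened, not that a login will succeed.

```python
import socket


def first_reachable(hosts, port=22, timeout=5.0, connect=socket.create_connection):
    """Return the first host that accepts a TCP connection on `port`, or None.

    `connect` is injectable so the logic can be exercised without a network.
    """
    for host in hosts:
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError:
            # Covers timeouts, refused connections, and DNS failures.
            continue
        conn.close()
        return host
    return None


# Example use (performs real network connections when run):
#   host = first_reachable(["oakley.osc.edu", "oakley01.osc.edu"])
#   print(f"ssh {host}" if host else "no Oakley login node reachable")
```

If the general address times out, the helper returns oakley01.osc.edu, which can then be passed to ssh directly.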