Downtime

login5 ran out of memory yesterday (27.02.2017) around 18:16 and took about 15 minutes to recover.

During this time, the compute nodes were unable to contact the application scheduler running on login5, and some jobs may have crashed.
A typical error message for this case is: "aprun: Apid nnnnnnn: close of the compute node connection after app startup barrier".

We apologise for any inconvenience caused.

Four cabinets went down due to power issues caused by the storm. The storage controllers for /work-common are also affected.
Hexagon was started without the /work-common filesystem.

We are working to fix the issues with the filesystem controllers and bring the filesystem back into production as soon as possible.

Update 2016-12-27 14:50: The problems with the /work-common storage controllers have been mitigated and the filesystem is back online. Hexagon had to be rebooted today at 14:15. All systems are up and functional again.