Downtime

28/02, 2017

login5 ran out of memory yesterday (27.02.2017) around 18:16 and took about 15 minutes to recover.

During this time the compute nodes were unable to contact the application scheduler running on login5 and some jobs might have crashed.
A typical error message for this case is: "aprun: Apid nnnnnnn: close of the compute node connection after app startup barrier".
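
To check whether a job was affected, its output can be searched for this message. The following is a minimal sketch, assuming the Apid is printed as a run of digits and that the job output is available as text; the function name `find_affected_apids` is hypothetical and not part of any Hexagon tooling.

```python
import re

# Pattern built from the error message quoted above.
# Assumption: the Apid placeholder "nnnnnnn" is a sequence of digits.
APRUN_ERROR = re.compile(
    r"aprun: Apid (\d+): close of the compute node connection "
    r"after app startup barrier"
)


def find_affected_apids(text):
    """Return the Apids (as strings) of all matching error lines in text."""
    return APRUN_ERROR.findall(text)
```

Reading a job's output file and passing its contents to `find_affected_apids` returns an empty list when the message does not occur.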

We apologise for any inconvenience caused.

26/12, 2016

Four cabinets went down due to power issues caused by the storm. The storage controllers for /work-common were also affected.
Hexagon was started without the /work-common filesystem.

We are working to fix the issues with the filesystem controllers and bring the filesystem back into production as soon as possible.

Update 2016-12-27 14:50: The problems with the /work-common storage controllers have been mitigated and the filesystem is back online. Hexagon had to be rebooted today at 14:15. All systems are up and functional again.