2026-06-17

  • A new system image is being rolled out with GPFS 5.2.3.8 for improved kernel compatibility

2026-05-22

  • A new system image is being rolled out with a fix for CVE-2026-46300 and CVE-2026-46333
  • Slurm has been updated to 25.11.6

2026-05-08

  • A new system image is being rolled out with a fix for CVE-2026-31431 (CopyFail), which will restore the impacted kernel cryptographic functions.
  • To mitigate CVE-2026-43284 (Dirty Frag), we have disabled the following kernel modules: esp4, esp6 and rxrpc. We expect no impact on HPC software and workflows.
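
For users who want to confirm that none of these modules are in use on a node, the following is a minimal sketch, not an official check. It assumes that a disabled module simply does not appear in /proc/modules (the changelog does not say how the modules were disabled), and the helper names loaded_modules and still_loaded are made up for illustration.

```python
from pathlib import Path

# Modules named in the changelog entry above.
DISABLED_MODULES = {"esp4", "esp6", "rxrpc"}


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # the first column is the module name
    return names


def still_loaded(proc_modules_text, disabled=DISABLED_MODULES):
    """Return the disabled modules that still appear as loaded, sorted."""
    return sorted(loaded_modules(proc_modules_text) & set(disabled))


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        found = still_loaded(path.read_text())
        print("still loaded:", ", ".join(found) if found else "none")
    else:
        print("/proc/modules is not available on this system")
```

An empty result only shows that the modules are not currently loaded; it does not prove that loading them is blocked.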

2026-04-30

To mitigate CVE-2026-31431 (CopyFail), we have disabled the impacted functions. Workloads that rely on kernel cryptographic functions may see a performance impact.

2026-04-29

The latest maintenance focused on improving the Slurm setup and rolling out the latest updates to various packages:

  • Upgraded to Slurm 25.11.5
  • Slurm job requeue is enabled by default (use --no-requeue to disable)
  • New system image deployment
    • Kernel 5.14.0-570.106.1.el9_6.x86_64
    • DOCA (InfiniBand) 3.2.2
    • GPFS 5.2.3-6
  • Enabled DHCP snooping on switches
  • Changed the IPoIB MTU from 2K to 4K and the rate from 10 Gbps to 100 Gbps for internal testing
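
With requeue enabled by default, a job may be started more than once, so it can help to detect a restart inside the job itself. The sketch below is one possible approach, not a site recommendation. It relies on the SLURM_RESTART_COUNT environment variable that Slurm sets for restarted or requeued jobs. The helper names and the checkpoint file name "checkpoint.dat" are hypothetical.

```python
import os


def restart_count(env=None):
    """Return how many times Slurm has restarted this job (0 on a first run)."""
    env = os.environ if env is None else env
    value = env.get("SLURM_RESTART_COUNT", "0")
    try:
        return int(value)
    except ValueError:
        return 0  # treat an unparsable value as a first run


def should_resume(env=None, checkpoint_exists=False):
    """Resume only if the job was restarted and a checkpoint is present."""
    return restart_count(env) > 0 and checkpoint_exists


if __name__ == "__main__":
    # "checkpoint.dat" is a hypothetical file name used for illustration.
    have_checkpoint = os.path.exists("checkpoint.dat")
    if should_resume(checkpoint_exists=have_checkpoint):
        print("requeued run: resuming from checkpoint")
    else:
        print("starting from scratch")
```

Jobs that cannot safely be restarted can instead opt out with --no-requeue, as noted in the list above.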

2026-02-11

We have deployed a new system image to mitigate CVE-2025-40248, which affected our virtualized compute nodes. We have found no evidence of exploitation for this vulnerability. The entire rollout was completed without requiring any full system downtime.

  • Upgraded to kernel 5.14.0-570.84.1
  • Upgraded to Nvidia 590.48.01

2026-01-09

We have deployed a new system image to mitigate CVE-2025-38499. We have found no evidence of exploitation for this vulnerability. The entire rollout was completed without requiring any full system downtime.

  • Upgraded to Red Hat Enterprise Linux 9.6
  • Upgraded to kernel 5.14.0-570.76.1
  • Upgraded to DOCA 3.2.0

2025-10-17

We have deployed a new system image to proactively mitigate a potential security vulnerability. The update was prioritized for GPU nodes and is now rolling out to all other nodes at a standard deployment pace. Crucially, we found no evidence of exploitation for this vulnerability. The entire rollout was completed without requiring any full system downtime.

  • Upgraded to kernel 5.14.0-427.92.1
  • Upgraded to Nvidia 580.95.05 (fixes CVE-2025-23280, CVE-2025-23330)

2025-09-25

We scheduled a two-day maintenance for September 25th, but the system was already back in production on the 26th at 10:00 CEST. We applied the following changes:

  • The minimum accepted GPFS client version in the cluster was set to the latest version (5.2.2)
  • Firewall improvements were tested and will be applied in an upcoming maintenance window
  • We switched to a different UFM (fabric manager) system to improve compatibility with the OSSC environment

2025-09-15

On September 11th, we had to perform emergency maintenance to address a critical vulnerability with known local exploits. The maintenance fixed this issue and also resolved the CVE that forced us to take the cbuild partition offline on August 20th. The cbuild partition is now available again. This maintenance took longer than usual because the new kernel was not compatible with GPFS, and IBM had to implement a fix over the weekend.

  • Kernel: New kernel version 5.14.0-427.88.1 which fixes CVE-2025-38352 and CVE-2025-38052.
  • Slurm: Minor update to the job scheduler to Slurm 25.05.3.
  • Containerization: Enhanced support for containers with updates to Podman 4.9.4 and its related ecosystem.
  • Parallel Filesystem: Updated GPFS to 5.2.2.1
  • Management Software: SaltStack has been updated to 3007.7.
  • CUDA driver: Updated to 580, which upgrades CUDA to version 13.