Issue with a storage node affecting /pnfs Thursday 13th February 2025 15:22:00

A storage node appears to be unstable and needs a reboot. It has affected part of /pnfs, leaving ~500TB inaccessible.

We are working to get it back up and running as soon as possible, and since this is its second failure, we are decommissioning it.

The array has finally finished repairing safely.

Unfortunately, the filesystem on top of it also had consistency issues because of the disk failures. It was repaired as well in the morning, but 29 files could not be recovered.

You can find the list of those files on the cluster:
/group/lost_files/20250219.filelist.txt
Please check it, as some of your files may be affected (files from a few users/experiments are impacted).
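
If you have many files, checking the list by eye can be tedious. Below is a minimal sketch of how the check could be scripted. It assumes the list contains one absolute path per line, which has not been verified here, and the function names find_lost_files and check_list_file are only illustrative.

```python
def find_lost_files(filelist_lines, my_prefixes):
    """Return the listed paths located under any of the given directories.

    Assumption: each line of the list holds one absolute file path.
    """
    # Normalise prefixes so that /pnfs/exp1 does not also match /pnfs/exp10.
    prefixes = [p.rstrip("/") + "/" for p in my_prefixes]
    hits = []
    for line in filelist_lines:
        path = line.strip()
        if not path:
            continue  # skip blank lines
        if any(path.startswith(prefix) for prefix in prefixes):
            hits.append(path)
    return hits


def check_list_file(list_path, my_prefixes):
    """Read the list file and return the entries under my_prefixes."""
    with open(list_path, encoding="utf-8", errors="replace") as fh:
        return find_lost_files(fh, my_prefixes)
```

For example, check_list_file("/group/lost_files/20250219.filelist.txt", ["/pnfs/<your directory>"]) would return the lost files under that directory, where the directory is a placeholder you replace with your own area. An empty result means none of the listed paths fall under it, provided the format assumption above holds.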

In the meantime, the /pnfs system is rebuilding the file DB on the pool and rescanning all files. This process is also expected to take time, but thankfully each file becomes available again on /pnfs as soon as it has been scanned. This means that fewer and fewer files will remain inaccessible over time.

The issue has been identified as 2 disks failing simultaneously on the same RAID array.

No data has been lost (we are protected against 2 failing disks), and the array is busy rebuilding onto 2 new hot-spare disks. Unfortunately, that will take some time (at least 3-4 days), and the 120TB this pool hosts will not be accessible in the meantime.