For the last few days, /pnfs has been having performance issues, and some storage nodes have been unable to serve files.
We have not identified the cause yet.
Mainly, cd and ls become slow, and only a restart of the service makes them fast again.
In addition, a couple of storage nodes have a hard time serving files, which is why some of your files have issues. This usually resolves itself after a while.
Cheers,
The T2B IT Team
Past Incidents
Sunday 16th February 2025
No incidents reported
Saturday 15th February 2025
No incidents reported
Friday 14th February 2025
No incidents reported
Thursday 13th February 2025
Mass Storage (/pnfs): issue with a storage node affecting /pnfs
One storage node seems unstable and needs to be rebooted.
This affects part of /pnfs, with ~500 TB not accessible.
We are working to get it back up as soon as possible, and since this is its second failure, we will decommission it.
The array has finally finished repairing safely.
Unfortunately, the filesystem on top of it also had consistency issues because of the disk failures.
It was repaired as well in the morning, but 29 files could not be recovered.
You can find the list of those files here on the cluster: /group/lost_files/20250219.filelist.txt
Please check it, as some of your files may be affected (a few users and experiments have affected files).
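As a quick way to check the list, here is a minimal Python sketch that prints the entries containing your username. It assumes the list holds one file path per line and that your username appears somewhere in the path; neither is confirmed here, so adapt the search string if your files live under an experiment or group directory instead. The helper name find_my_lost_files is only illustrative.

```python
import os
from pathlib import Path

# Path of the list quoted in the notice above.
LOST_FILES_LIST = Path("/group/lost_files/20250219.filelist.txt")


def find_my_lost_files(lines, needle):
    """Return the non-empty entries that contain `needle` as a substring.

    Assumption: each entry is one file path on its own line.
    """
    return [line.strip() for line in lines if line.strip() and needle in line]


if __name__ == "__main__":
    # Assumption: the USER environment variable holds your cluster username.
    user = os.environ.get("USER")
    if not user:
        print("USER is not set; pass your username or directory name instead")
    elif not LOST_FILES_LIST.exists():
        print(f"{LOST_FILES_LIST} not found; run this on the cluster")
    else:
        with LOST_FILES_LIST.open() as handle:
            for path in find_my_lost_files(handle, user):
                print(path)
```

This is a plain substring match, so a short username may also match unrelated paths; it is worth reading the results rather than just counting them.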
In the meantime, the /pnfs system is rebuilding the file DB on the pool and rescanning all files. This process is also expected to take some time, but thankfully each file becomes available again on /pnfs as soon as it has been scanned.
This means that fewer and fewer files will remain inaccessible over time.
The issue has been identified as 2 disks failing simultaneously on the same RAID array.
No data has been lost (we are protected against 2 failing disks), and the array is busy rebuilding onto 2 new hot-spare disks.
Unfortunately, that will take some time (3-4 days at minimum), and the 120 TB this pool hosts will not be accessible in the meantime.