For the last few days, /pnfs has been having performance issues, and some storage nodes have been unable to serve files.
We have not identified the cause yet.
Mainly, cd/ls becomes slow, and only a restart of the service makes it fast again.
Also, a couple of storage nodes are having a hard time serving files (which is why some of your files are affected); this usually resolves itself after a while.
Cheers,
The T2B IT Team
Past Incidents
Sunday 30th June 2019
No incidents reported
Saturday 29th June 2019
No incidents reported
Friday 28th June 2019
Mass storage /pnfs issue
We are facing an 'Xrootd storm". This is due to a lot of remote transfers initiated probably by "ignore locality jobs". We had to restart the mass storage services. Some of your jobs might have crashed.
Thursday 27th June 2019
No incidents reported
Wednesday 26th June 2019
Batch System: Batch system did not accept jobs anymore
It seems the batch system was no longer accepting jobs. The service was rebooted and is now working properly again.
User Interfaces - mX machines: /cvmfs is not working properly on m9
/cvmfs seems to be outdated on m9, compared to other machines.
m9 needed a complete re-install to fix the problem. /cvmfs should now be up to date and the same as on the other machines.
To fix the issue with /cvmfs on m9, we need to empty the cache and reboot the machine. This will be done at 11 PM today, Thursday 27/06, so please save your work and log out of m9.
Tuesday 25th June 2019
Batch System: Incident with cooling at Computing Center
There is an unknown issue with the cooling in the datacenter room.
A large part of the worker nodes has therefore been forcefully stopped.
Some of your jobs will therefore have failed.
If the cooling comes back, nodes will be restarted tomorrow morning.
If it keeps failing, the temperature will force us to stop the batch system and eventually the storage as well.
We will update this incident regularly.
Last Friday evening, an intervention on the electrical board of the cooling system fixed the issue. As the temperature was stable during the weekend, it was agreed with the datacenter operators that we could bring back all our machines.
The batch system is now again at full capacity.
The issue with the cooling was diagnosed today as an electrical problem. As an attempt to fix it will only be made tomorrow, and in view of Saturday's heat wave, we will keep the batch system at half capacity, as it is right now, for the weekend.
The situation will be re-evaluated on Monday morning.
Because of the temperature, the /pnfs headnode went down. It is now back, so /pnfs should be accessible again.
Cooling was only partially restored, so only a minimal number of job slots are available.
Unless the situation starts to worsen again, we will keep the mX machines and storage up.
Complete availability of the batch system tomorrow will depend on whether the cooling machine can be fixed.