Some systems are experiencing issues

About This Site

Welcome to the T2B Cluster status page.

Here you will find status information about critical T2B cluster components, incidents, and planned maintenance.

An email subscription is available to receive a notification when a component's status changes.

Stickied Incidents

Friday 31st October 2025

Mass Storage (/pnfs) /pnfs has intermittent issues

Hello,

For the last few days, /pnfs has been having performance issues, and some storage nodes cannot serve files. We have not yet identified the cause.

Mainly, cd/ls becomes slow, and only a restart of the service restores normal speed. In addition, a couple of storage nodes are having trouble serving files (which is why some of your files are affected); this usually resolves itself after a while.

Cheers, The T2B IT Team

Past Incidents

Sunday 30th June 2019

No incidents reported

Saturday 29th June 2019

No incidents reported

Friday 28th June 2019

Mass storage /pnfs issue

We are facing an "Xrootd storm". This is due to a large number of remote transfers, probably initiated by "ignore locality" jobs. We had to restart the mass storage services. Some of your jobs might have crashed.

Thursday 27th June 2019

No incidents reported

Wednesday 26th June 2019

Batch System Batch system stopped accepting jobs

It appears the batch system had stopped accepting jobs. The service was restarted and is now working properly again.

User Interfaces - mX machines /cvmfs is not working properly on m9

/cvmfs seems to be outdated on m9, compared to other machines.

  • m9 needed a complete re-install to fix the problem. /cvmfs should now be up to date and the same as on the other machines.

  • To fix the issue with /cvmfs on m9, we need to empty the cache and reboot the machine. This will be done at 11PM today, Thursday 27/06, so please save your work and exit m9.

Tuesday 25th June 2019

Batch System Incident with cooling at Computing Center

There is an unidentified issue with the cooling in the datacenter room. A large part of the worker nodes have therefore been forcefully stopped, so some of your jobs will have failed.

If the cooling comes back, the nodes will be restarted tomorrow morning. If it keeps failing, the temperature will force us to stop the batch system and eventually the storage. We will update this incident regularly.

  • Last Friday evening, an intervention on the electrical board of the cooling system fixed the issue. As the temperature was stable during the weekend, it was agreed with the datacenter operators that we could bring back all our machines.

    The batch system is now back at full capacity.

  • The cooling problem was diagnosed today as an electrical issue. As a fix will only be attempted tomorrow, and in view of Saturday's heat wave, we will keep the batch system at its current half capacity over the weekend.

    The situation will be re-evaluated on Monday morning.

  • Because of the temperature, the /pnfs headnode went down. It is now back, so /pnfs should be accessible.

    Cooling was only partially restored, so only a minimal number of job slots are available. Unless the situation starts to worsen again, we will keep the mX machines and storage up. Full availability of the batch system tomorrow will depend on whether the cooling unit can be fixed.

Monday 24th June 2019

No incidents reported