Some systems are experiencing issues

About This Site

Welcome to the T2B Cluster status page.

Here you will find status information about critical T2B cluster components, incidents, and planned maintenance.

Mail subscription is available to get a notification when a component's status changes.

Stickied Incidents

Tuesday 23rd April 2024

Batch System Still some issues with /user impacting the batch system

Hello,

Unfortunately, we keep having issues with /user, and these impact the batch system.

On some worker nodes this means jobs using /user get stuck and do not finish in a timely manner.

Another consequence is that some folders used by your jobs become inaccessible. This happens because, somewhere on the cluster, one or more jobs are stuck accessing them. As soon as we kill the corresponding stuck jobs, the directory becomes accessible again.

This has happened a few times already in the past weeks. Each time, we identify a workflow that negatively impacts /user and talk with the people involved to fix it, so it never seems to be the same jobs causing the problem. We always try to fix everything as soon as possible by vacating jobs and decreasing the number of allowed running jobs to limit the potential impact, and this works for a while. We are also investigating whether there is an underlying issue, as everything had been working fine for a few months.

We're sorry for all the inconvenience, and we'll share news as soon as we have any!

Cheers, The T2B IT Team

Past Incidents

Tuesday 2nd April 2024

No incidents reported

Monday 1st April 2024

No incidents reported

Sunday 31st March 2024

No incidents reported

Saturday 30th March 2024

No incidents reported

Friday 29th March 2024

No incidents reported

Thursday 28th March 2024

No incidents reported

Wednesday 27th March 2024

No incidents reported

Tuesday 26th March 2024

User Interfaces - mX machines Issues with /user

Hello,

Unfortunately, we have encountered the same issue with /user as last Friday. Connections to some mX machines can be slow, and /user storage is either slow or blocked.

We are still trying to find the source and fix it for good. Note that /pnfs is not affected.

Sorry for all the issues this causes.

Cheers, The T2B IT Team

  • Hello,

    We have finally found the source of the issues with /user. It was caused by one user's problematic workflow; after removing the offending jobs, everything is working again.

    On this note, please make sure to NEVER have a single input data file that is read by all your jobs on /user. Our /user storage system cannot cope with thousands of jobs trying to read a single file. The correct workflow is to put the file(s) on /pnfs, then inform us so that we can make duplicates, which protects the storage system from harm.

    Cheers, The T2B IT Team