Some systems are experiencing issues

About This Site

Welcome to the T2B Cluster status page.

Please find status information about critical T2B cluster components, incidents and planned maintenance.

Mail subscription is available to receive a notification when a component's status changes.

Stickied Incidents

Tuesday 23rd April 2024

Batch System: Still some issues with /user impacting the batch system

Hello,

Unfortunately, we keep having issues with /user, and these impact the batch system.

On some worker nodes this means jobs using /user get stuck and do not finish in a timely manner.

Another consequence is that some folders used by your jobs become inaccessible. This happens because, somewhere on the cluster, one or more jobs are stuck accessing them. As soon as we kill the corresponding stuck jobs, the directories become accessible again.

This has happened a few times already in the past weeks. Each time, we identify a workflow that negatively impacts /user and talk with the people concerned to fix it, so it never seems to be the same jobs that are the cause. We always try to fix everything as soon as possible by vacating jobs and decreasing the number of allowed running jobs to limit potential impacts, and it works for a while. We're also investigating whether there is an underlying issue, as the system had been working fine for a few months.

We're sorry for the inconvenience, and we'll share news as soon as we have any!

Cheers, The T2B IT Team

Past Incidents

Tuesday 9th April 2024

Batch System: Batch System Reboot

Hello,

In order to prepare for the switch from Grid certificates to Tokens, the batch system has been upgraded to a new minor version. This has caused most of the worker nodes to drain in order to restart the condor service.

This explains why most of your jobs are still queued. They will start running again as more worker nodes come back with the new condor version.

Cheers, The T2B IT Team

  • Hello,

    The batch system has finished draining for the upgrade, and all worker nodes are now back online.

    On another note, you might have noticed schedd03 not working (this happened on the 3rd, the 5th and this morning); that is a consequence of the /user issues. In those cases, running jobs might be requeued and left waiting to run again. This should be explicit in the job log file.

    We are continuing to investigate the issue to find a way to prevent it; however, finding exactly what causes the load on our /user storage is not easy.

    One thing that might help schedd03 is to set 'should_transfer_files = NO' in your submit file, unless transfers are really required. If you set it to YES, then ALL transfers have to go through schedd03, potentially making it a bottleneck if many of your jobs finish at the same time. Since /user and /pnfs are available on all worker nodes, you can simply copy what you want back to either of them at the end of your job, rather than asking condor to do it.

    Do not hesitate to contact us if you have questions or want confirmation about your workflow on the batch system!

    Cheers, The T2B IT Team
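
As an illustration of the advice above about 'should_transfer_files = NO', here is a minimal Python sketch of a copy step that a job could run as its last action. The function name copy_outputs and any paths you pass to it are hypothetical and not something provided by the T2B team; it only assumes the destination is a directory reachable from the worker node, as /user and /pnfs are described to be.

```python
import shutil
from pathlib import Path


def copy_outputs(output_files, destination_dir):
    """Copy each produced file into destination_dir and return the new paths.

    Hypothetical helper: intended as the last step of a job submitted with
    should_transfer_files = NO, so that condor does not move the outputs.
    """
    destination = Path(destination_dir)
    # Create the target directory once, instead of checking it per file.
    destination.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in output_files:
        source = Path(name)
        # Fail loudly if the job did not produce an expected file.
        if not source.is_file():
            raise FileNotFoundError(f"expected output file is missing: {source}")
        target = destination / source.name
        # copy2 copies the content and, where possible, the file metadata.
        copied.append(Path(shutil.copy2(source, target)))
    return copied
```

A job script could, for instance, finish with a call such as copy_outputs(["histos.root"], "/user/your_login/results"), where both the file name and the directory are placeholders to replace with your own.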

Monday 8th April 2024

No incidents reported

Sunday 7th April 2024

No incidents reported

Saturday 6th April 2024

No incidents reported

Friday 5th April 2024

No incidents reported

Thursday 4th April 2024

No incidents reported

Wednesday 3rd April 2024

No incidents reported