Batch System Reboot | T2B cluster status page

Hello,

The batch system has finished draining for the upgrade, and all worker nodes are now back online.

On another note, you might have noticed schedd03 not working (this happened on the 3rd, the 5th and this morning). This is a consequence of issues with /user. In those cases, running jobs might be requeued and have to wait to run again. This should be stated explicitly in the job log file.

We are continuing to investigate the issue to find a way to prevent it. However, finding exactly what causes the load on our /user storage is not easy.

One thing that might help schedd03 is to make sure you set 'should_transfer_files = NO' in your submit file, unless file transfer is really required. If you set it to YES, then ALL transfers have to be done by schedd03, potentially making it a bottleneck if many of your jobs finish at the same time. Since /user and /pnfs are available on all worker nodes, you can simply copy what you want back to either of them at the end of your job, rather than asking condor to do it.
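
As an illustration, a minimal submit file with file transfer disabled could look like the sketch below. The executable name (run_job.sh) and the logs/ directory are hypothetical placeholders, not names from our setup; only 'should_transfer_files = NO' is the setting recommended above. The sketch assumes that your job script itself copies its results to /user or /pnfs as its last step.

```
# Hypothetical example: run_job.sh and logs/ are placeholders, adapt them to your workflow.
executable            = run_job.sh

# Do not ask condor to transfer files; the job reads and writes the shared storage directly.
should_transfer_files = NO

# Standard output, standard error and the job log (requeues should show up in the log).
output                = logs/job_$(ClusterId).$(ProcId).out
error                 = logs/job_$(ClusterId).$(ProcId).err
log                   = logs/job_$(ClusterId).$(ProcId).log

queue
```

With this setup, schedd03 is not involved in moving your files when a job ends. If you are unsure whether it fits your workflow, ask us before changing it.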

Do not hesitate to contact us if you have questions or want confirmation about your workflow on the batch system!

Cheers,
The T2B IT Team

Batch System Reboot Tuesday 9th April 2024 10:42:35