Still some issues with /user impacting the batch system
Tuesday 23rd April 2024 13:11:39


Hello,

Unfortunately, we keep having issues with /user, which impact the batch system.

On some worker nodes this means jobs using /user get stuck and do not finish in a timely manner.

Another consequence is that some folders used by your jobs are inaccessible. This is because, somewhere on the cluster, one or more jobs are stuck accessing them. As soon as we kill the corresponding stuck jobs, the directories become accessible again.

This has happened a few times already in the past weeks. Each time, we identify a workflow that negatively impacts /user and we talk with the people concerned to fix it, so it never seems to be the same jobs causing the problem. We always try to fix everything as soon as possible by vacating jobs and decreasing the number of allowed running jobs to limit the potential impact, and it works for a while. We're also investigating whether there is an underlying issue, as everything had been working fine for a few months.

We're sorry for all the inconvenience, and we'll share news as soon as we have any!

Cheers, The T2B IT Team