Some systems are experiencing issues

About This Site

Welcome to the T2B Cluster status page.

Please find status information about critical T2B cluster components, incidents and planned maintenance.

Mail subscription is available to get a notification when a component's status changes.

Stickied Incidents

Monday 12th August 2024

Mass Storage (/pnfs) Some issues with mass storage /pnfs [Rucio & Crab]

Hello,

Several users have reported that:

1/ Rucio does not allow copies to our RSE with error: Details: RSE excluded; not available for writing.

2/ Crab also complains that tasks can't be started because you are not allowed to write on your home directory on our site: Checkwrite Result: Unable to check write permission in /store/user/rougny on site T2_BE_IIHE

We are investigating both issues.

On the other hand, standard grid commands on your files (e.g. gfal-copy) seem to work without any issues.

Cheers, Romain

  • Dear all,

    After consulting with the central CMS IT services, it appears that they have resolved the problem on their end. Users have also confirmed that Rucio and Crab work as expected again.

    Kind regards,

    Olivier, for the T2B Admin team

Past Incidents

    Tuesday 30th April 2024

    No incidents reported

    Monday 29th April 2024

    No incidents reported

    Sunday 28th April 2024

    No incidents reported

    Saturday 27th April 2024

    No incidents reported

    Friday 26th April 2024

    No incidents reported

    Thursday 25th April 2024

    No incidents reported

    Wednesday 24th April 2024

    No incidents reported

    Tuesday 23rd April 2024

    Batch System Still some issues with /user impacting the batch system

    Hello,

    Unfortunately, we keep having issues with /user, which impact the batch system.

    On some worker nodes this means jobs using /user get stuck and do not finish in a timely manner.

    Another consequence is that some folders used by your jobs become inaccessible. This happens because, somewhere on the cluster, one or more jobs are stuck accessing them. As soon as we kill the corresponding stuck jobs, the directory becomes accessible again.

    This has happened a few times already in the past weeks. We always identify a workflow that negatively impacts /user and work with the people concerned to fix it, so it is never the same jobs that seem to be the cause. We always try to fix everything asap by vacating jobs and lowering the number of allowed running jobs to limit potential impact, and this works for a while. We are also investigating whether there is an underlying issue, as everything had been working fine for a few months.

    We're sorry for all the inconvenience, and we'll share news as soon as we have any!

    Cheers, The T2B IT Team

  • Hello,

    The storage has now been stable for a while. We are assessing the situation as stable.

    Cheers, Romain