Some systems are experiencing issues

About This Site

Welcome to the T2B Cluster status page.

Here you will find status information about critical T2B cluster components, incidents, and planned maintenance.

Mail subscription is available to get a notification when a component's status changes.

Stickied Incidents

Monday 12th August 2024

Mass Storage (/pnfs) Some issues with mass storage /pnfs [Rucio & Crab]

Hello,

Several users have reported that:

1/ Rucio does not allow copies to our RSE and fails with the error: Details: RSE excluded; not available for writing.

2/ Crab also complains that tasks can't be started because you are not allowed to write to your home directory on our site: Checkwrite Result: Unable to check write permission in /store/user/rougny on site T2_BE_IIHE

We are investigating both issues.

On the other hand, standard grid commands on your files (e.g. gfal-copy) seem to work without any issues.

Cheers, Romain

  • Dear all,

    After consulting with central CMS IT services, it seems that they have resolved the problem on their end. We also received confirmation from users that Rucio and Crab work as expected again.

    Kind regards,

    Olivier, for the T2B Admin team

Past Incidents

    Saturday 20th July 2024

    No incidents reported

    Friday 19th July 2024

    Batch System downtime 22/07

    Hello,

    Because of the heat wave and the ongoing cooling works in the datacenter, some of the worker nodes have been shut down. You still have access to ~5700 job slots.

    REMINDER DOWNTIME: the batch system, /pnfs and the mX machines will be stopped after midnight, on the morning of Monday 22/07.

    Cheers, Romain

  • Hello,

    The cooling in the datacenter finally seems to be under control, so we were allowed to restart the /pnfs mass storage servers and the batch system.

    Note that today only ~3500 job slots have been started. More compute capacity will be added in steps to make sure the cooling can handle the load.

    As we took the opportunity to perform several software upgrades, do not hesitate to inform us if anything does not work as expected!

    Cheers, The T2B IT Team

  • Hello,

    Unfortunately, the maintenance on the cooling is still ongoing. Apparently, since it's holiday time, they are having a harder time getting experts on site in a timely fashion.

    We have asked them to at least allow us to start the /pnfs mass storage system. Hopefully we will get a positive response by tomorrow.

    Cheers, The T2B IT Team

  • Hello,

    Work on the cooling units in the datacenter is still ongoing. They hope to have it fixed today but will need the weekend to make sure it is stable. Unfortunately, this means we are not allowed to start any machine today.

    In light of this, we have decided to open up the mX machines, but you will NOT have access to:

    • /pnfs
    • the batch system (so no condor_* commands)

    We are very sorry for the inconvenience, The T2B IT Team

  • Hello,

    Unfortunately, the datacenter is still experiencing cooling issues and is not stable, so we cannot put any machine online. We are exchanging information with the people managing the datacenter and will inform you as soon as the situation changes.

    Cheers, Romain

  • Hello,

    As expected, the site is now under maintenance. Nothing will be accessible until further notice.

    We'll try to finish things as soon as possible.

    Cheers, Romain

    Thursday 18th July 2024

    No incidents reported

    Wednesday 17th July 2024

    No incidents reported

    Tuesday 16th July 2024

    No incidents reported

    Monday 15th July 2024

    No incidents reported

    Sunday 14th July 2024

    No incidents reported