Some systems are experiencing issues

About This Site

Welcome to the T2B Cluster status page.

Here you will find status information about critical T2B cluster components, incidents, and planned maintenance.

A mail subscription is available if you want to be notified when a component's status changes.

Stickied Incidents

Monday 12th August 2024

Mass Storage (/pnfs) Some issues with mass storage /pnfs [Rucio & Crab]

Hello,

Several users have reported that:

1/ Rucio refuses copies to our RSE, with the error: Details: RSE excluded; not available for writing.

2/ CRAB also complains that tasks can't be started because you are not allowed to write to your home directory on our site: Checkwrite Result: Unable to check write permission in /store/user/rougny on site T2_BE_IIHE

We are investigating both issues.

On the other hand, standard grid commands on your files (e.g. gfal-copy) seem to work without any issues.

Cheers, Romain

  • Dear all,

    After consulting with central CMS IT services, it seems that they have resolved the problem on their end. We have also received confirmation from users that Rucio and CRAB work as expected again.

    Kind regards,

    Olivier, for the T2B Admin team

Past Incidents

    Saturday 6th July 2024

    No incidents reported

    Friday 5th July 2024

    No incidents reported

    Thursday 4th July 2024

    No incidents reported

    Wednesday 3rd July 2024

    No incidents reported

    Tuesday 2nd July 2024

    No incidents reported

    Monday 1st July 2024

    Network Datacenter issue impacting network

    Hello,

    There seems to be an issue in the datacenter, as the temperatures of all servers are off the charts. In particular, the NAT, which allows you to access public IPv4 addresses from the batch system, keeps shutting down to protect itself.

    Investigation is under way.

    Cheers, The T2B IT Team

  • Hello,

    In order to protect the equipment, we decided to preventively stop about 60% of the worker nodes in the batch system.

    In the meantime, a technician was dispatched and fixed the cooling. We have confirmed that the overall temperature is back within sensible values. The worker nodes will be rebooted tomorrow morning to restore the full capacity of the batch system.

    Cheers, Romain

    Sunday 30th June 2024

    No incidents reported