Compute Need to Know

Login Nodes

Node names: ahl0[1-4] e.g., ahl01

Login nodes are the ‘front door’ into compute resources. They are intended for

  • Creating a new remote connection to MSI systems

  • Performing data transfer from a local machine using SFTP, scp or other supported data transfer protocol

  • Navigating file system such as interacting with Tier 1 and Tier 2 storage

  • Interfacing with the Slurm scheduler such as submitting jobs and checking their status

They are not intended for

  • Heavy compute - Processes that exceed the threshold will have their process killed after 15 minutes automatically

  • Installation of ‘large’ packages - some package installations rely on temp disk; users are advised to create an interactive session using srun and request temp disk using the --temp <amount>GB

Compute Nodes

Node names include, but are not limited to: acn*, aga*, cn*, n*

Compute resources are available to all users by default. Our Partitions page has information on the names of the partitions and the per partition limitation such as max number of nodes per job, max walltime and max number of jobs per user.

Maintenance Periods

At MSI there are monthly planned maintenance days wherein operations staff will apply systems changes and planned updates. The maintenance period (M-day) typically takes place the first Wednesday of the month starting at 5 AM and ending at around 5PM the same day. Information on the upcoming maintenance day is available on our Status Page

User communications are sent out ahead of the maintenance period with information on any user facing changes.

2026 Maintenance Days

  • 01/07/26 January

  • 02/04/26 February

  • 03/04/26 March

  • 04/01/26 April

  • 05/06/26 May

  • 06/03/26 June

  • 07/01/26 July

  • 08/05/26 August

  • 09/02/26 September

  • 10/07/26 October

  • 11/04/26 November

  • 12/02/26 December

  • 01/06/27 January

Leading up to M-Day:

As mentioned above, the maintenance window starts at 5am, this means that jobs that request a walltime that would run past the start of the window will be held in queue with a reason similar to ‘Reserved for maintenance…’.

Jobs may be resubmitted with a shorter walltime in order to become eligible to run.

../_images/maintenance_timeline.png

The remaining hours before the next M-Day can be determined with this command template. Enter this into any remote shell on MSI.

bc <<< "($(date -d '<date of next maintenance> 05:00' +%s) - $(date +%s)) / 3600"

-- Example
14:06:51 [vega0051@ahl02 ~ ]$ bc <<< "($(date -d '2026-05-06 05:00' +%s) - $(date +%s)) / 3600"
278

During Maintenance Day :

  • The job scheduler is suspended, no jobs will run.

    • This suspension also applies to private partitions.

    • New jobs may be submitted but will similarly be held until the scheduler is resumed after the maintenance period is complete.

User Resource Limits

Agate is a shared compute environment with limits per compute resource, user, and groups. These are the current user limits to consider when expanding a workflow.

Per user job limits

Max Jobs

All Partitions

5000

interactive

1

interactive-gpu

1

There are also limits to the number of individual jobs that can be placed into the queue. A user may submit within the rate specified below. If the limit is met, users will need to wait for a cool down period before being able to place additional jobs.

While job arrays have a high index limit, job steps are still limited by the max number of jobs per hour.

Max Values

Count

Jobs per Hour

5000

Job Array index

100,001

Compute Resource Limits

The Agate cluster is divided up into individual partitions that have a set of compute resources associated. Information on the existing partitions can be found on our Partitions page.

Handling Hardware Issues

Occasionally, compute resources may encounter issues that would prevent them from being able to perform as expected. To address this our operations staff have a Node Health Check (NHC) process that is executed on each node. This process is executed in a 15 minute interval to review the nodes vitals. Should a node fail to pass the NHC, it is automatically marked as ‘DRAIN’ and scheduled for reboot.

During this drain state, existing job will continue running and the scheduler will avoid placing new jobs on the node until after the node returns to service.