# Slurm Job Submission and Scheduling

## Job submission and scheduling with Slurm

MSI uses Slurm as its job queueing and resource management system to efficiently and fairly manage the resources used by all users. When jobs are submitted to a queue (`partition` in Slurm nomenclature), they wait until the appropriate computational resources are available.

To submit jobs to a Slurm partition, users must first write Slurm job scripts. Slurm job scripts contain information on the resources requested for the calculation, as well as the commands for executing the calculation.

## Writing Job Scripts

A Slurm job script is a shell script with special lines that specify the resources required for the job as well as the commands to perform the computation. The script must be in plaintext.

If you are not sure where to start, copy one of the working examples below and edit it for your job. In most cases you will want to set the partition, walltime, CPU count, and memory explicitly rather than relying on defaults.

### Standard Batch Job

```bash
#!/bin/bash -l
#SBATCH --job-name=my-analysis
#SBATCH --partition=msismall
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --mem=32g
#SBATCH --time=01:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your-email@example.com

# ---------- Commands ----------
cd ~/program_directory
module purge
module load <software name>

# Keep threaded programs aligned with the CPUs requested from Slurm.
export OMP_NUM_THREADS=${SLURM_NTASKS}

./program_name inputfile > outputfile
```

### GPU Job

Use this pattern for applications that require GPU accelerators, such as deep learning or molecular dynamics.

```bash
#!/bin/bash -l
#SBATCH --job-name=my-gpu-job
#SBATCH --partition=msigpu
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --mem=32g
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:v100:1
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

# ---------- Commands ----------
cd ~/program_directory
module purge
module load <software name>

./program_name inputfile > outputfile
```

The `--gres=gpu:<type>:<count>` directive requests GPUs of a specific type (e.g., `v100`, `a40`) and count (e.g., `1`, `2`, `4`). For example, `--gres=gpu:v100:2` requests two V100 GPUs. You must select a partition that offers the GPU type you need. The `msigpu` partition provides shared GPU access. Private GPU partitions may also be available to your group.

### Job Array

Use this pattern to submit a collection of similar jobs that differ only by an input file or parameter.

```bash
#!/bin/bash -l
#SBATCH --job-name=my-array
#SBATCH --partition=msismall
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --mem=32g
#SBATCH --time=01:00:00
#SBATCH --array=1-10
#SBATCH --output=slurm-%A_%a.out
#SBATCH --error=slurm-%A_%a.err

# ---------- Commands ----------
cd ~/project_directory
module purge
module load <software name>

./program_name input_${SLURM_ARRAY_TASK_ID}.dat > output_${SLURM_ARRAY_TASK_ID}.txt
```

The `--array=1-10` directive creates 10 independent array tasks, each with a different `SLURM_ARRAY_TASK_ID` value (1 through 10). The `%A` placeholder expands to the job ID of the array as a whole and `%a` expands to the array task ID, so each task writes to its own output files. Use the `SLURM_ARRAY_TASK_ID` variable inside the script to select a different input file or parameter for each task. To limit the number of tasks that run concurrently, add a `%` suffix: `--array=1-100%5` runs at most 5 tasks at once. Submit the script once with `sbatch` and check status with `squeue -j <Slurm Job ID>` or `sacct -X -j <Slurm Job ID>`.
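When the inputs are not conveniently numbered, a common pattern is to list one parameter set per line in a text file and let each array task read its own line. The sketch below assumes a hypothetical parameter file and a hypothetical helper named `param_for_task`; neither is part of Slurm.

```bash
# Hypothetical helper: print the line of a parameter file that belongs to this array task.
# Assumes one parameter set per line, so task N reads line N.
param_for_task() {
    local file="$1"
    # Use an explicit task ID if one is given, otherwise the ID Slurm sets for array tasks.
    local task_id="${2:-${SLURM_ARRAY_TASK_ID:?SLURM_ARRAY_TASK_ID is not set}}"
    sed -n "${task_id}p" "$file"
}
```

Inside the job script, a line such as `./program_name $(param_for_task params.txt)` would then pass line N of `params.txt` (an illustrative file name) to array task N.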

In these examples:

- `--job-name` gives the job a readable name in `squeue` output.
- `--partition` selects the queue to use. The partition must match the kind of resources your job needs.
- `--output` and `--error` write standard output and standard error to files whose names include the Slurm job ID (`%j`).
- `--ntasks` is the number of tasks to run. Slurm allocates one processor core per task by default, so `--ntasks=16` requests 16 cores.

The script must include the commands needed to run your workload successfully, including moving to the correct directory and loading the required software modules.

The first line of a Slurm script defines which shell runs the script, as in any other shell script. MSI recommends using Bash. The `-l` option starts Bash as a login shell:

```bash
#!/bin/bash -l
```

Directives for the Slurm queuing system are used to specify the resources requested by the job; these lines begin with `#SBATCH`. For example, `--time=01:00:00` requests a walltime of one hour, `--ntasks=16` requests 16 processor cores, and `--mem=32g` requests 32 GB of memory.

The resource request must contain appropriate values; if the requested time, processors, or memory exceed what the selected partition and its hardware can provide, the job will not run.

The directives `#SBATCH --mail-type` and `#SBATCH --mail-user` control email notifications. The `--mail-type` directive tells Slurm which events trigger an email: `BEGIN` when the job starts, `END` when it finishes, and `FAIL` when it fails. The standard batch example above uses `END,FAIL`. Other options include `NONE` and `ALL`. The `--mail-user` directive specifies the address that receives the messages. Email notifications are recommended because the reason for a job failure can often be determined from them.

The rest of the script contains the commands that will be executed to perform the calculation. A Slurm script must contain the appropriate commands to set up and run the calculation, including `cd` and `module load` commands.
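It can also help to record basic job information at the top of the output file, which makes later troubleshooting easier. A minimal sketch, assuming the standard Slurm environment variables `SLURM_JOB_ID`, `SLURM_JOB_PARTITION`, and `SLURM_NTASKS` are set inside the job; the function name `job_summary` is hypothetical:

```bash
# Hypothetical helper: print one line describing the current job.
# Falls back to placeholder values when run outside of Slurm.
job_summary() {
    echo "job=${SLURM_JOB_ID:-none} partition=${SLURM_JOB_PARTITION:-none} tasks=${SLURM_NTASKS:-1}"
}
```

Calling `job_summary` right after the `module load` line writes this line to the file named by `--output`.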

### Common reasons jobs do not start immediately

Slurm schedules jobs according to both the resources requested and the limits configured for users, groups, and partitions. Jobs can remain pending if:

- The requested partition does not have enough free resources yet
- The requested walltime, memory, node count, or local scratch exceeds a partition limit
- Your group or user has reached a job-count or resource-usage limit
- The request is larger than necessary and is harder for the scheduler to fit

If a job is pending, the `NODELIST(REASON)` column in `squeue` is often the fastest way to see why.
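To list only pending jobs together with the scheduler's reason, the `squeue` output can be filtered by state. A minimal sketch, assuming a Slurm version that supports `--me`; the function name `pending_reasons` is hypothetical:

```bash
# Hypothetical helper: show your pending jobs and why each one is waiting.
# %i = job ID, %P = partition, %j = job name, %r = reason for the current state.
pending_reasons() {
    squeue --me --states=PENDING --format="%.18i %.9P %.20j %r"
}
```

Typical reasons include `Resources` (the job is waiting for free resources) and `Priority` (other jobs are ahead of it in the queue).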

## Submitting Job Scripts

Once a job script is written, submit it with the `sbatch` command:

```bash
sbatch scriptname
```

If you prefer, you can also provide the partition on the command line:

```bash
sbatch -p partitionname scriptname
```

MSI recommends specifying the partition explicitly, either in the script or on the command line, because this makes it easier to troubleshoot problems caused by insufficient or incompatible resource requests.
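When submitting from another script, it is often convenient to capture the job ID for later use with `squeue` or `scancel`. The sketch below relies on the `--parsable` option of `sbatch`, which prints only the job ID (followed by `;` and the cluster name on multi-cluster systems); the function name `submit_job` is hypothetical:

```bash
# Hypothetical helper: submit a script and print only the numeric job ID.
submit_job() {
    # --parsable prints "jobid" or "jobid;cluster"; keep the part before ';'.
    sbatch --parsable "$@" | cut -d ';' -f 1
}
```

For example, `jobid=$(submit_job scriptname)` followed by `squeue -j "$jobid"` submits the script and then shows its status.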

### Viewing and Canceling Jobs

To view your own jobs, use the `squeue` command with the `--me` option (use `-u username` instead to view another user's jobs):

```bash
squeue --me
```

This displays your current jobs and their associated job ID numbers. When run with no options, `squeue` shows jobs for the whole system.

Useful variations include:

```bash
# Show one job with the reason it is pending or the node it is running on
squeue -j 12345678

# Show your jobs with a few more helpful columns
squeue --me -o "%.18i %.9P %.20j %.8T %.10M %.6D %R"
```

Some common job states you may see are:

- `PD`: pending, waiting for resources or limits
- `R`: running
- `CG`: completing
- `CD`: completed successfully
- `CA`: canceled
- `TO`: timed out

For completed jobs and older job history, use `sacct`:

```bash
# Jobs that were active at any point during the last day
sacct --starttime now-1day --format=JobID,JobName%25,Partition,State,Elapsed,MaxRSS

# One specific job and its steps
sacct -j 12345678 --format=JobID,JobName%25,Partition,AllocTRES%35,State,Elapsed,ExitCode
```

`sacct` is especially useful after a job finishes because it shows accounting data that no longer appears in `squeue`, including elapsed time, exit code, and memory usage.
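The `MaxRSS` column can guide the `--mem` request for the next run. The sketch below converts a `MaxRSS` value to gigabytes, assuming the value carries a `K`, `M`, or `G` suffix and using binary multiples (1 GB = 1024 MB here). The function name `maxrss_to_gb` is hypothetical, and the exact units that `sacct` prints can depend on its options and configuration:

```bash
# Hypothetical helper: convert a MaxRSS value such as "33554432K" to gigabytes.
maxrss_to_gb() {
    echo "$1" | awk '{
        value = $1 + 0                   # numeric prefix of the field
        unit = substr($1, length($1))    # last character, expected K, M, or G
        if (unit == "K")      gb = value / (1024 * 1024)
        else if (unit == "M") gb = value / 1024
        else if (unit == "G") gb = value
        else                  gb = value / (1024 * 1024 * 1024)  # no suffix: assume bytes
        printf "%.2f\n", gb
    }'
}
```

If the result is well below the `--mem` value in the job script, the memory request can likely be reduced, which makes the job easier for the scheduler to fit.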

To cancel a submitted job, use the `scancel` command:

```bash
scancel jobIDnumber
```

Replace `jobIDnumber` with the appropriate job ID number determined by using the `squeue` command.

For a full list of available directives, see the [OPTIONS section of the `sbatch` command documentation](https://slurm.schedmd.com/sbatch.html#SECTION_OPTIONS). All options to `sbatch` can be specified as `#SBATCH` directives in a job script.