Job Sizing and Efficiency
=========================

This page collects approaches for finding the sweet spot when requesting resources: enough for your workflow to run well, but not so much that job prioritization suffers.

What is job size and why is it important?
-----------------------------------------

Job size is one of the main factors that affect job prioritization. Jobs that request more resources generally wait longer in the queue for those resources to become available than smaller jobs do.

The size of a job is determined by the quantity of resources requested over time. This includes:

- Number of tasks/CPU threads
- Total memory that will be shared by tasks
- Count of additional resources such as GPUs
- Number of nodes requested

Correctly sizing jobs is crucial in a shared compute environment because it ensures that only the necessary resources are allocated to a task and that as few CPU cycles as possible go unused.
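
As a minimal sketch, the factors listed above map onto ``#SBATCH`` directives roughly as follows. The values and the command ``./my_program`` are placeholders rather than recommendations for any particular MSI partition, and the ``--gres`` line assumes the target partition offers GPUs.

.. code:: shell

  #!/bin/bash
  # Number of nodes requested
  #SBATCH --nodes=1
  # Number of tasks and CPU threads per task
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=4
  # Total memory per node, shared by the tasks
  #SBATCH --mem=16G
  # Additional resources such as GPUs (omit if unused)
  #SBATCH --gres=gpu:1
  # Wall-clock limit; job size is counted over time
  #SBATCH --time=02:00:00

  # Hypothetical workload
  ./my_program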

Benefits of right-sizing Slurm jobs
-----------------------------------
The Slurm scheduler uses a backfill mechanism to start jobs that fit into the gaps between jobs that are already running. Smaller jobs generally start sooner because the scheduler does not have to spend as much time finding gaps in resources large enough to match the request.

An additional benefit is that jobs associated with a user or group receive higher priority when that user or group runs smaller, more efficient jobs. Larger jobs are more "expensive" because all requested resources count towards the group's overall raw compute usage. The larger the request, the higher the percentage of the overall cluster that is allocated to the group.

Migrating workflows to MSI Systems
----------------------------------

If you have existing workflows that need to be scaled up on MSI systems, it can be difficult to determine a reasonable amount of compute resources to request initially. Ideally, you will request just enough resources to perform the tasks in your batch and interactive jobs, so that no surplus of allocated resources sits unused by your processes.

For workflows that are coming from existing compute environments, we recommend starting with sessions that match the original CPU and memory counts, adding at most 25-50% more resources. For example, if your workflow normally runs on a local laptop with 16 cores and 32 GB of system memory, you can create an interactive session on Open OnDemand that matches those resources using the custom options.
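
The headroom arithmetic for the laptop example can be sketched in shell. The function name ``scale_request`` is hypothetical, and rounding up to a whole unit is an assumption made for convenience.

.. code:: shell

  # Scale a baseline resource count by a percentage of headroom, rounding up.
  # Usage: scale_request BASE PERCENT
  scale_request() {
    local base=$1 pct=$2
    echo $(( (base * (100 + pct) + 99) / 100 ))
  }

  scale_request 16 25   # cores with 25% headroom
  scale_request 32 50   # GB of memory with 50% headroom

For a 16-core, 32 GB laptop, 50% headroom gives an upper bound of 24 cores and 48 GB.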

In the early stages when installing packages and adapting to a remote file system, utilizing sessions such as 'Persistent Desktop' on Open OnDemand can aid with getting familiar with the resources/software that are available.

Job performance insights
------------------------

In batch jobs, you can elect to receive a post-job email that includes the resources your job actually used. Add the following line to an existing batch script to enable email notifications.

.. code:: shell

  #SBATCH --mail-type=ALL --mail-user=<UMN email address>

At job completion, you will receive a message with details on CPU and memory efficiency. Batch jobs should aim for 80-90% efficiency. Lower scores suggest that the job would perform the same with a smaller resource request.
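
For orientation, CPU efficiency as reported by tools such as ``seff`` is the CPU time actually consumed divided by the core time allocated (elapsed time multiplied by allocated cores). The sketch below reproduces that arithmetic. The function name ``cpu_efficiency`` and the sample numbers are hypothetical.

.. code:: shell

  # Usage: cpu_efficiency CPU_SECONDS_USED ELAPSED_SECONDS ALLOCATED_CORES
  cpu_efficiency() {
    awk -v used="$1" -v elapsed="$2" -v cores="$3" \
      'BEGIN { printf "%.1f\n", 100 * used / (elapsed * cores) }'
  }

  # A one-hour job on 16 cores that consumed 8 core-hours of CPU time
  cpu_efficiency 28800 3600 16   # prints 50.0

A result well below the 80-90% target, as in this sample, suggests trying a smaller CPU request on the next run.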

seff and reportseff
-------------------

The ``seff`` command lets you check the resource efficiency scores of individual jobs that have completed. To use the command, type ``seff <JOBID>``, where ``<JOBID>`` is replaced with the corresponding Slurm job ID.

The ``reportseff`` command is useful when you want similar efficiency reports for all jobs over a specified time period.

Example usage of ``reportseff``, covering your jobs over the last 3 weeks:

.. code:: shell

  reportseff -u USERNAME --since now-3weeks

Help menus are available with ``reportseff --help``.

Observing average wait time
---------------------------

The wait time for a job depends on its prioritization. Looking at trends over the last month can show whether your jobs have begun to take longer to start.

The ``sacct`` command can be used, as in the following example, to see how long jobs sat in the queue before they started.

Jobs for your primary group over the last 4 weeks:

.. code:: shell

 sacct -XaA GROUPNAME -S now-4week -o jobid,user,partition%20,planned,alloctres%45,state --units=GB

The default output has been reformatted to show the columns that are typically of interest. In this case they are:

- ``jobid`` - Slurm job ID
- ``user`` - user who submitted the job
- ``partition`` - name of the partition the job ran on
- ``planned`` - how many hours/minutes/seconds the job waited before starting
- ``alloctres`` - what resources were allocated to the job
- ``state`` - job state, e.g. COMPLETED, FAILED

Additional column options can be listed with the command ``sacct -e``. Columns can be made wider, as in the example above, by appending ``%`` and the desired width in characters.
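
To turn the ``planned`` column into a single average, a small helper can convert Slurm durations to seconds. This is a sketch: the function name ``avg_wait_seconds`` is hypothetical, and it assumes durations arrive one per line in ``[DD-]HH:MM:SS`` form.

.. code:: shell

  # Read Slurm durations ([DD-]HH:MM:SS), one per line on stdin,
  # and print their mean in whole seconds.
  avg_wait_seconds() {
    awk '
      NF == 0 { next }                      # skip blank lines
      {
        days = 0; t = $1
        if (index(t, "-") > 0) {            # optional leading "DD-"
          split(t, d, "-"); days = d[1]; t = d[2]
        }
        n = split(t, p, ":")
        if (n == 3)      secs = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) secs = p[1] * 60 + p[2]
        else             secs = p[1]
        total += days * 86400 + secs; count++
      }
      END { if (count > 0) printf "%d\n", total / count; else print 0 }
    '
  }

  # Sample input: waits of 10 minutes, 20 minutes, and 1 day
  printf '00:10:00\n00:20:00\n1-00:00:00\n' | avg_wait_seconds   # prints 29400

On a system with Slurm, the helper could be fed directly from ``sacct``, for example ``sacct -X -n -P -u "$USER" -S now-4weeks -o planned | avg_wait_seconds``.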

Measuring resource usage in real-time
-------------------------------------

Another way to determine how much future jobs ought to request is to create an interactive or batch job with more resources than you anticipate needing and measure its usage in real time.

While the process is running, commands such as ``top``, ``ps``, and ``free`` can be used to check resource utilization.

The ``top`` tool displays Linux processes and can be used to monitor the ones that are running. Open its interface by typing ``top`` while connected to the system running your process. To close it, press ``q``.

Some columns of interest are:

- ``PID`` - process ID
- ``%CPU`` - CPU usage, multithreaded processes will display values greater than 100%
- ``%MEM`` - Memory (RAM) usage
- ``COMMAND`` - Command that is running

The following command restricts ``top`` to the processes that match a given name and displays the memory summary in GiB:

.. code:: shell

  top -Eg -p $(pgrep -d ' -p ' <process name>)

The ``free`` command shows information on memory usage. Note that it may report the total memory available on the compute node itself rather than the amount requested by the current job.

.. code:: shell

  09:50:02 [vega0051@ahl01 ~ ]$ free -ght
                total        used        free      shared  buff/cache   available
  Mem:          503Gi        42Gi        27Gi        25Gi       433Gi       430Gi
  Swap:            0B          0B          0B
  Total:        503Gi        42Gi        27Gi
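
Because ``free`` reports node-wide figures, a per-process view can be more useful when sizing a request. As a minimal sketch that assumes a Linux system with ``/proc`` mounted, the current (``VmRSS``) and peak (``VmHWM``) resident memory of a process can be read from its status file. The function name ``proc_mem_kb`` is hypothetical.

.. code:: shell

  # Print current and peak resident memory (in kB) for a process ID.
  # Reads /proc/PID/status, so it works only on Linux.
  proc_mem_kb() {
    awk '/^VmRSS:|^VmHWM:/ { print $1, $2, $3 }' "/proc/$1/status"
  }

  # Example: inspect the current shell
  proc_mem_kb $$

Replace ``$$`` with a PID taken from the ``PID`` column of ``top`` to inspect a running workload.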