SLURM

SLURM (Simple Linux Utility for Resource Management) is a software package for submitting, scheduling, and monitoring jobs on large compute clusters. This page details how to use SLURM for submitting and monitoring jobs on ACCRE’s Vampire cluster. New cluster users should consult our Getting Started page, Submit First Job, which walks you through creating a job script, submitting a job to the cluster, monitoring jobs, checking job usage statistics, and understanding our cluster policies. SLURM has been in use for job scheduling since early 2015; previously, Torque and Moab were used for that purpose.

This page describes the basic commands of SLURM. For more advanced topics, see the page on Parallel Processing and Job Arrays. ACCRE staff have also created a set of commands (see the page Commands for Job Monitoring) to assist you in scheduling and managing your jobs.

All the examples on this page can be downloaded from ACCRE’s Github page by issuing the following commands from a cluster gateway:

module load GCC git
git clone https://github.com/accre/SLURM.git

Batch Scripts

The first step for submitting a job to SLURM is to write a batch script, as shown below. The script includes a number of #SBATCH directive lines that tell SLURM details about your job, including its resource requirements. The example below is a simple Python job requesting 1 node, 1 CPU core, 500 MB of RAM, and 10 minutes of wall time. Note that the node count (#SBATCH --nodes=1) and CPU core count (#SBATCH --ntasks=1) must be specified on two separate lines in SLURM.

#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=1    # comments allowed 
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --mem=500M
#SBATCH --output=python_job_slurm.out

# These are comment lines
# Load the Anaconda distribution of Python, which comes
# pre-bundled with many of the popular scientific computing tools like
# numpy, scipy, pandas, scikit-learn, etc.
module load Anaconda2

# Pass your Python script to the Anaconda2 python interpreter for execution
python vectorization.py

Note that a SLURM batch script must begin with #!/bin/bash on the first line. The lines that follow begin with the #SBATCH directive, each followed by a resource request or other pertinent job information. Below the #SBATCH directives are the Linux commands needed to run your program or analysis. Once your job has been submitted via the sbatch command (details shown below), SLURM will match your resource requests with idle resources on the cluster, run your specified commands on one or more compute nodes, and then email you (if requested in your batch script) when your job begins, ends, and/or fails.

Here is a list of basic #SBATCH directives:

#SBATCH Directive Description
--nodes=[count] Node count
--ntasks-per-node=[count] Processes per node
--ntasks=[count] Total processes (across all nodes)
--cpus-per-task=[count] CPU cores per process
--nodelist=[nodes] Job host preference
--exclude=[nodes] Job host to avoid
--time=[min] or --time=[dd-hh:mm:ss] Wall clock limit
--mem=[count] RAM per node
--output=[file_name] Standard output file
--error=[file_name] Standard error file
--array=[array_spec] Launch job array
--mail-user=[email_address] Email for job alerts
--mail-type=[BEGIN or END or FAIL or REQUEUE or ALL] Email alert type
--account=[account] Account to charge
--dependency=[state:job_id] Job dependency
--job-name=[name] Job name
--constraint=[attribute] Request node attribute (skylake, sandy_bridge, haswell, csbtmp etc.)
--partition=[name] Submit job to specified partition (production, a6000x4, turing etc.)
--gres=gpu:2 Requesting 2 GPUs for use

Note that the --constraint option allows a user to target certain processor families.
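
As an illustration, the directive block below sketches how several of these options might be combined to request a small GPU job. The partition name a6000x4 is taken from the table above; the job name, time, memory, CPU, and GPU counts are placeholder values you would adjust for your own work.

#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # 4 CPU cores for the single task
#SBATCH --mem=16G                  # 16 GB of RAM on the node
#SBATCH --time=0-02:00:00          # 2 hours of wall time
#SBATCH --partition=a6000x4        # example GPU partition from the table above
#SBATCH --gres=gpu:2               # request 2 GPUs
#SBATCH --output=gpu_example.out

# Commands to run the GPU application would go below this line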

Partitions (Queues)

The partitions in SLURM can be thought of as job queues; each partition has its own set of constraints, such as job size limits, job time limits, and available GPU types. Priority-ordered jobs are allocated within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted. At ACCRE, production is the non-GPU partition, and all other partitions are GPU partitions. For the list of GPU partitions, please refer to the GPUs at ACCRE page.
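
To see which partitions are available and their current state, you can run sinfo from a cluster gateway. The commands below are a minimal sketch: the first summarizes all partitions, and the second lists details for the production partition.

sinfo -s                              # one-line summary per partition
sinfo --partition=production --long   # detailed view of the production partition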

Commands

SLURM offers a number of helpful commands for tasks ranging from job submission and monitoring to modifying resource requests for jobs that have already been submitted to the queue. Below is a list of SLURM commands:

Slurm Commands
sbatch [job_script] Job submission
squeue Job/Queue status
scancel [JOB_ID] Job deletion
scontrol hold [JOB_ID] Job hold
scontrol release [JOB_ID] Job release
sinfo Cluster status
salloc Launch interactive job
srun [command] Launch (parallel) job step
sacct Displays job accounting information


sbatch

The sbatch command is used for submitting jobs to the cluster. sbatch accepts a number of options either from the command line, or (more typically) from a batch script. An example of a SLURM batch script (called simple.slurm ) is shown below:

#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-00:15:00     # 15 minutes
#SBATCH --output=my.stdout
#SBATCH --job-name=just_a_test

# Put commands for executing job below this line
# This example is loading the Anaconda distribution of Python and
# writing out the version of Python
module load Anaconda2
python --version

To submit this batch script, a user would type:

sbatch simple.slurm

This job (called just_a_test) requests 1 compute node, 1 task (by default, SLURM will assign 1 CPU core per task), 1 GB of RAM per CPU core, and 15 minutes of wall time (the maximum time the job is allowed to run). Note that these are the defaults for any job, but it is good practice to include these lines in a SLURM script in case you need to request additional resources.

Optionally, any #SBATCH line may be replaced with an equivalent command-line option. For instance, the #SBATCH --ntasks=1 line could be removed and a user could specify this option from the command line using:

sbatch --ntasks=1 simple.slurm

The commands needed to execute a program must be included beneath all #SBATCH directives. Lines beginning with # (other than the #!/bin/bash and #SBATCH lines) are comments and are not executed by the shell. The example above simply prints the version of Python loaded in a user’s path. It is good practice to include any module load commands in your SLURM script. A real job would likely do something more complex than the example above, such as read in a Python file for processing by the Python interpreter.

For more information about sbatch see: http://slurm.schedmd.com/sbatch.html

squeue

squeue is used for viewing the status of jobs. By default, squeue will output the following information about currently running jobs and jobs waiting in the queue: Job ID, Partition, Job Name, User Name, Job Status, Run Time, Node Count, and Node List. A large number of command-line options are available for customizing the information provided by squeue. Below is a list of examples:


Command Meaning
squeue --long Provide more job information
squeue --user=USER_ID Provide information for USER_ID’s jobs
squeue --account=ACCOUNT_ID Provide information for jobs running under ACCOUNT_ID
squeue --Format=account,username,numcpus,state,reason Customize output of squeue
squeue --states=running Show running jobs only
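
These options can also be combined. As a sketch, the command below lists only your own pending jobs with a custom set of columns:

squeue --user=$USER --states=pending --Format=jobid,name,username,numcpus,state,reason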

For more information about squeue see: http://slurm.schedmd.com/squeue.html

sacct

This command is used for viewing information about completed jobs. This can be useful for monitoring job progress or diagnosing problems that occurred during job execution. Our rtracejob script is a wrapper around the sacct command, so you can use rtracejob to view the information that sacct reports.

By default, sacct will report Job ID, Job Name, Partition, Account, Allocated CPU Cores, Job State, and Exit Code for all of the current user’s jobs that completed since midnight of the current day. Many options are available for modifying the information output by sacct :

Commands Meaning
sacct --starttime 2022-12-04T00:00:00 Show the jobs since midnight of Dec 4, 2022 at 00:00:00
sacct --accounts=ACCOUNT_ID Show information for all users under ACCOUNT_ID
sacct --format=User,Timelimit,nodelist,ReqTRES,AllocTRES Show listed job information

Note that --accounts can only display data for your own account. For example, if you are a CSB user, you can only display CSB job information; if you try to display another account’s job archive data with this option, no information is returned.

The --format option is particularly useful, as it allows a user to customize the output of job usage statistics; our rtracejob script is also based on this sacct output. For instance, the Elapsed and Timelimit fields allow for a comparison of allocated vs. actual wall time, while ReqTRES and AllocTRES track the requested and allocated resources, such as CPU cores, memory, and GPUs.
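
For example, the following sketch combines these options to compare requested and used resources for your jobs since a given date (the date is a placeholder):

sacct --starttime=2022-12-04 --format=JobID,JobName,Partition,Elapsed,Timelimit,ReqTRES,AllocTRES,State,ExitCode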

For more information about sacct see: http://slurm.schedmd.com/sacct.html

scontrol

scontrol is used for monitoring and modifying queued jobs, as well as holding and releasing jobs. One of its most powerful options is the scontrol show job option. Below is a list of useful scontrol commands:

Command Meaning
scontrol show job JOB_ID Show information for queued or running job
scontrol release job JOB_ID Release hold on job
scontrol hold JOB_ID Place hold on job
scontrol show nodes Show hardware details for nodes on cluster
scontrol update JobID=JOB_ID Timelimit=1-12:00:00 Change wall time to 1 day 12 hours
scontrol update JobID=JOB_ID Dependency=afterany:OTHER_JOB_ID Add job dependency so that the job only starts after OTHER_JOB_ID completes
scontrol --help Show all options

Please note that the time limit or memory of a job can only be adjusted for pending jobs, not for running jobs.
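
As a sketch, the commands below extend the wall time of a pending job and then verify the change (1234567 is a placeholder job ID):

scontrol update JobID=1234567 Timelimit=1-12:00:00
scontrol show job 1234567 | grep TimeLimit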

For more information about scontrol see: http://slurm.schedmd.com/scontrol.html

salloc

salloc is used to launch an interactive job on a compute node. This can be useful for troubleshooting/debugging a program or for programs that require user input. To launch an interactive job requesting 1 node, 2 CPU cores, and 1 hour of wall time, a user would type:

salloc --nodes=1 --ntasks=2 --time=1:00:00

This command will execute and then wait for the allocation to be granted. Once the allocation is granted, an interactive shell is initiated on the allocated node (or one of the allocated nodes, if multiple nodes were allocated). At this point, a user can execute normal commands and launch applications as usual.

Note that all of the sbatch options also apply to salloc, so a user can add other typical resource requests, such as memory. Another useful feature of salloc is that it enforces resource requests, preventing users or applications from using more resources than were requested. For example:

[bob@vmps12 ~]$ salloc --nodes=1 --ntasks=2 --time=1:00:00
salloc: Pending job allocation 1772833
salloc: job 1772833 queued and waiting for resources
salloc: job 1772833 has been allocated resources
salloc: Granted job allocation 1772833
[bob@vmp586 ~]$ hostname
vmp586
[bob@vmp586 ~]$ srun -n 2 hostname
vmp586
vmp586
[bob@vmp586 ~]$ srun -n 4 hostname
srun: error: Unable to create job step: More processors requested than permitted
[bob@vmp586 ~]$ exit
exit
srun: error: vmp586: task 0: Exited with exit code 1
salloc: Relinquishing job allocation 1772833
salloc: Job allocation 1772833 has been revoked.
[bob@vmps12 ~]$

In this example, srun -n 4 failed because only 2 tasks were allocated for this interactive job. Also note that typing exit during the interactive session will kill the interactive job, even if the allotted wall time has not been reached.

For more information about salloc see: http://slurm.schedmd.com/salloc.html

srun

Finally, srun is used to launch job steps, including parallel tasks, within a job allocation. More information about srun is available in Parallel Processing and Job Arrays.
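
As a minimal sketch, inside a batch script that requests several tasks, srun launches a command once per allocated task (the SBATCH values are placeholders):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:05:00
#SBATCH --output=srun_example.out

# Launch one copy of the command per allocated task (4 copies here)
srun hostname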

Environment Variables

Variables Meaning
SLURM_JOBID Job ID
SLURM_JOB_NODELIST Names of nodes allocated to job
SLURM_ARRAY_TASK_ID Task id within job array
SLURM_NNODES Number of nodes allocated to job

Each of these environment variables can be referenced from a SLURM batch script using the $ symbol before the name of the variable (e.g. echo $SLURM_JOBID). A full list of SLURM environment variables can be found here: http://slurm.schedmd.com/sbatch.html#lbAF
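
A minimal sketch of referencing these variables from within a batch script:

# Record the job ID and the nodes the job is running on
echo "Job $SLURM_JOBID is running on $SLURM_NNODES node(s): $SLURM_JOB_NODELIST"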

Temp space management while using SLURM on ACCRE

While working on SLURM jobs, users are encouraged to store temporary files inside the /tmp directory.

Our unenforced rules for /tmp are:

1. Don't use more than 10 GB per CPU core.
2. Clean up after yourself, even in most cases of job failure.

For cleanup, users can use the bash `trap` builtin, which runs specified commands or functions when the SLURM job script exits or is interrupted, as in the sketch below. Alternatively, users can source the pre-existing setup_accre_runtime_dir script, which sets such a trap so that jobs clean up automatically in most cases.
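
A minimal sketch of the trap approach in a batch script; the directory naming scheme and SBATCH values are illustrative:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=00:30:00

# Create a per-job scratch directory under /tmp (illustrative naming scheme)
MY_TMP=$(mktemp -d /tmp/${USER}.j${SLURM_JOB_ID}.XXXX)

# Remove the directory when the job script exits, even after a failure
trap 'rm -rf "$MY_TMP"' EXIT

# ... run your program here, writing temporary files to $MY_TMP ...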

Here is an example using the setup_accre_runtime_dir script instead:

$ salloc
salloc: Pending job allocation 61966663
salloc: job 61966663 queued and waiting for resources
salloc: job 61966663 has been allocated resources
salloc: Granted job allocation 61966663
salloc: Waiting for resource configuration
salloc: Nodes cn1271 are ready for job
(base) [koirap1@cn1271 tmp]$ source setup_accre_runtime_dir
(base) [koirap1@cn1271 tmp]$ cd $ACCRE_RUNTIME_DIR
(base) [koirap1@cn1271 u906449.j61966663.hQk7]$ pwd
/tmp/u906449.j61966663.hQk7
(base) [koirap1@cn1271 u906449.j61966663.hQk7]$ exit
exit
salloc: Relinquishing job allocation 61966663
salloc: Job allocation 61966663 has been revoked.
(base) [koirap1@gw346 tmp]$ cd /tmp/u906449.j61966663.hQk7
-bash: cd: /tmp/u906449.j61966663.hQk7: No such file or directory

Notice how, after the job allocation was revoked, we could no longer access the /tmp/u906449.j61966663.hQk7 directory.

Here is the setup_accre_runtime_dir script’s source, in case users need to customize it:

$ cat /accre/usr/bin/setup_accre_runtime_dir
# This file sets up a per-job runtime directory under /tmp
# and should be sourced, not executed.

cleanup_accre_runtime_dir()
{
  return_val=$?
  if [[ $ACCRE_RUNTIME_DIR == /tmp/u${UID}.j${SLURM_JOB_ID}* ]]; then
    rm -rf ${ACCRE_RUNTIME_DIR}
  fi
  exit ${return_val}
}

if [ -z "$ACCRE_RUNTIME_DIR" ]; then
  export ACCRE_RUNTIME_DIR=$(mktemp -d /tmp/u${UID}.j${SLURM_JOB_ID}.XXXX)
  trap 'cleanup_accre_runtime_dir' EXIT
fi
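
The example above uses an interactive salloc session. Below is a minimal sketch of the same pattern in a batch script, assuming setup_accre_runtime_dir is found on your PATH as in the example (otherwise source it by its full path, /accre/usr/bin/setup_accre_runtime_dir); the SBATCH values and output filename are placeholders.

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --output=tmp_example.out

# Create the per-job /tmp directory and register its cleanup trap
source setup_accre_runtime_dir

# Write temporary files inside the runtime directory
cd $ACCRE_RUNTIME_DIR

# ... run your program here ...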