SLURM

From ACCRE Wiki

SLURM (Simple Linux Utility for Resource Management) is a software package for submitting, scheduling, and monitoring jobs on large compute clusters. This page details how to use SLURM for submitting and monitoring jobs on the ACCRE computing cluster. New cluster users should consult our Getting Started page Submit First Job, which walks you through creating a job script, submitting a job to the cluster, monitoring jobs, checking job usage statistics, and understanding our cluster policies.

This page describes the structure of the SLURM setup at ACCRE and the basic commands of SLURM. For more advanced topics, see the page on Parallel Processing and Job Arrays. ACCRE staff have also created a set of commands (see the page Commands for Job Monitoring) to assist you in scheduling and managing your jobs.

Cluster Structure and Partitions

Computing resources in the ACCRE cluster are organized into four partitions specified by use case and type of hardware. A partition in SLURM can be thought of as a job queue with an assortment of constraints for jobs that may be executed. Priority-ordered jobs are allocated within a partition until the resources (nodes, processors, memory, etc.) within that partition are exhausted.

The primary partition in the ACCRE cluster is the "batch" partition. This partition is the default partition that SLURM jobs will be submitted to when a partition is not specified. Groups that have contributed resources to ACCRE via renting CPU cores or node support will gain a corresponding fairshare of the cluster. This fairshare is an integer value which at ACCRE corresponds to the number of physical CPU cores that have been contributed to the cluster. Under normal operational circumstances, assuming a group is routinely submitting jobs to the batch partition, a group should be able to average their fairshare in total CPU cores utilized by all their running jobs.

As is typical for High Performance and High Throughput Computing clusters, the batch partition allows for groups to burst well beyond their contributed fairshare. For example, a group that had contributed 12 physical CPU-cores would often be able to concurrently run jobs utilizing 50 or more cores at once. This burst capacity allows much greater throughput for batch scheduled jobs, but as a consequence a user may need to wait for several hours or in some cases a few days before their jobs begin to run.

The batch partition is set up to optimize throughput over all cluster groups according to fairshare over a time period of several days to weeks. However, for some workflows immediate access may be desired. For groups needing access to computing resources on-demand, without having to wait in the queue, ACCRE now provides an "interactive" partition. In order to ensure prompt access for all group workflows, there is no bursting allowed in the interactive partition, and the computing resources in the partition are deliberately overprovisioned so that there will be sufficient CPU cores available for all groups up to the amount that they have contributed. Instead of fairshare determining access priority, in the interactive partition resource access is governed by a SLURM Quality of Service (QoS). When a group contributes resources to the interactive partition, they are allocated a QoS with a limit according to the CPUs and memory that they have contributed. The group has control over which slurm accounts will be permitted to access their QoS, and a group may elect to split their contributed resources into multiple QoSs. When jobs are submitted to the interactive partition using a specified QoS, under normal operational circumstances they will run immediately unless the group has exceeded the limits imposed on their QoS. ACCRE provides helpful utilities for groups to monitor the usage of their QoS by group members.

Compute nodes with GPU hardware are managed in their own corresponding partitions "batch_gpu" and "interactive_gpu". The design of these partitions is similar to "batch" and "interactive", except that resources are limited by the type and number of GPUs contributed by a group. For information about these partitions, please refer to GPUs at ACCRE.

Limitations and Advantages of the Batch Partition

Because the batch partition allows groups to "burst", temporarily exceeding the resources that they have contributed to the cluster, ACCRE cannot guarantee that underserved groups will immediately see large jobs or job arrays begin. The batch and batch_gpu partitions are designed to maximize time-averaged usage and throughput, and as a result an underserved group may on occasion see few jobs starting for a day or two even though their submitted arrays have higher priority. The scheduler cannot begin new jobs until existing jobs complete, and since ACCRE allows up to a 14-day runtime for jobs, it may take some time for resources to become available.

Please understand that ACCRE cannot provide any guarantees concerning the availability of bursting and reserves the right to adjust burst limits above fairshare as current resources allow to maximize overall resource utilization. In addition to burst limits on the total number of physical CPU cores, ACCRE will also place limits on total memory usage and requested wall-clock time of aggregate running jobs. If most or all groups that have contributed significant resources are heavily using their resources, then bursting may be extremely limited. However, we have found by experience that the most frequent operational state of the cluster is such that without bursting many resources will remain idle and that by allowing bursting the shared resource pool works more efficiently to maximize the computational work done. While our primary aim is to optimize the system for maximum utilization over time, we also try to adjust things so that workloads generally start working at scale within 48 hours of submission, although this is impossible to guarantee in all situations.

If ensuring that workloads start promptly after submission is considerably more important to a group than allowing for bursting beyond their contributed resources and overall throughput, then we suggest the group leader contact ACCRE management to shift their contributed resources from the batch partition to the interactive partition.

Policy for high memory usage in the batch partitions

One other batch partition policy concerns high-memory jobs. At ACCRE, the total system memory of compute nodes varies, but most nodes in the infrastructure have between 6 and 8 GiB of installed system memory per physical CPU core. If users were to submit enough jobs asking for only one physical CPU core but over 100 GiB of memory, the system would be unable to meet the fairshare expectations of all of our users, as most remaining physical CPU cores would be effectively "memory starved", with no memory left on the system to be reasonably allocated to other jobs. To prevent memory starvation, ACCRE imposes a maximum memory request of 20 GiB per physical CPU core. If a job exceeds this maximum, the job request will automatically be adjusted to add additional CPU cores. For example, if a job requires 100 GiB of memory and only one physical CPU core, the scheduler will automatically adjust the request to 5 physical CPU cores and count this against the account's fairshare priority and burst limit accordingly.
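The adjustment is simple ceiling arithmetic. The sketch below (variable names are illustrative only, not part of any ACCRE tool) shows how a memory request maps to a charged core count under the 20 GiB-per-core cap:

```shell
# Sketch: how a 20 GiB-per-core cap converts a memory request into a core count.
# Variable names are illustrative only.
mem_gib=100          # memory requested by the job, in GiB
max_gib_per_core=20  # ACCRE's per-core memory cap

# ceiling division: smallest core count such that mem/cores <= 20 GiB
cpus=$(( (mem_gib + max_gib_per_core - 1) / max_gib_per_core ))
echo "$cpus"         # a 100 GiB, 1-core request is charged as 5 cores
```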

SLURM Account structure at ACCRE

Each SLURM job is submitted under a specific SLURM account which a user has access to. At ACCRE, SLURM accounts generally correspond to linux groups, and a user's primary linux group is typically set as the user's default SLURM account of the same name. If a user does not specify an account in their job submission, the default account will be used. In each lab, the principal investigator or lab leader has full authority over which users may access the SLURM account, by requesting or approving that a user be added to or removed from the corresponding group.

Each SLURM account corresponding to a group is organized under a special SLURM account which at ACCRE is known as a "parent account". No users are members of parent accounts and parent accounts only exist to provide lab leaders with the ability to share and restrict resources among multiple groups. Jobs may not be submitted directly under parent accounts. At ACCRE, an account that is a parent account has a suffix "_account" at the end of its name.

In the batch partition, the lab fairshare corresponding to the number of physical CPU cores that the lab has contributed to the cluster is set to their parent account, and it is this fairshare that determines job priority relative to jobs submitted by other labs. Typically accounts within a parent account have their fairshare set to one, as the fairshare of these group accounts only affects job priority relative to other accounts under that parent account, and not relative to the jobs of other labs.

This relationship may be better understood by an example. Consider three researchers Alice, Bob, and Charles, each the principal investigator of their respective lab, who have each contributed physical CPU cores to the batch partition at ACCRE. The table below shows the fairshare and CPU burst limits of each set of accounts.

PARENT ACCOUNT        GROUP ACCOUNT        FAIRSHARE   CPU CORE LIMIT
--------------------  -------------------  ---------   --------------
alice_lab_account                          800         1400
                      alice_lab            9           -
                      alice_astro_class    1           -
bob_lab_account                            400         1000
                      bob_lab              1           -
                      bob_lab_imaging      1           600
charles_lab_account                        200         800
                      charles_lab          1           -

The simplest lab in this example is Charles' lab, which has a fairshare of 200 and only one group underneath their parent account. All of Charles' lab members have access to the charles_lab linux group and slurm account and share equally in the fairshare of 200 cores. The fairshare of 1 for the charles_lab is irrelevant as it is the only group under the parent account. The effective CPU burst limit for charles_lab is 800 as it is inherited from the parent account.

Bob has decided to add an additional account under his parent account for a special imaging project that has significant CPU requirements and will submit a lot of jobs. He has chosen to name this special group bob_lab_imaging and ACCRE has approved this group name. Both groups have a fairshare of 1, which means that they are prioritized equally relative to one another and share equally in the parent account's fairshare of 400. However, Bob's lab members have run into delays because the imaging project jobs often hit the parent account's burst limit of 1000 CPU cores, so Bob has requested that ACCRE set a voluntary burst limit of 600 cores on the bob_lab_imaging account so that jobs submitted from the imaging project alone will never consume all resources potentially available to Bob's lab.

In the case of Alice's lab, Alice would also like her resources to be available to her Astronomy class students, so she has asked ACCRE to set up a separate alice_astro_class group for her students to request ACCRE accounts under. She would like for her research work to generally take priority over jobs submitted for the class, so she has set the fairshare for alice_lab to 9 and alice_astro_class to 1. These group account fairshares will only change the priority of jobs within Alice's two groups relative to one another. All jobs submitted by either group will be treated as having a fairshare of 800 in relation to jobs submitted from other labs.

Lab administrators can also use multiple groups under their parent account to control access to other resources. They can specify which groups should be able to access their batch GPU resources if they have contributed any, and if they have any interactive QoSs, they can limit access to each QoS to the groups of their choosing under their parent account.

Account Suffixes and Partitions

Slurm determines job priority for an account by considering recent resource usage over all partitions in a cluster. However, researchers at Vanderbilt have expressed a preference for considering resource usage in each partition separately. In other words, recent CPU usage in the "batch" partition should not affect priority in the "batch_gpu" or "interactive" partitions. In order to track resource usage separately in each partition, a separate slurm account is created for each major partition and designated with a suffix.

For example, if the group bob_lab has resources in "batch", "batch_gpu", "interactive", and "interactive_gpu", ACCRE will create four different slurm accounts corresponding to the group bob_lab. The slurm account bob_lab is used to submit jobs to the "batch" partition, the account bob_lab_acc is used to submit jobs to the "batch_gpu" partition, the slurm account bob_lab_int is used to submit jobs to the "interactive" partition, and the slurm account bob_lab_iacc is used to submit jobs to the "interactive_gpu" partition. This way cluster usage in each partition will be tracked completely separately and a group's "batch_gpu" usage will have no effect on the priority of their "batch" jobs. The slurm_resources command will guide users on what account to specify for each partition, GPU type, and/or QoS for their jobs. Note that ACCRE considers bob_lab, bob_lab_acc, bob_lab_int, and bob_lab_iacc to be effectively the same group in terms of membership - the users in each of these slurm accounts are always the same.

Batch Scripts

The first step for submitting a non-interactive job to SLURM is to write a batch script, as shown below. The script includes a number of #SBATCH directive lines that tell SLURM details about your job, including its resource requirements. The example below is a simple Python job requesting 1 node, 1 task (process), 2 physical CPU cores, 8 GB of RAM, and 2 hours of wall time.

#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=1    # comments allowed 
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=02:00:00
#SBATCH --mem=8G
#SBATCH --partition=batch
#SBATCH --account=accre_lab
#SBATCH --output=python_job_slurm.out

# These are comment lines
# Load python 3.12 and the scipy-stack, which is a bundle
# of many popular scientific computing tools like
# numpy, pandas, and matplotlib
setup_accre_software_stack
module load python/3.12.4 scipy-stack/2025a

# Pass your Python script to the python interpreter for execution
python my_analysis_code.py

Note that the batch partition is used by default if no partition is specified. To see a list of valid slurm accounts with access to the batch partition for your user, use the "slurm_resources" command.

A SLURM batch script must begin with the #!/bin/bash directive on the first line. The subsequent lines begin with the SLURM directive #SBATCH followed by a resource request or other pertinent job information. Email alerts will be sent to the specified address when the job begins, aborts, and ends. Below the #SBATCH directives are the Linux commands needed to run your program or analysis. Once your job has been submitted via the sbatch command (details shown below), SLURM will match your resource requests with idle resources on the cluster, run your specified commands on one or more compute nodes, and then email you (if requested in your batch script) when your job begins, ends, and/or fails.

The example above describes a single job to be run on a single node (server), intended to execute a single process that may use multiple CPU cores through multithreading. For information about submitting jobs that run multiple processes (tasks) in concert, or arrays of independent jobs submitted with a single script, see Parallel Processing and Job Arrays.

Here is a list of basic #SBATCH directives:

#SBATCH Directive                                Description
--nodes=[count]                                  Node count
--tasks-per-node=[count]                         Processes per node
--ntasks=[count]                                 Total processes (across all nodes)
--cpus-per-task=[count]                          CPU cores per process
--time=[min] or --time=[dd-hh:mm:ss]             Wall clock limit
--mem=[count]                                    RAM per node
--output=[file_name]                             Standard output file
--error=[file_name]                              Standard error file
--array=[array_spec]                             Launch job array
--mail-user=[email_address]                      Email for job alerts
--mail-type=[BEGIN, END, FAIL, REQUEUE, or ALL]  Email alert type
--account=[account]                              Account to charge
--depend=[state:job_id]                          Job dependency
--job-name=[name]                                Job name
--constraint=[attribute]                         Request node attribute (skylake, sandy_bridge, haswell, csbtmp, etc.)
--partition=[name]                               Submit job to specified partition (batch, interactive, etc.)

Note that the --constraint option allows a user to target certain processor families.
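For example, to restrict a job to Skylake-generation nodes, a user could add the following directive to their batch script (any attribute from the list above could be substituted):

```
#SBATCH --constraint=skylake
```

The same option may also be passed on the command line, e.g. sbatch --constraint=skylake simple.slurm.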

Interactive Jobs

In many non-traditional use cases, and for rapid debugging, users may wish to run jobs interactively or submit batch scripts intended to run immediately, allowing rapid analysis and development. The interactive partition is designed for this use case, but it is also possible to run interactive jobs in the batch partition, although at times resources will not be immediately available and one must wait for jobs to run.

When submitting jobs to the interactive partition one must use a separate slurm account from the one used to submit to the batch partition. Accounts that are allowed to submit to the interactive partition will end with the suffix "_int" and accounts that are allowed to submit to the interactive_gpu partition will end with the suffix "_iacc". The primary reason for separate accounts for different partitions is to ensure that usage in one partition does not affect job priority in a different partition. You can use the "slurm_resources" command to see what accounts are available to your user.

Any job submitted to the interactive partition must specify a QoS to use. The "slurm_resources" command will tell you what accounts and QoSs are available to your user. The "qosstate" command will tell you the current usage and resource availability of the QoSs that your user has access to.

In addition to group QoSs that are created for use of contributed resources, ACCRE has set aside a small pool of resources for all users to use for interactive debugging and test jobs. The interactive debug QoS is named "debug_int" and has a special per-user restriction so that each user may not use more than 16 physical CPU cores and 192GB of memory at a time, and limits jobs to no more than 30 minutes. There is also an interactive_gpu debug QoS named "debug_iacc" with a similar per-user restriction, allowing access to one A4000 GPU per user for up to 30 minutes.

The "salloc" command will launch an interactive shell session in a job from the command line. For example, the following will launch a job using the debug_int QoS on the interactive partition with 3 physical CPU cores, 6 GB of memory, and a time limit of 15 minutes:

[bob@gw01 ~]$ salloc --partition=interactive --account=bob_lab_int --qos=debug_int --mem=6G --cpus-per-task=3 --time=00:15:00
salloc: Pending job allocation 2133286
salloc: job 2133286 queued and waiting for resources
salloc: job 2133286 has been allocated resources
salloc: Granted job allocation 2133286
salloc: Waiting for resource configuration
salloc: Nodes cn1287 are ready for job
[bob@cn1287 ~]$

Note that if you leave out the --time directive you will get the default of 30 minutes. To run this example as your user, replace "bob_lab_int" with your own interactive slurm account; use the "slurm_resources" command to see what accounts are valid for your user on the interactive partition.

You can also submit batch scripts to the interactive partition just as you would with batch, and if resources are available to your specified QoS then the scheduler will initiate your job immediately on its next pass, which will typically be within 60 seconds or so. The equivalent #SBATCH directives to the above "salloc" command would be:

#SBATCH --partition=interactive
#SBATCH --account=bob_lab_int
#SBATCH --qos=debug_int
#SBATCH --mem=6G
#SBATCH --cpus-per-task=3
#SBATCH --time=00:15:00

In addition to these options, ACCRE users have access to graphical interactive desktops, JupyterLab servers, and RStudio servers run as SLURM jobs via the ACCRE Visualization_Portal.

If your group has access to a dedicated QoS besides the debug QoS, you can also create a bash alias on your workstation to launch a job using that QoS via ssh. This will create a user experience that mimics logging into a private group server, even though you are actually accessing a pool of resources. An example of such an alias is shown below. In this example the username is bob, the slurm account is bob_lab_int, the QoS is bob_lab_2_int, and the user typically wants to dedicate 16 GB of memory and 4 physical CPU cores to the shell session. You can replace these values with ones that make sense for your group and name the alias whatever you prefer:

alias my_accre_int='ssh -t bob@login.accre.vu "qosstate bob_lab_2_int; salloc --time=7-00:00:00 --mem=16G --cpus-per-task=4 --account=bob_lab_int --partition=interactive --qos=bob_lab_2_int"'

This can be placed in your .bashrc file if desired. Then when you run the command my_accre_int and enter your VUNetID credentials you will be launched into a shell session using your QoS resources that will last up to seven days or until you exit. Prior to beginning the interactive session, it will also print out the current usage of your group QoS.

If you require X11 forwarding you can adjust the above to:

alias my_accre_int='ssh -Y -t bob@login.accre.vu "qosstate bob_lab_2_int; salloc --x11 --time=7-00:00:00 --mem=16G --cpus-per-task=4 --account=bob_lab_int --partition=interactive --qos=bob_lab_2_int"'

Basic Slurm Commands

SLURM offers a number of helpful commands for tasks ranging from job submission and monitoring to modifying resource requests for jobs that have already been submitted to the queue. Below is a list of SLURM commands:

Slurm Commands
sbatch [job_script]        Job submission
squeue                     Job/Queue status
scancel [JOB_ID]           Job deletion
scontrol hold [JOB_ID]     Job hold
scontrol release [JOB_ID]  Job release
sinfo                      Cluster status
salloc                     Launch interactive job
srun [command]             Launch (parallel) job step
sacct                      Display job accounting information


sbatch

The sbatch command is used for submitting jobs to the cluster. sbatch accepts a number of options either from the command line, or (more typically) from a batch script. An example of a SLURM batch script (called simple.slurm) is shown below:

#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-00:15:00     # 15 minutes
#SBATCH --output=my.stdout
#SBATCH --job-name=just_a_test

# Put commands for executing job below this line
# This example is loading a Python interpreter module and
# writing out the version of Python
setup_accre_software_stack
module load python/3.12.4
python --version

To submit this batch script, a user would type:

sbatch simple.slurm

This job (called just_a_test) requests 1 compute node, 1 task (by default, SLURM will assign 1 physical CPU core per task), 1 GB of RAM per CPU core, and 15 minutes of wall time (the maximum time the job is allowed to run). Note that these are the defaults for any job, but it is good practice to include these lines in a SLURM script in case you need to request additional resources.

Optionally, any #SBATCH line may be replaced with an equivalent command-line option. For instance, the #SBATCH --ntasks=1 line could be removed and a user could specify this option from the command line using:

sbatch --ntasks=1 simple.slurm

The commands needed to execute a program must be included beneath all #SBATCH directives. Lines beginning with the # symbol (other than the #!/bin/bash and #SBATCH lines) are comment lines that are not executed by the shell. The example above simply prints the version of Python loaded in a user's path. It is good practice to include any module load commands in your SLURM script. A real job would likely do something more complex than the example above, such as read in a Python file for processing by the Python interpreter.

For more information about sbatch see: http://slurm.schedmd.com/sbatch.html

squeue

squeue is used for viewing the status of jobs. By default, squeue will output the following information about currently running jobs and jobs waiting in the queue: Job ID, Partition, Job Name, User Name, Job Status, Run Time, Node Count, and Node List. A large number of command-line options are available for customizing the information provided by squeue. Below is a list of examples:


Command                                                 Meaning
squeue --long                                           Provide more job information
squeue --user=USER_ID                                   Provide information for USER_ID's jobs
squeue --account=ACCOUNT_ID                             Provide information for jobs running under ACCOUNT_ID
squeue --Format=account,username,numcpus,state,reason   Customize output of squeue
squeue --states=running                                 Show running jobs only

For more information about squeue see: http://slurm.schedmd.com/squeue.html

sacct

This command is used for viewing information for completed jobs. This can be useful for monitoring job progress or diagnosing problems that occurred during job execution. Our rtracejob script is a wrapper around the sacct command, so you can use rtracejob to view the information that sacct outputs.

By default, sacct will report Job ID, Job Name, Partition, Account, Allocated CPU Cores, Job State, and Exit Code for all of the current user's jobs that completed since midnight of the current day. Many options are available for modifying the information output by sacct:

Command                                                    Meaning
sacct --starttime 2022-12-04T00:00:00                      Show jobs since midnight of Dec 4, 2022 at 00:00:00
sacct --accounts=ACCOUNT_ID                                Show information for all users under ACCOUNT_ID
sacct --format=User,Timelimit,nodelist,ReqTRES,AllocTRES   Show the listed job information fields

It's worth noting that --accounts can only display data for your own accounts. For example, if you are a CSB user, you can only display CSB job information. If you try to display another account's job archive data with this option, no information will be returned.

The --format option is particularly useful, as it allows a user to customize the output of job usage statistics. Our rtracejob script is also based on this sacct output. For instance, the Elapsed and Timelimit fields allow for a comparison of allocated vs. actual wall time. ReqTRES and AllocTRES track the requested and allocated resources, such as CPU cores, memory, and GPUs.

For more information about sacct see: http://slurm.schedmd.com/sacct.html

scontrol

scontrol is used for monitoring and modifying queued jobs, as well as holding and releasing jobs. One of its most powerful subcommands is scontrol show job. Below is a list of useful scontrol commands:

Command                                                Meaning
scontrol show job JOB_ID                               Show information for queued or running job
scontrol release JOB_ID                                Release hold on job
scontrol hold JOB_ID                                   Place hold on job
scontrol show nodes                                    Show hardware details for nodes on cluster
scontrol update JobID=JOB_ID Timelimit=1-12:00:00      Change wall time to 1 day 12 hours
scontrol update JobID=JOB_ID Dependency=afterany:OTHER_JOB_ID   Start job only after OTHER_JOB_ID completes
scontrol --help                                        Show all options

Please note that the time limit or memory of a job can only be adjusted for pending jobs, not for running jobs.

For more information about scontrol see: http://slurm.schedmd.com/scontrol.html

salloc

The function of salloc is to launch an interactive job on compute nodes. This can be useful for troubleshooting/debugging a program or if a program requires user input. To launch an interactive job requesting 1 node, 2 physical CPU cores total for 2 processes, and 1 hour of wall time, a user would type:

salloc --nodes=1 --ntasks=2 --time=1:00:00

This command will execute and then wait for the allocation to be obtained. Once the allocation is granted, an interactive shell is initiated on the allocated node (or one of the allocated nodes, if multiple nodes were allocated). At this point, a user can execute normal commands and launch their application as usual.

Note that all of the sbatch options are also applicable for salloc, so a user can insert other typical resource requests, such as memory. Another useful feature of salloc is that it enforces resource requests, preventing users or applications from using more resources than were requested. For example:

[bob@vmps12 ~]$ salloc --nodes=1 --ntasks=2 --time=1:00:00
salloc: Pending job allocation 1772833
salloc: job 1772833 queued and waiting for resources
salloc: job 1772833 has been allocated resources
salloc: Granted job allocation 1772833
[bob@vmp586 ~]$ hostname
vmp586
[bob@vmp586 ~]$ srun -n 2 hostname
vmp586
vmp586
[bob@vmp586 ~]$ srun -n 4 hostname
srun: error: Unable to create job step: More processors requested than permitted
[bob@vmp586 ~]$ exit
exit
srun: error: vmp586: task 0: Exited with exit code 1
salloc: Relinquishing job allocation 1772833
salloc: Job allocation 1772833 has been revoked.
[bob@vmps12 ~]$

In this example, srun -n 4 failed because only 2 tasks were allocated for this interactive job. Also note that typing exit during the interactive session will kill the interactive job, even if the allotted wall time has not been reached.

For more information about salloc see: http://slurm.schedmd.com/salloc.html

srun

Finally, srun is used to launch tasks (job steps), including parallel job steps, within a job allocation. More information about srun is available in Parallel Processing and Job Arrays.

Environment Variables

Variable              Meaning
SLURM_JOBID           Job ID
SLURM_JOB_NODELIST    Names of nodes allocated to job
SLURM_ARRAY_TASK_ID   Task ID within job array
SLURM_NNODES          Number of nodes allocated to job

Each of these environment variables can be referenced from a SLURM batch script using the $ symbol before the name of the variable (e.g. echo $SLURM_JOBID). A full list of SLURM environment variables that are set for a job can be found here: http://slurm.schedmd.com/sbatch.html#lbAK
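For example, a batch script can use these variables to build per-job or per-task file names. The sketch below uses an illustrative filename pattern; the :-0 defaults exist only so the snippet also runs outside a real SLURM job, where these variables are unset:

```shell
# Build a unique output file name from SLURM environment variables.
# Outside a real job the variables are unset, so default them to 0 for this sketch.
JOB_ID=${SLURM_JOBID:-0}
TASK_ID=${SLURM_ARRAY_TASK_ID:-0}
OUTFILE="result_job${JOB_ID}_task${TASK_ID}.txt"
echo "$OUTFILE"
```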

Using Temporary Local Disk Storage in SLURM Jobs

Users are welcome to temporarily store small files in the /tmp directory on the local disk of the compute node where their job is running. Copying files to /tmp on the local node can in many cases result in dramatically improved job performance. For example, creating a temporary python virtual environment inside /tmp may take a minute or two of job overhead at the beginning of a job, but may be much faster than using an existing virtual environment on the network filesystem serving /home or /data. The reason for this is that network filesystems require file locking mechanisms and network communication that can be avoided when accessing the local disk.

The rules for using /tmp on a compute node are as follows:

1. Users should not exceed 10GB of disk usage per physical CPU core allocated to their job

2. Users are responsible for ensuring that the file permissions are secure so that other users cannot access their temporary files

3. Users must delete all temporary files prior to job completion, even in the case of job failure.

To clean up files even in the case of job failure, users can use the `trap` shell builtin in their batch submission script. However, as this is an advanced shell scripting concept, ACCRE provides a script, setup_accre_runtime_dir, that automatically sets up a secure temporary directory and sets a trap to delete the directory and all files within it upon job completion, even if the job fails, in most circumstances. This script should be run using the "source" shell builtin, as source setup_accre_runtime_dir, because it sets an environment variable $ACCRE_RUNTIME_DIR with the path to the newly created temporary directory, which can then be used later in the batch submission script or interactive session.
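For users who would rather manage cleanup themselves, the underlying trap pattern looks roughly like the sketch below (the directory naming is illustrative; setup_accre_runtime_dir takes care of all of this for you):

```shell
# Create a private temporary directory on local disk and guarantee its removal.
# mktemp -d creates the directory with mode 700, so other users cannot read it.
MYTMPDIR=$(mktemp -d "/tmp/${USER:-user}.XXXXXX")

# The trap runs when the script exits, including after failures,
# removing the directory and everything inside it.
trap 'rm -rf "$MYTMPDIR"' EXIT

# ... job commands writing to "$MYTMPDIR" would go here ...
echo "working in $MYTMPDIR"
```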

Here is an example of using the setup_accre_runtime_dir script in an interactive job:

[appelte1@gw00 ~]$ salloc --account=accre_int --partition=interactive --qos=debug_int
salloc: Pending job allocation 2136256
salloc: job 2136256 queued and waiting for resources
salloc: job 2136256 has been allocated resources
salloc: Granted job allocation 2136256
salloc: Waiting for resource configuration
salloc: Nodes cn1287 are ready for job
[appelte1@cn1287 ~]$ source setup_accre_runtime_dir
[appelte1@cn1287 ~]$ cd $ACCRE_RUNTIME_DIR
[appelte1@cn1287 u111694.j2136256.luLc]$ pwd
/tmp/u111694.j2136256.luLc
[appelte1@cn1287 u111694.j2136256.luLc]$ exit
exit
salloc: Relinquishing job allocation 2136256
[appelte1@gw00 ~]$

Maximum Jobs in the Slurm Queue

Generally speaking, ACCRE users are permitted and encouraged to submit work in reasonably large arrays. However, Slurm is primarily designed as a High Performance Computing scheduler and is not very effective at handling very large quantities of short, low-resource jobs. For this reason Slurm clusters have a maximum number of total jobs permitted in the queue. The default maximum is 10,000 jobs, but at ACCRE we have configured Slurm to accept up to 500,000 jobs. Job counts in excess of several hundred thousand can cause the performance of the scheduler to suffer and may therefore impact the work of other users.

ACCRE has adopted a policy that each individual user should not enqueue more than 30,000 jobs at a time without special permission from ACCRE management. If you believe that you have a legitimate use case that requires such a large number of individual jobs, please open a helpdesk ticket to start a discussion about your workflow, but note that ACCRE may not be able to accommodate all such requests.

Running a large number of very short jobs is highly inefficient, as the Slurm scheduler requires time to set up resources for each job, which becomes significant for a job that will last only a few minutes. For this reason we recommend that users design their workflows so that individual jobs run for 12 to 48 hours. If your workflow is naturally divided into samples that each require only a few minutes to run, the best approach is to organize them into batches so that each job runs over several samples and therefore lasts several hours. One can also request additional CPU cores and run multiple samples in parallel within a single job. At the other extreme, if your workflow requires over 48 hours, we strongly recommend dividing it into multiple stages and running each stage as a shorter job.
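As a sketch of the batching approach, the loop below runs several short samples back-to-back inside one job; the sample names and the commented-out analysis command are placeholders for your own workflow:

```shell
# Sketch: process a batch of short samples sequentially within a single job,
# instead of submitting one tiny job per sample.
samples="sample_01 sample_02 sample_03 sample_04"

count=0
for s in $samples; do
    echo "processing $s"   # replace with your real command, e.g. ./analyze "$s"
    count=$((count + 1))
done
echo "processed $count samples in a single job"
```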