Parallel Processing and Job Arrays

Introduction

Below are example SLURM scripts for jobs employing parallel processing. In general, parallel jobs can be separated into four categories:

  • Distributed memory programs that include explicit support for message passing between processes (e.g. MPI). These processes execute across multiple CPU cores and/or nodes.
  • Multithreaded programs that include explicit support for shared memory processing via multiple threads of execution (e.g. Posix Threads or OpenMP) running across multiple CPU cores.
  • Embarrassingly parallel analysis in which multiple instances of the same program execute on multiple data files simultaneously, with each instance running independently from others on its own allocated resources (i.e. CPUs and memory). SLURM job arrays offer a simple mechanism for achieving this.
  • GPU (graphics processing unit) programs including explicit support for offloading to the device via languages like CUDA or OpenCL.

It is important to understand the capabilities and limitations of an application in order to fully leverage the parallel processing options available on the ACCRE cluster. For instance, many popular scientific computing languages like Python, R, and Matlab now offer packages that allow for GPU or multithreaded processing, especially for matrix and vector operations.
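
To see which compilers, MPI implementations, and other packages are available for these purposes, Lmod can be searched from the command line. A brief sketch (the package names here are just examples):

module avail              # list software currently visible through Lmod
module spider OpenMPI     # search all toolchains for a particular package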

MPI Jobs

Jobs running MPI (Message Passing Interface) code require special attention within SLURM. SLURM allocates and launches MPI jobs differently depending on the version of MPI used (e.g. OpenMPI or Intel MPI). We recommend using the most recent version of OpenMPI or Intel MPI available through Lmod to compile code and then using SLURM’s srun command to launch parallel MPI jobs. The example below runs MPI code compiled with GCC + OpenMPI:

#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8     # 8 MPI processes per node
#SBATCH --time=7-00:00:00
#SBATCH --mem=4G     # 4 GB RAM per node
#SBATCH --output=mpi_job_slurm.log
module load GCC OpenMPI
echo $SLURM_JOB_NODELIST
srun ./test  # srun is SLURM's version of mpirun/mpiexec

This example requests 3 nodes and 8 tasks (i.e. processes) per node, for a total of 24 MPI tasks. By default, SLURM allocates 1 CPU core per process, so this job will run across 24 CPU cores. Note that srun accepts many of the same arguments as mpirun/mpiexec (e.g. -n <number of tasks>) but also allows increased flexibility for task affinity, memory, and many other features. Type man srun for a list of options.
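
Note that the ./test executable in this script must be an MPI program compiled against the same MPI module that the job loads. A minimal sketch of compiling such a program, assuming the source file is named test.c (a hypothetical name):

# Load the same compiler and MPI implementation used in the job script
module load GCC OpenMPI
# Compile with the MPI compiler wrapper provided by OpenMPI
mpicc -O2 -o test test.c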

More information about running MPI jobs within SLURM can be found here: http://slurm.schedmd.com/mpi_guide.html

Feel free to open a help desk ticket if you require assistance with your MPI job.

srun

This command is used to launch a parallel job step. Typically, srun is invoked from a SLURM job script to launch an MPI job (much in the same way that mpirun or mpiexec are used). Please note that your application must include MPI code in order to run in parallel across multiple CPU cores using srun. Invoking srun on a non-MPI command or executable simply launches an independent copy of that program for each task in the allocation, rather than a single coordinated parallel run.

Alternatively, srun can be run directly from the command line on a gateway, in which case srun will first create a resource allocation for running the parallel job. The -n [CPU_CORES] option is passed to specify the number of CPU cores for launching the parallel job step. For example, running the following command from the command line will obtain an allocation consisting of 16 CPU cores and then run the command hostname across these cores:

srun -n 16 hostname
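
Since hostname is not an MPI program, each of the 16 tasks simply runs its own copy of the command. As noted above, srun also exposes options for memory and task placement; an illustrative sketch (the values are arbitrary, and option names can vary slightly between SLURM versions):

# 16 tasks, 1 GB of memory per CPU, each task bound to a core
srun -n 16 --mem-per-cpu=1G --cpu-bind=cores hostname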

For more information about srun see: http://www.schedmd.com/slurmdocs/srun.html

Multithreaded Jobs

Multithreaded programs are applications that are able to execute in parallel across multiple CPU cores within a single node using a shared memory execution model. In general, a multithreaded application uses a single process (i.e. “task” in SLURM) which then spawns multiple threads of execution. By default, SLURM allocates 1 CPU core per task. In order to make use of multiple CPU cores in a multithreaded program, one must include the --cpus-per-task option. Below is an example of a multithreaded program requesting 4 CPU cores per task. The program itself is responsible for spawning the appropriate number of threads.

#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --nodes=1 
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=4 # 4 threads per task 
#SBATCH --time=02:00:00 # two hours 
#SBATCH --mem=4G 
#SBATCH --output=multithread.out 
#SBATCH --job-name=multithreaded_example 
# Load the most recent version of GCC available through Lmod
module load GCC  
# Run multi-threaded application
./hello
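
Many multithreaded applications (e.g. programs built with OpenMP) read their thread count from an environment variable rather than detecting it automatically. If that applies to your program, a common pattern is to set the thread count from SLURM's allocation before launching it. A sketch, assuming ./hello honors OMP_NUM_THREADS:

# Match the number of threads to the CPU cores allocated by SLURM
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./hello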

Job Arrays

Job arrays are useful for submitting and managing a large number of similar jobs. As an example, job arrays are convenient if a user wishes to run the same analysis on 100 different files. SLURM provides job array environment variables that allow multiple versions of input files to be easily referenced. In the example below, three input files called vectorization_0.py, vectorization_1.py, and vectorization_2.py are used as input for three independent Python jobs:

#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --time=2:00:00
#SBATCH --mem=2G
#SBATCH --array=0-2
#SBATCH --output=python_array_job_slurm_%A_%a.out

echo "SLURM_JOBID: " $SLURM_JOBID
echo "SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
echo "SLURM_ARRAY_JOB_ID: " $SLURM_ARRAY_JOB_ID

# Load Anaconda distribution of Python
module load Anaconda2
python vectorization_${SLURM_ARRAY_TASK_ID}.py

The #SBATCH --array=0-2 line specifies the array size (3) and array indices (0, 1, and 2). These indices are referenced through the SLURM_ARRAY_TASK_ID environment variable in the final line of the SLURM batch script to independently analyze the three input files. Each Python instance will receive its own resource allocation; in this case, each instance is allocated 1 CPU core (and 1 node), 2 hours of wall time, and 2 GB of RAM.

It is important to remember that the resources specified in the batch job script header are not shared across all the jobs generated by the array. Instead, they represent the resources allocated to each individual job in the array. The only limits on the number of array jobs that can run at any one time are the user’s group bursting limits and the amount of available resources on the cluster. To avoid consuming all of the resources available to a group, the % operator can be used in the --array= option to set the maximum number of jobs from the array allowed to run simultaneously. For example, with --array=0-100%4 SLURM will not allow more than four jobs in the array to run concurrently.
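
The throttle can also be changed after the array has been submitted. For example, the following scontrol command should raise the limit on an already-running array (the job ID shown is hypothetical):

# Allow up to 8 tasks of array job 1234567 to run at once
scontrol update JobId=1234567 ArrayTaskThrottle=8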

The --array= option is flexible in terms of the index range and stride length. For instance, --array=0-10:2 would give indices of 0, 2, 4, 6, 8, and 10.
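
For reference, a few common forms of the option (all standard SLURM syntax):

#SBATCH --array=0-2         # indices 0, 1, 2
#SBATCH --array=0-10:2      # indices 0, 2, 4, 6, 8, 10 (stride of 2)
#SBATCH --array=1,3,7       # an explicit comma-separated list of indices
#SBATCH --array=0-100%4     # indices 0-100, at most 4 running at once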

The %A and %a patterns provide a method for directing standard output to separate files. %A references the SLURM_ARRAY_JOB_ID while %a references the SLURM_ARRAY_TASK_ID. SLURM treats job ID information for job arrays in the following way: each task within the array has the same SLURM_ARRAY_JOB_ID and its own unique SLURM_JOBID and SLURM_ARRAY_TASK_ID. The JOBID shown by squeue is formatted as the SLURM_ARRAY_JOB_ID followed by an underscore and the SLURM_ARRAY_TASK_ID.
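
As an illustration, suppose the array job above were assigned a (hypothetical) SLURM_ARRAY_JOB_ID of 1234567. The %A_%a pattern in the --output option would then produce one log file per array task, and squeue would list each task in the same ID_INDEX form:

# Output files produced by the example above (job ID is hypothetical):
#   python_array_job_slurm_1234567_0.out
#   python_array_job_slurm_1234567_1.out
#   python_array_job_slurm_1234567_2.out
# Check on the array tasks; the JOBID column shows entries like 1234567_0
squeue -u $USER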

While the previous example provides a relatively simple method for running analyses in parallel, it can at times be inconvenient to rename files so that they may be easily indexed from within a job array. The following example provides a method for analyzing files with arbitrary file names, provided they are all stored in a sub-directory named data:


#!/bin/bash
#SBATCH --mail-user=myemail@vanderbilt.edu
#SBATCH --mail-type=ALL
#SBATCH --ntasks=1
#SBATCH --time=2:00:00
#SBATCH --mem=2G
#SBATCH --array=1-5   # In this example we have 5 files to analyze
#SBATCH --output=python_array_job_slurm_%A_%a.out
# Select the line of `ls data/` output whose line number matches this task's array index
arrayfile=$(ls data/ | awk -v line=$SLURM_ARRAY_TASK_ID '{if (NR == line) print $0}')
module load Anaconda2 # load Anaconda distribution of Python
python data/$arrayfile
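
Note that this approach relies on the output of ls being identical for every task in the array. An equivalent sketch that uses a bash array and shell globbing instead of parsing ls (whatever files happen to be in data/ are picked up in sorted order):

# Collect all files in data/ into a bash array (glob expansion is sorted)
files=(data/*)
# Array task IDs 1-5 map to bash array elements 0-4
python "${files[$SLURM_ARRAY_TASK_ID - 1]}"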

More information can be found here: http://slurm.schedmd.com/job_array.html