GPUs at ACCRE

GPU Partitions

Below are the current ACCRE GPU partitions:

Partition  Nodes  GPUs per node    Cores                        Host Memory  GPU Memory
A6000x4    21     4 (A6000)        128 (Intel Platinum 8358)    503GB        52GB
A6000x2    7      2 (A6000)        24 (Intel E5-2620 v3)        125GB        52GB
A4000x4    1      4 (A4000)        48 (Intel Silver 4214R)      376GB        16GB
A4000x8    2      8 (A4000)        32 (EPYC 7313)               250GB        16GB
Turing     21     4 (RTX 2080 Ti)  48 (Intel Gold 5118)         376GB        12GB
Pascal     22     4 (TITAN Xp)     8 (Intel E5-2623 v4)         250GB        14GB
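
If you want to check the live state of a partition yourself, SLURM's standard sinfo command can report node counts, generic resources (GPUs), cores, and memory. This is a generic SLURM sketch rather than an ACCRE-specific tool, and the output fields can be adjusted as needed:

# Show partition, node count, GPUs (gres), CPUs per node, and memory for the turing partition
sinfo -p turing -o "%P %D %G %c %m"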

The GPUs are hosted in condo mode: if a group purchased two GPU nodes in the A6000x4 partition, for example, that group can use two nodes (8 GPU cards) in this partition. We have very limited GPU resources for public testing purposes; if you would like to purchase GPU nodes and want to do some testing first, please open a ticket with us so we can add you to the testing group.

If your group has access to GPU resources, you need to know the account and corresponding partition to use when requesting the GPUs. ACCRE provides a command, slurm_groups, that will tell you whether GPU access is enabled for your login as well as which account and partition to use. For example:

[bob@gw344 accre]$ slurm_groups 

Accounts        Partitions  
--------------- ------------
accre           debug       
accre           devel       
accre           nogpfs      
accre           production  
accre_gpu_acc   a6000x2     
accre_gpu_acc   a6000x4     
accre_gpu_acc   maxwell     
accre_gpu_acc   pascal      
accre_gpu_acc   turing      

You have access to accelerated GPU resources.
As a usage example, if you wanted to request 2 GPUs for a job with
account "accre_gpu_acc" on the partition "turing",
then you would add the following lines to your SLURM script:

#SBATCH --account=accre_gpu_acc
#SBATCH --partition=turing
#SBATCH --gres=gpu:2

Running a GPU job

Users can request the desired number of GPUs by using SLURM generic resources, also called gres. Each gres bundles one GPU with multiple CPU cores (see table above) belonging to the same PCI Express root complex to minimize data transfer latency between host and GPU memory. The number of CPU cores requested cannot be higher than the sum of the cores in the requested gres.

Below is an example SLURM script header to request 2 Pascal GPUs and 4 CPU cores on a single node on the pascal partition:

#!/bin/bash
#SBATCH --account=accre_gpu_acc
#SBATCH --partition=pascal
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=20G
#SBATCH --time=2:00:00
#SBATCH --output=gpu-job.log
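
To confirm what an allocation actually provides, you could append a few standard commands to the body of the script after the #SBATCH header. This is a minimal sketch; nvidia-smi, nproc, and the CUDA_VISIBLE_DEVICES variable are generic NVIDIA/SLURM conventions rather than anything ACCRE-specific:

# Print the GPUs and CPU cores visible to the job
nvidia-smi
# CUDA_VISIBLE_DEVICES is typically set by SLURM for GPU jobs
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nproc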

Note that you must be in one of the GPU groups on the cluster and specify this group in the job script in order to run jobs on the GPU cluster. The #SBATCH --partition=<partition name> line is also required in the job script.

The CUDA library can be accessed through Lmod:

[bob@gpu0058 ~]$ ml spider CUDA

-----------------------------------------------------------------------------------------------------------------------------------------
  CUDA:
-----------------------------------------------------------------------------------------------------------------------------------------
    Description:
      CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and
      implemented by the graphics processing units (GPUs) that they produce. CUDA gives developers access to the virtual instruction set
      and memory of the parallel computational elements in CUDA GPUs.

     Versions:
        CUDA/8.0.61
        CUDA/9.0.176
        CUDA/10.1.105
        CUDA/11.1.1
        CUDA/11.7.0
        CUDA/11.8.0
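
As a minimal sketch of loading one of these versions (standard Lmod commands; on some systems a compiler module may need to be loaded first, which ml spider CUDA/11.8.0 will report):

# Show any prerequisites for a specific version, then load it and check the compiler
ml spider CUDA/11.8.0
ml CUDA/11.8.0
nvcc --version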

Built on top of CUDA, there are also a handful of GPU applications you can use directly, for example AlphaFold, PyTorch, etc.
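
To check whether one of these applications is installed as a module, the same spider search applies; the module names below are assumptions, and the exact names or capitalization on the cluster may differ:

ml spider PyTorch
ml spider AlphaFold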

All jobs making use of an MPI distribution on the GPU nodes should use SLURM’s srun command rather than mpirun/mpiexec. Please refer to the MPI section in Parallel Processing and Job Arrays for more details.
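
As an illustrative sketch, a job script for an MPI program on the GPU nodes could look like the following; the account, partition, resource numbers, and the ./my_mpi_app binary are placeholders to adapt to your own group and code:

#!/bin/bash
#SBATCH --account=accre_gpu_acc
#SBATCH --partition=turing
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --mem=20G
#SBATCH --time=1:00:00

# Launch one MPI rank per SLURM task with srun instead of mpirun/mpiexec
srun ./my_mpi_app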

For testing a pipeline on the GPU nodes, we recommend launching an interactive job on one of the GPU nodes via salloc, for example:

salloc --partition=a6000x4 --account=<group> --gres=gpu:1 --time=4:00:00 --mem=20G
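
Once the allocation is granted, commands launched with srun from the resulting shell run on the allocated GPU node; for example, to check the GPU you were assigned (a generic SLURM/NVIDIA example, not an ACCRE-specific workflow):

srun nvidia-smi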