GPUs at ACCRE

From ACCRE Wiki

GPU types

All GPU types are placed into a single batch_gpu partition for workloads where bursting and high throughput are desired, or into an interactive_gpu partition for workloads where immediate access is desired at the cost of the ability to burst. By default, all GPU nodes are placed in the batch_gpu partition.

Accessing GPU Resources in the batch_gpu Partition

GPU resources in the batch_gpu partition are accessed using your group name with the _acc suffix: for example, if your group is accre_lab, then the slurm account to use is accre_lab_acc. To see which accounts with the _acc suffix you have access to, run the slurm_resources command:

[bob@gw344 accre]$ slurm_resources 

Accounts to use for accessing the batch (default) partition:

Account
------------------
accre
fe_accre_lab
accre_guests

Accounts and GPU types for accessing the batch_gpu partition:

Accounts           GPU Type
------------------ ----------------------------
fe_accre_lab_acc   nvidia_geforce_rtx_2080_ti
fe_accre_lab_acc   nvidia_rtx_a6000
fe_accre_lab_acc   nvidia_titan_x
accre_guests_acc   nvidia_geforce_rtx_2080_ti
accre_guests_acc   nvidia_rtx_a6000

You have access to accelerated GPU resources in the batch_gpu partition.
As a usage example, if you wanted to request 2 GPUs of type "nvidia_rtx_a6000"
for a job with account "accre_guests_acc" on the partition "batch_gpu",
then you would add the following lines to your SLURM script:

#SBATCH --account=accre_guests_acc
#SBATCH --partition=batch_gpu
#SBATCH --gres=gpu:nvidia_rtx_a6000:2

Note that you may not request more GPUs than your account is allowed.

The partition must be set to batch_gpu. This partition includes all GPU resources intended for batch usage (optimized for throughput rather than latency, allowing bursting). You must also select your specific GRES GPU type; instead of a generic request such as #SBATCH --gres=gpu:4 for 4 GPUs, specify the type, for example:

#SBATCH --gres=gpu:nvidia_rtx_a6000:4

A job submission that does not specify the GPU type will be rejected.
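To avoid a rejected submission, the required directive can be sanity-checked before you write your script. The sketch below is a local illustration only (not an ACCRE tool): it builds a typed --gres line and refuses an untyped request, mirroring the scheduler's rule.

```python
# Illustrative helper (not an ACCRE utility): build a typed GRES directive.
def gres_directive(gpu_type: str, count: int) -> str:
    """Return an #SBATCH --gres line, requiring an explicit GPU type."""
    if not gpu_type:
        # Generic gpu:N requests (no type) are rejected by the batch_gpu partition.
        raise ValueError("a specific GPU type is required")
    if count < 1:
        raise ValueError("GPU count must be at least 1")
    return f"#SBATCH --gres=gpu:{gpu_type}:{count}"

print(gres_directive("nvidia_rtx_a6000", 2))
# Prints: #SBATCH --gres=gpu:nvidia_rtx_a6000:2
```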


GRES GPU types

GPU Type                Architecture  GPU Memory (GB)  GRES Type                   CPU Cores per GPU  Available System Memory per GPU (GB)
----------------------  ------------  ---------------  --------------------------  -----------------  -------------------------------------
Nvidia H100 NVL         Hopper        94               nvidia_h100_nvl             32                 561
Nvidia L40S             Ada Lovelace  48               nvidia_l40s                 8                  187
Nvidia A100 (80GB)      Ampere        80               nvidia_a100_80gb            32                 374
Nvidia RTX A6000        Ampere        48               nvidia_rtx_a6000            16                 124
Nvidia RTX A4000        Ampere        16               nvidia_rtx_a4000            4                  30
Nvidia Quadro RTX 6000  Turing        24               quadro_rtx_6000             6                  92
Nvidia RTX 2080 Ti      Turing        11               nvidia_geforce_rtx_2080_ti  6                  92
Nvidia Titan X (or Xp)  Pascal        12               nvidia_titan_x              2                  60
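When deciding which GPU type to request, the memory column above is usually the deciding factor. The sketch below (a hypothetical helper, with values copied from the table) picks the smallest GPU type whose memory still fits a requirement:

```python
# GRES type -> GPU memory in GB, copied from the table above.
GPU_MEMORY_GB = {
    "nvidia_h100_nvl": 94,
    "nvidia_l40s": 48,
    "nvidia_a100_80gb": 80,
    "nvidia_rtx_a6000": 48,
    "nvidia_rtx_a4000": 16,
    "quadro_rtx_6000": 24,
    "nvidia_geforce_rtx_2080_ti": 11,
    "nvidia_titan_x": 12,
}

def smallest_fitting_gpu(required_gb):
    """Return the GRES type with the least memory that still fits required_gb."""
    candidates = [(mem, name) for name, mem in GPU_MEMORY_GB.items() if mem >= required_gb]
    return min(candidates)[1] if candidates else None

print(smallest_fitting_gpu(40))  # smallest card with at least 40 GB
```

Whether a given type is actually available to you still depends on your group's accounts, as reported by slurm_resources.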


Here is an example script for a user in the group accre_lab, which has access to A6000 GPUs in the batch_gpu partition, requesting 3 A6000 GPUs:

#!/bin/bash
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=00:30:00
#SBATCH --job-name=r9_gpu_test
#SBATCH --account=accre_lab_acc
#SBATCH --gres=gpu:nvidia_rtx_a6000:3
#SBATCH --partition=batch_gpu

setup_accre_software_stack
module load cuda/12.6 python/3.12.4 
./my_analysis_code.py

Accessing GPU Resources in the interactive_gpu Partition

GPU resources in the interactive_gpu partition are accessed using your group name with the _iacc suffix: for example, if your group is accre_lab, then the slurm account to use is accre_lab_iacc. To see which accounts with the _iacc suffix you have access to, run the slurm_resources command:

[bob@gw344 accre]$ slurm_resources 

...snip...

Accounts and QOSs for accessing the interactive gpu partition:

Account             QOS                  CPU limit  Memory Limit (GiB) GPU Type            GPU Limit
------------------- ------------------- ---------- ------------------- ------------------ ----------
accre_iacc          debug_iacc                  24               372.0 nvidia_rtx_a4000            4
accre_iacc          fe_accre_lab_iacc            8                60.0 nvidia_rtx_a4000            1
accre_iacc          fe_accre_lab_iacc            8                60.0 nvidia_titan_x              1


Use the "qosstate [QOS_NAME]" command to get current usage for
the QOS specified by QOS_NAME with a breakdown of usage by user
and slurm account.

Note that the "debug_iacc" QOS is a special QOS available to all
cluster users for quick debugging of slurm scripts requiring a GPU.
On this special QOS there are additional restrictions of a maximum
wall clock time of 00:30:00 and each user can use a maximum of
cpu=6,gres/gpu:nvidia_rtx_a4000=1,gres/gpu=1,mem=93G at one time.
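A request can be checked against the debug_iacc per-user cap before submitting. Below is a minimal local sketch, assuming the limits quoted above (6 CPUs, 1 GPU, 93G of memory); it is an illustration, not an ACCRE command:

```python
# Per-user limits on the debug_iacc QOS, as quoted above (illustrative check only).
DEBUG_IACC_LIMITS = {"cpu": 6, "gpu": 1, "mem_gb": 93}

def fits_debug_iacc(cpu, gpu, mem_gb):
    """Return True if a request fits within the per-user debug_iacc limits."""
    return (cpu <= DEBUG_IACC_LIMITS["cpu"]
            and gpu <= DEBUG_IACC_LIMITS["gpu"]
            and mem_gb <= DEBUG_IACC_LIMITS["mem_gb"])

print(fits_debug_iacc(cpu=2, gpu=1, mem_gb=8))  # True
```

Remember the separate wall-clock restriction of 00:30:00, which this check does not cover.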

As an example, to submit a job to the interactive gpu partition using
the account "accre_iacc" and QOS "debug_iacc" you would add the following
lines to your slurm script:

#SBATCH --account=accre_iacc
#SBATCH --partition=interactive_gpu
#SBATCH --qos=debug_iacc
#SBATCH --gres=gpu:nvidia_rtx_a4000:1

Example of Running a GPU job

For a simple GPU job, this example uses PyTorch provided by the Digital Research Alliance of Canada software stack and an Nvidia RTX A4000 GPU, to which all users have short (30 minute) access via the debug_iacc slurm QoS. The example python script, which trains a sequential neural network, has been adapted from the example code at Vanderbilt University's Machine Learning Training Facility.

First create the python training script as a file named train_neural_network.py with the following contents:

"""
This example trains a sequential neural network

Adapted from example code from Vanderbilt University's
Machine Learning Training Facility

https://docs.mltf.vu
"""
 
import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class SeqNet(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2,  output_size):
        super(SeqNet, self).__init__()

        self.lin1 = nn.Linear(input_size, hidden_size1)
        self.lin2 = nn.Linear(hidden_size1, hidden_size2)
        self.lin3 = nn.Linear(hidden_size2, output_size)


    def forward(self, x):
        x = torch.flatten(x,1)
        x = self.lin1(x)
        x = F.sigmoid(x)
        x = self.lin2(x)
        x = F.log_softmax(x, dim=1)
        out = self.lin3(x)
        return out

def train(model, train_loader, loss_function, optimizer, num_epochs):

    # Transfer model to device
    model.to(device)

    for epoch in range(num_epochs):

        running_loss = 0.0
        model.train()

        for i ,(images,labels) in enumerate(train_loader):
            images = torch.div(images, 255.)

            # Transfer data tensors to device
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = loss_function(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        average_loss = running_loss / len(train_loader)

        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {average_loss:.4f}')

    print("Training finished.")


input_size = 784
hidden_size1 = 200
hidden_size2 = 200
output_size = 10
num_epochs = 10
batch_size = 100
lr = 0.01


device = torch.device("cuda")
print("Training on device: ", torch.cuda.get_device_name(device))
my_net = SeqNet(input_size, hidden_size1, hidden_size2, output_size)
my_net = my_net.to(device)


optimizer = torch.optim.Adam( my_net.parameters(), lr=lr)
loss_function = nn.CrossEntropyLoss()

fmnist_train = datasets.FashionMNIST(root="data", train=True, download=True, transform=ToTensor())
fmnist_test = datasets.FashionMNIST(root="data", train=False, download=True, transform=ToTensor())

fmnist_train_loader = DataLoader(fmnist_train, batch_size=batch_size, shuffle=True)
fmnist_test_loader = DataLoader(fmnist_test, batch_size=batch_size, shuffle=True)

train(my_net, fmnist_train_loader, loss_function, optimizer, num_epochs)

correct = 0
total = 0
my_net.eval()
with torch.no_grad():
    for images, labels in fmnist_test_loader:
        images = torch.div(images, 255.)
        images = images.to(device)
        labels = labels.to(device)
        output = my_net(images)
        _, predicted = torch.max(output, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

print('Accuracy of the model: %.3f %%' % (100 * correct / total))


Now create a file named train_neural_network.slurm with the following contents:

#!/bin/bash
#SBATCH --account=accre_iacc # Change to your group with an "_iacc" suffix
#SBATCH --partition=interactive_gpu
#SBATCH --qos=debug_iacc
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:nvidia_rtx_a4000:1

# Setup python interpreter from the CC software stack
# Note that PyTorch has its own CUDA libraries so we
# do not need to load cuda/12.6
echo -e "Setting up virtual environment for training model...\n\n"
setup_accre_software_stack
ml python/3.12.4 scipy-stack/2025a

# Create a temporary directory on the local storage
# of the compute node and set up a python virtual environment
# and install PyTorch in the virtual environment
source setup_accre_runtime_dir
python -m venv ${ACCRE_RUNTIME_DIR}/venv
source ${ACCRE_RUNTIME_DIR}/venv/bin/activate
pip install torch torchvision

# Run the training script
echo -e "\n\nRunning the training script...\n\n"
python train_neural_network.py

You will need to change the line #SBATCH --account=... to an account that you have access to. Use the slurm_resources command to find a valid slurm account ending in "_iacc" for which you have access to the "debug_iacc" QoS. Note that all ACCRE users should have access to the "debug_iacc" QoS for debugging short jobs on a select set of GPU resources. If desired, you can change the QoS and GPU type to a different model if you are part of a group with access to those resources.
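If you want to pull candidate accounts out of the slurm_resources output programmatically, a rough sketch follows; it assumes output formatted like the samples shown earlier on this page, with the account in the first column:

```python
# Extract slurm account names ending in "_iacc" from slurm_resources-style
# output (first whitespace-separated column, format as shown earlier on this page).
def iacc_accounts(text):
    """Return unique first-column tokens ending in _iacc, in order of appearance."""
    seen = []
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].endswith("_iacc") and fields[0] not in seen:
            seen.append(fields[0])
    return seen

sample = """
Account             QOS
------------------- -------------------
accre_iacc          debug_iacc
accre_iacc          fe_accre_lab_iacc
"""
print(iacc_accounts(sample))
# Prints: ['accre_iacc']
```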

Finally, submit your slurm script with the command sbatch train_neural_network.slurm. The slurm scheduler will return a JobID number. You can check the status of your job with the command squeue -j [JOBID] where JOBID is your job id number. If your job does not run quickly it may be that other users are using all available GPUs for the debug QoS. To check the status of this QoS, use the command qosstate debug_iacc.

When your job is finished, you will see a file of the form slurm-NNN.out where NNN is the Job ID number of your job. The standard output from the model training will be written to this file.
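Since the output filename follows the slurm-NNN.out pattern described above, the file can be located from the job ID alone. A small sketch (the helper name is hypothetical):

```python
# Read the standard output of a finished job, using the slurm-<jobid>.out
# naming convention described above.
from pathlib import Path

def job_output(jobid, directory="."):
    """Return the contents of slurm-<jobid>.out from the given directory."""
    return (Path(directory) / f"slurm-{jobid}.out").read_text()
```

For example, job_output(1234567) would return the text of slurm-1234567.out in the current working directory, including the per-epoch loss lines printed by the training script.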