Technical Details

From ACCRE Wiki

This page describes in detail the components and operation of the compute cluster at ACCRE. For researchers requiring text for grant proposals and publications, we also provide a less technical Summary of the ACCRE Facility.

The compute cluster currently consists of more than 600 Linux systems with eight to sixteen processor cores each. Each node has at least 24 GB of memory (most have at least 128 GB) and dual Gigabit copper ethernet ports.

By the numbers

Last updated January 2024.
Number Description
3,100 Number of researchers
50 Number of campus departments and centers
5 Number of schools
737 Number of compute nodes
~15,960 Total number of cores
24-1028GB Memory available per node
~4.8PB Disk space available through shared cluster storage
22PB Disk space available through LStore

Details of the Cluster Design

The design reflects significant input from the investigators who use it in their research and education programs. Factors such as the number of nodes, the amount of memory per node, and the amount of disk space for storing user data is determined by demand. A schematic diagram of the system is shown below followed by a glossary of terms used in a subsequent description of the cluster:

ACCRE Cluster Diagram
ACCRE Cluster Diagram


Bandwidth
The rate at which data can be transferred over the network. Typically this is measured in Megabits/sec (Mbps) or Gigabits/sec (Gbps).
Compute Node
A node whose purpose is to run user applications. Users will not have direct access to these machines. Access will be granted solely through the job scheduler.
Disk Server
A machine that is used for data storage. Users will not normally have access to these machines. For more information see the description of GPFS below.
Gateway or Management Node
Computer designed for interactive login use. Users log in to these machines for compiling, editing, and debugging of their programs and to submit jobs.
Gigabit Ethernet
Commodity networking typically found in servers and has a bandwidth of 1 Gbps and latency of up to 150 microseconds in Linux. Similarly, 10 gigabit ethernet is high performance networking with a bandwidth of 10 Gbps.
Latency
The amount of time required to “package” data for sending over the network.

Cluster Connectivity

The ACCRE cluster has a flexible network topology which can accommodate different users and their needs while leaving room for expansion. There are two separate functional networks: (1) external network for connectivity to the outside world, and (2) a private internal cluster network.

The gateways are connected to the Internet through the external network comprised of gigabit ethernet links to one of our core 10 gigabit or 100 gigabit switches. Vanderbilt University has a 100 gigabit connection to the Internet via ACCRE. The internal cluster network is used for both file system traffic (ACCRE GPFS, LStore,) and compute job network traffic. Regular compute nodes each have a gigabit connection to the internal cluster network and only that network for security though they can initial bidirection connectivity outside ACCRE via two 10 gigabit NAT systems. Some nodes, such as Big Data nodes, have 10 or 25 gigabit internal network connections. The internal cluster network has a core with spine-leaf architecture composed of many 40 and 100 gigabit links for bandwidth scalability and fault tolerance plus classical fat tree design outside the core so nodes near each other do not compete for bandwidth. Only a constant incremental cost per machine is incurred for each additional machine.

Node installation, maintenance updates, and health monitoring

Initial OS installation is accomplished using Kickstart and CFEngine. We make an image of the configured and operational system, subsequently replicating this image across all nodes. Multiple images for different compute node configurations are stored on an infrastructure node. By using this technique it becomes possible to perform a wipe and re-install of the entire cluster in a short period of time. Similarly, Kickstart and CFEngine are also used to update a node, transferring only the data needed for the update instead of the entire image. Because of this intelligent update, most common maintenance operations, such as updating the kernel, take mere minutes.

The health of the compute nodes are monitored through the use of the open source package Nagios which supports distributed monitoring. Distributed monitoring allows data from individual management nodes to be collected on a single node. Nagios uses a client-server architecture and is written in the C programming language.

Compute cluster

Here below are the compute nodes in the GPFS cluster:

CPU Architecture Nodes Number Cores per Node Available Slurm Memory
sandybridge 133 12 120GB
2 12 246GB
2 12 89GB
36 12 57GB
haswell 43 12 160GB
118 16 160GB
52 16 246GB
skylake 48 24 120GB
48 16 246GB
8 24 120GB
cascadelake 55 32 182GB
zen 56 32 182GB
12 128 750GB
2 32 246GB
1 32 372GB
1 64 498GB
1 64 372GB
3 64 1002GB

For the GPU nodes information please see this link: GPUs at ACCRE

Cluster File System

One of the fundamental challenges of building a Linux cluster is to make sure that data is available when needed to any CPU in any node in the cluster at any time. Having a cluster with the very latest CPUs from AMD, Intel, or IBM means very little if those CPUs spend much of their time idle waiting for data to process. There is even a joke which says, “A supercomputer is a device for turning a compute bound problem into an I/O bound problem!”

In small clusters data is typically made available to the nodes via a Network File System (NFS) server which stores all of the user data and exports it to the entire cluster. Applications can then access data as if it were local. However, there are two significant disadvantages to this approach. First, the NFS server is a single point of failure. If the NFS server is unavailable for any reason then the entire cluster is unavailable for use and jobs which may have been running for weeks may be killed. This single point of failure can be eliminated by clustering two (or more) servers, but this can quickly become very costly and, more importantly, does not solve the second significant disadvantage of NFS: poor performance. NFS typically does not scale to much over 100 MB/second bandwidth. Because of this ceiling, performance may be adequate for small clusters of up to approximately 150 nodes; scaling to a greater number of nodes causes the limitations of NFS to quickly become apparent.

Because the ACCRE cluster began as a small (120 nodes) cluster, ACCRE initially took the NFS server approach. However, by the fall of 2004 both of the limitations of the NFS server approach had become apparent and a search for a more robust, higher performance solution was initiated. After a lengthy evaluation process, ACCRE selected IBM’s GPFS (General Parallel Filesystem) to replace NFS. GPFS was placed into production in August of 2005.

In a GPFS cluster two or more servers are set up as GPFS cluster and filesystem managers. In addition, two or more servers are set up as disk I/O servers called Network Shared Disk (NSD) servers. NSD servers are connected to one or more storage arrays in a redundant configuration. This is typically accomplished via a Storage Area Network (SAN). In a redundant SAN configuration all NSD servers and storage arrays are each connected into two or more SAN switches. This allows all NSD servers to be able to see the storage in all the storage arrays. In the event of a SAN switch failure GPFS will use a feature of the Linux operating system call multipathing to redirect I/O thru a different SAN switch. The disk drives in the storage arrays are configured into RAID devices which protects against the failure of a disk drive. Each RAID devices is then “seen” by the NSD servers as a “disk” which may be used in a filesystem. When a GPFS filesystem is created a primary NSD server and up to 7 backup servers are defined for each disk. In the event of a failure of the primary NSD server for a disk the first available backup server will take over for it transparently. This means that, if configured properly, GPFS filesystems have no single point of failure.

In a typical configuration a GPFS filesystem will contain multiple disks and the data will therefore be striped across those disks. This means that during normal operations I/O requests will involve multiple NSD servers and multiple disks working in parallel, hence the name General Parallel Filesystem. In addition, GPFS offers the ability to separate data from metadata. This prevents someone who is doing an “ls” on a directory containing a very large number of files from impacting the performance of someone else who is attempting to read the data contained in a single large file. As many ACCRE users do have large numbers of small files we have implemented this feature.

In a GPFS cluster the client machines (cluster gateways and compute nodes) all run a GPFS daemon which allows them to access the GPFS filesytems as if they were local … i.e. from a user perspective the fact that the files are not local to the node they are using is irrelevant.

ACCRE is currently on its’ second generation of GPFS hardware (see diagram). Each of the 9 NSD servers has a 10 gigabit ethernet network card. Therefore, the current setup is capable of sustaining 9 gigabytes per second of I/O. For comparison, a standard DVD is 4.7 gigabytes. The capacity of our GPFS filesystems can be increased if necessary by adding additional storage arrays. The performance of our GPFS filesystems can also be increased by adding additional NSD servers. Please note that both of these can be done without a downtime.

ACCRE also has implemented the standard cluster practice of having two filesystems, one for /home and one for /scratch. However, ACCRE also has implemented a /data filesystem. The /data filesystem is stored on the same physical disk hardware as the /home filesystem. However, there are two differences between /home and /data: 1) additional quota can be purchased for /data but not for /home and, 2) tape backups are retained for a minimum of one month on /data and two months on /home. Files stored in /scratch are not backed up to tape.

Before implementing the second generation of GPFS hardware ACCRE staff reviewed the possible alternatives to GPFS (Lustre, Panasas, BlueArc) again. While GPFS is not perfect we concluded that none of the other options offered any benefits over GPFS. In addition, several of them are significantly more expensive than GPFS. GPFS is a mature, stable product that is supported by IBM and used by many of the top supercomputers in the world.


Data Storage and Backup

The True incremental Backup System (TiBS) is used to backup the ACCRE cluster home directories nightly using a Spectra T950 tape library. A big advantage of TiBS is that it minimizes the time (and network resources) required for backups, even full backups. After the initial full backup, TiBS only takes incremental backups from the client. To create full backups, an incremental backup is taken from the client. Then, on the server side, all incremental since the last full backup are merged into the previous full backup to create a new full backup. This takes the load off the client machine and network. The integrity of the previous full backup is also verified. Please see our disk quotas and backups policies for more information. (TiBS is available for all current operating systems and apart from the cluster, ACCRE also offers backup services for data located remotely. This service is through special arrangement. If you are interested, please see our Tape Backup Services and contact us for more details.)

Resource Allocation

A central issue in sharing a resource, such as the cluster, is making sure that each group is able to receive their fairshare if they are regularly submitting jobs to the cluster, that groups do not interfere with the work of other groups, and that research at Vanderbilt University is optimized by not wasting compute cycles. Resource management, scheduling of jobs, and tracking usage is handled by SLURM.

SLURM supplies user functionality to submit jobs and to check system status. It is also responsible for starting and stopping jobs, collecting job output, and returning output to the user. SLURM allows users to specify attributes about the nodes required to run a given job, for example the CPU architecture.

SLURM is a flexible job scheduler designed to guarantee, on average, that each group or user has the use of the particular number of nodes they are entitled to. If there are competing jobs, processing time is allocated by calculating a priority based mainly on the “fairshare” mechanism of SLURM. On the other hand, if no jobs from other groups are in the queue it is possible for an individual user or group to use a significant portion of the cluster. This maximizes cluster usage while maintaining an equitable sharing. You can find more details about submitting jobs through SLURM in our Getting Started section or ACCRE’s own SLURM Documentation page.

For specific details about ACCRE resource allocation, please see more about ACCRE job scheduler parameters.

Installed Applications

The cluster offers GCC and Intel compilers, with support for multithreading and MPI libraries. We also compile a large number of widely used applications for areas of research spanning biology, bioinformatics, materials science, chemistry, astronomy, and so on. A comprehensive list of packages installed at ACCRE for general use is available here. ACCRE staff are also available to assist with the installation of additional tools. Please open a helpdesk ticket for such requests.