Building software on the ACCRE cluster

From ACCRE Wiki

Compute nodes on the ACCRE cluster are heterogeneous in terms of CPU architecture (and also RAM and local disk space). Some compute nodes contain processors that are 4-5 years old, while others use processors that are less than a year old. The same goes for cluster gateways.

This presents some challenges when it comes to building software, as newer processors can use instructions that are unsupported by older processors. As a result, programs that are built from source code on a newer gateway may not run successfully on a compute node or gateway with an older processor. “Illegal Instruction” error messages are likely to occur in this scenario. This error is typical when running locally built R libraries.
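To see which vector instruction sets a particular gateway or compute node supports, you can inspect the CPU flags that Linux reports. The following is a minimal sketch, assuming a Linux node that exposes /proc/cpuinfo; the function name list_simd_flags and the choice of flags to report are illustrative only. For reference, Sandy Bridge processors report avx but not avx2, and Haswell adds avx2.

```shell
#!/bin/bash
# List which of a few SIMD-related flags the CPU advertises.
# The optional argument names a cpuinfo-format file (default: /proc/cpuinfo),
# which makes it possible to inspect a copy saved from another node.
list_simd_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "flags" line, put one flag per line, and keep only
    # the flags of interest (whole-line match, so "avx" does not match "avx2").
    grep -m1 '^flags' "$cpuinfo" \
        | tr ' ' '\n' \
        | grep -x -E 'sse4_2|avx|avx2|avx512f' \
        | LC_ALL=C sort
}

# Report the flags of the node this runs on, if cpuinfo is available.
if [ -r /proc/cpuinfo ]; then
    list_simd_flags
fi
```

If a program was built on a node that lists avx2 and is then run on a node that does not, an “Illegal Instruction” error is the likely result.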

To avoid this issue, build your software on the public login gateways, which typically range from gw341 to gw346. These gateways are all Sandy Bridge nodes, which makes them nearly the oldest machines at ACCRE (only the Westmere nodes are older), so a binary compiled on a login gateway can usually run on newer compute nodes such as Haswell and Skylake. In addition, jobs that run custom-compiled software need to exclude the AMD CPUs, so we suggest adding the following line to your Slurm script:

#SBATCH --constraint=sandybridge|haswell|skylake

This line restricts the job to nodes that have at least one of the listed features, which keeps it off the AMD nodes and the older Westmere nodes. A binary built on a login gateway should then run without “Illegal Instruction” errors.
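For context, here is a sketch of how the constraint line might sit in a complete batch script. The file name constrained_job.slurm, the job name, the resource requests, and ./my_program are placeholders rather than ACCRE defaults; adjust them to your own job.

```shell
#!/bin/bash
# Write a minimal batch script to a file. The quoted 'EOF' keeps the shell
# from expanding anything inside the script while it is being written.
cat > constrained_job.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=custom-build
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --constraint=sandybridge|haswell|skylake

# Record where the job ran, then start the locally built program.
echo "Running on $(uname -n)"
./my_program
EOF

# Submit it from a gateway (commented out so this snippet has no side effects):
# sbatch constrained_job.slurm
#
# The constraint can also be given on the command line. Quote it there,
# otherwise the shell treats "|" as a pipe:
# sbatch --constraint="sandybridge|haswell|skylake" constrained_job.slurm
```

Inside the script no quoting is needed, because the shell treats #SBATCH lines as comments and only Slurm reads them.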

With the constraint above, each job may be assigned any node type in the given set, so the subjobs of an array job may land on nodes with different CPU architectures. If you want all subjobs of an array job to use the same CPU architecture, add this line to the Slurm script instead:

#SBATCH --constraint=[sandybridge|haswell|skylake]

With the square brackets, only one of the listed options is used for all allocated nodes.
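One way to check where the subjobs of an array job actually ran is to have each subjob log its node and CPU model. The script below is a sketch under stated assumptions: the job name, array range, and time limit are arbitrary, the helper report_cpu is hypothetical, and the node is assumed to be a Linux machine that exposes /proc/cpuinfo.

```shell
#!/bin/bash
#SBATCH --job-name=arch-check
#SBATCH --array=1-4
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --constraint=[sandybridge|haswell|skylake]

# Print one line identifying the subjob, the node, and the node's CPU model.
# The optional argument names a cpuinfo-format file (default: /proc/cpuinfo).
report_cpu() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Keep the text after the colon on the first "model name" line.
    model=$(grep -m1 '^model name' "$cpuinfo" 2>/dev/null \
        | cut -d: -f2- \
        | sed 's/^[[:space:]]*//')
    echo "subjob ${SLURM_ARRAY_TASK_ID:-n/a} on $(uname -n): ${model:-unknown CPU}"
}

report_cpu
```

Slurm sets SLURM_ARRAY_TASK_ID for each subjob of an array job, so comparing the line logged by each subjob shows whether they landed on the same CPU architecture.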

If performance is very important to you (especially through vector instructions such as AVX2), you can try building on a more recent architecture (e.g. Haswell supports AVX2). Be sure to request only nodes with Haswell or Skylake processors for jobs that use this program. Note that this may lead to longer queue times, as you are effectively shrinking the pool of resources eligible to run your job.
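The pairing between the architecture you build for and the constraint you request can be captured in a small helper. This is only a sketch: the function constraint_for_build is hypothetical, it covers just the three Intel feature names used in this article, and the GCC command in the comments assumes a GCC version that accepts -march=haswell and a source file named my_program.c.

```shell
#!/bin/bash
# Map the architecture a program was built for to a Slurm constraint that
# admits that architecture and the newer ones named in this article.
constraint_for_build() {
    case "$1" in
        sandybridge) echo "sandybridge|haswell|skylake" ;;
        haswell)     echo "haswell|skylake" ;;
        skylake)     echo "skylake" ;;
        *)           echo "unknown architecture: $1" >&2; return 1 ;;
    esac
}

# Example workflow (commented out; my_program.c and job.slurm are placeholders):
#   gcc -O2 -march=haswell -o my_program my_program.c   # build with AVX2 enabled
#   sbatch --constraint="$(constraint_for_build haswell)" job.slurm

# Show the constraint to use for a Haswell build.
constraint_for_build haswell
```

The narrower the constraint, the fewer nodes can run the job, which is the source of the longer queue times mentioned above.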