Partitions

Introduction

Here we collect some information about our own cluster, which is maintained by volunteers within the group. Help is much appreciated, so feel free to ask how you can help out. This could be anything, e.g. writing documentation, keeping certain software packages up to date or maintaining a backup server.

Since the cluster’s architecture is heterogeneous and may undergo frequent upgrades, it is important that users remain up to date. Changes are usually communicated on the #Cluster stream on Zulip, so stay tuned. Given that the maintenance and the documentation of these resources are done on a volunteer basis, do not expect the same level of support as you would for a large-scale HPC resource.

For large compute jobs there are several external resources available, see Common External HPC, where you or your PI may already have or can get an allocation.

Hardware

Login nodes

The primary login node to access the group cluster is login.tcblab.org, and there is a backup node login2.tcblab.org in case the first one is down. As the login nodes are shared by many people and are the main access point to the cluster, do not run compute-intensive tasks here, i.e. avoid occupying a large number of threads or running memory-intensive tasks, and limit I/O. The login nodes can run 32 threads in parallel, so if you compile code, run for example make -j 8 (running plain make -j does not limit the number of processes and can cause very heavy load!). Also note that for this kind of work it is much better to use the local /scratch directory instead of nethome or CephFS. Just create a directory and work there.

$ mkdir /scratch/$USER
$ cd /scratch/$USER

NB: the /scratch directory is local to each machine, so if you need files from there available on compute nodes, you will have to copy them to nethome or CephFS.
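
The advice about limiting make -j on the login nodes can be turned into a small helper. This is only a sketch: polite_jobs is a hypothetical function name, and using a quarter of the logical CPUs is an illustrative choice, not a cluster policy. It assumes the nproc utility is available when no CPU count is given.

```shell
#!/bin/bash
# Hypothetical helper: suggest a make -j value of a quarter of the logical
# CPUs (at least 1). The "quarter" is an illustrative choice, not a policy.
polite_jobs() {
    # Use the first argument if given, otherwise ask nproc for the CPU count.
    local cpus=${1:-$(nproc)}
    local jobs=$(( cpus / 4 ))
    (( jobs < 1 )) && jobs=1
    echo "$jobs"
}

polite_jobs 32   # prints 8, the value used in the make -j 8 example
```

In a build directory under /scratch/$USER this could be used as make -j "$(polite_jobs)".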

Compute nodes

The compute nodes are where the workload manager runs your jobs. Depending on the requested resources they can be shared by a number of jobs, but each job is guaranteed (and confined to) the requested resources.

In contrast to the login nodes the compute nodes are only available on the internal network, but they can reach external resources through a NAT gateway. You can log in to a compute node using ssh from one of the login nodes if you have a job running there (for example to check resource usage). To get a detailed list of all nodes where your jobs can run use the command:

$ sinfo -N --long

Or to see an even more detailed list you can use the alias:

$ snodeinfo
NODELIST   PARTITIO       STATE CPU S:C:T    MEMORY GRES                 ACTIVE_FEATURES
[...]
gpu12      lindahl1    drained*  32 2:8:2    126976 gpu:2080:4           gpu,avx2,broadwell,turing
gpu13      lindahl1   allocated  32 2:8:2    126976 gpu:2080:4(S:0-1)    gpu,avx2,broadwell,turing
[...]

The different fields will be explained later on; for now let’s focus on CPU and S:C:T. The latter stands for “sockets”, “cores” and “threads”. The nodes shown above have two sockets, i.e. two physical CPUs on the same mainboard; each of these CPUs has 8 cores, and each core can run two threads. In total such a node can therefore run 2*8*2=32 threads in parallel, which is the number of logical CPUs (the CPU field) for the node.
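
The arithmetic above can be checked with a small shell helper. This is only a sketch: logical_cpus is a hypothetical function name, and it assumes the S:C:T value is passed as a colon-separated string exactly as printed by snodeinfo.

```shell
#!/bin/bash
# Hypothetical helper: multiply sockets, cores per socket and threads per core.
logical_cpus() {
    local sockets cores threads
    # Split a string such as "2:8:2" on the colons.
    IFS=: read -r sockets cores threads <<< "$1"
    echo $(( sockets * cores * threads ))
}

logical_cpus 2:8:2   # prints 32, matching the CPU column for gpu12 and gpu13
```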

Network

The network connecting all these nodes and the storage servers is an Ethernet network, and most of the compute and storage servers have a 50 Gbit/s network card. While this provides good bandwidth, it has a much higher latency than, for example, an InfiniBand network. This cluster is therefore not suitable for running jobs that require a lot of low-latency communication between nodes. For such jobs a cluster with an InfiniBand network is the preferred choice (see external HPC resources).

Accessing software via modules

Most compute clusters (including ours) offer the possibility of loading software into your environment via modules that are accessible to all users. Modules are particularly convenient when dealing with multiple versions of software that is used by a large share of the cluster users (e.g. C/C++ compilers, Python, OpenMPI, …).

The command:

$ module list

provides the list of modules that are currently loaded in the environment. To list all the modules available for loading run:

$ module avail

To only list the available versions of the “cuda” module for example run:

$ module avail cuda

To load a module run:

$ module load <module-name>

which can be shortened to ml <module-name>. You can either load a specific version, e.g.:

$ ml gcc/10.2
Loading gcc/10.2 with path /opt/tcbsys/gcc/10.2

or the default one:

$ ml gcc
Loading gcc/11.2 with path /opt/tcbsys/gcc/11.2

To unload just run:

$ module unload <module-name>

NB: module will prevent you from loading a different version of a module that is already loaded (and will print an error message). You have to unload the module from your environment and then load the other version (or use the module switch command). This is particularly important to keep in mind when loading modules inside Slurm job scripts, as shown in the next section.
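
The unload-then-load sequence can be wrapped in a small function. This is a minimal sketch, not part of the module system: load_version is a hypothetical helper name, and it assumes the module command is available in the shell that calls it.

```shell
#!/bin/bash
# Hypothetical helper: make sure exactly the requested version ends up loaded.
# Usage: load_version <name> <version>, e.g. load_version gcc 10.2
load_version() {
    local name=$1 version=$2
    # Drop whichever version is currently loaded; errors are ignored here.
    module unload "$name" 2>/dev/null || true
    # Load the requested version, e.g. gcc/10.2.
    module load "$name/$version"
}
```

In a job script, load_version gcc 10.2 would then replace the separate module unload and module load lines.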

Slurm workload manager

Introduction

We use the workload manager Slurm. If you’re familiar with other workload managers like SGE or LSF, you may find this cheat sheet useful: https://slurm.schedmd.com/rosetta.pdf.

If you’re not familiar with the concept, a workload manager takes care of scheduling the workloads on the available compute resources. So instead of having to look for an available compute node, and logging in there to start your job, you just submit your job to a queue, and the workload manager takes care of running it on a compute node that matches the requirements for the job.

Slurm has many default options, which make it rather user-friendly. That said, specifying the amount and type of resources in detail will result in jobs that are more efficient and less disruptive for other users.

Submitting jobs

Jobs are submitted to the queue by using the sbatch command:

$ sbatch SBATCH-SCRIPT-NAME.sh

If you’re not familiar with the manual pages in general, they are usually a good source of information. Just execute man sbatch and “RTFM” as someone in the group is sometimes quoted ;-).

Slurm is configured so that several jobs can run in parallel on a node. Each job is confined to the CPUs, memory and GPUs that are assigned to it. There are several partitions available, currently delemotte[1], lindahl[1-5], largemem, cryoem, cs-cpu and cs-gpu. Which ones you can submit jobs to depends on your group membership. You specify the partition with the -p flag, either on the command line or in your job script (see example below).

Be aware that the default partition is cryoem, i.e. if you don’t specify any partition Slurm will try to start your job there and automatically fail if you are not allowed to run there. So always specify the partition(s) you intend to use!

You can also specify several partition names, separated by commas, and the job will run wherever resources become available first. You’ll have to make sure that the time you request is reasonable for your job and not longer than the maximum execution time for the partition. To check the different parameters for each partition you can run:

$ scontrol show partitions
[...]
PartitionName=lindahl1
   AllowGroups=tcblab,tcbguests AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=08:00:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
[...]
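
To pull a single value out of this output, a short filter is enough. The following is only a sketch: partition_field is a hypothetical helper name, and the sample text is copied from the excerpt above rather than produced by a live scontrol call.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one KEY=VALUE field from
# "scontrol show partitions"-style text read on standard input.
partition_field() {
    local key=$1
    # Put every space-separated KEY=VALUE pair on its own line, then pick the key.
    tr -s ' ' '\n' | sed -n "s/^${key}=//p"
}

# Sample text copied from the output shown above.
sample='PartitionName=lindahl1
   AllowGroups=tcblab,tcbguests AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=08:00:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO'

printf '%s\n' "$sample" | partition_field DefaultTime   # prints 08:00:00
```

On the cluster, scontrol show partition lindahl1 | partition_field MaxTime should print the maximum execution time in the same way; MaxTime is a standard field of that output, although it is not part of the excerpt above.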

The main reason for the five lindahl partitions is the difference in CPU architecture and cores between the nodes (all support two threads per core): 

lindahl1: 2×8 cores (Intel Broadwell), 4× RTX 2080 GPUs (NVIDIA Turing)
lindahl2: 2×12 cores (Intel Skylake), 4× RTX 2080 GPUs (NVIDIA Turing)
lindahl3: 32 cores (AMD Naples), 4× A5000 GPUs (NVIDIA Ampere)
lindahl4: 2×24 cores (AMD Milan), 8× A5000 GPUs (NVIDIA Ampere) or 8× GeForce RTX 4070 Ti GPUs (NVIDIA Ada Lovelace)
lindahl5: 2×12 cores (Intel Broadwell), 4× NVIDIA GPUs, MAX_TIME 24h, 64 GB RAM

If you don’t care on which one your job runs, you may put all of them in your job script, like this:

#SBATCH -p lindahl1,lindahl2,lindahl3,lindahl4

You select GPUs using the following flag (short or long version):

-G, --gpus=[<type>:]<number>

which specifies the total number of GPUs required for the job.

The simplest form looks like this:

#SBATCH -G 1

An optional GPU type specification can be supplied, for example:

#SBATCH -G a5000:2

which requests two A5000 GPUs (currently on lindahl3 and lindahl4).

If you need even more fine-grained control on which nodes your job should run, there are also features defined on each node. To specify which features your job requires, you can use the following flags (again short or long form doesn’t matter):

-C, --constraint=<list>

So if for some reason you want to run your job on the one node that has the features *skylake* and *volta*, you would use this constraint:

#SBATCH -C skylake,volta

As mentioned in the Compute nodes section, the snodeinfo command provides information about hardware and features for each node. If you use constraints make sure to include the correct partition, otherwise the job will not start due to the error “Requested node configuration not available”.

Unless otherwise specified, for each GPU requested you will get a quarter (an eighth in the lindahl4 partition) of the CPU cores and 512 MB of memory for each CPU core.
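
As a worked example of this rule, the defaults can be computed with a few lines of shell arithmetic. This is a sketch under stated assumptions: default_share is a hypothetical helper, and "CPU cores" is taken to mean the logical CPU count shown in the CPU column of snodeinfo, which may not be how the scheduler counts them.

```shell
#!/bin/bash
# Hypothetical helper:
#   default_share <logical CPUs on node> <GPUs on node> <GPUs requested>
# Prints the default CPU count and memory (in MB) implied by the rule above.
default_share() {
    local node_cpus=$1 node_gpus=$2 req_gpus=$3
    # Each GPU comes with an equal share of the node's CPUs.
    local cpus=$(( node_cpus / node_gpus * req_gpus ))
    # 512 MB of memory per CPU.
    local mem_mb=$(( cpus * 512 ))
    echo "$cpus CPUs, $mem_mb MB"
}

default_share 32 4 2   # a lindahl1 node: prints "16 CPUs, 8192 MB"
```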

You should use the Slurm environment variable SLURM_CPUS_ON_NODE in your script to tell your program how many threads to use (for GROMACS that is done using the -nt or -ntmpi and -ntomp flags).

Since the nodes can be shared between jobs, you may want to be specific about what you request. For example, if all you request is two GPUs and there happens to be one GPU available on each of two different nodes, then that will satisfy your request. But that means that you will need a program that supports MPI to communicate between these two nodes. Most likely that is not what you want, so make sure to explicitly specify the number of nodes, e.g.:

#SBATCH -G 2 -N 1

Because of the node sharing, Slurm also manages system memory (RAM). So if you know that your job needs a lot of RAM, make sure to request sufficient resources using the --mem (i.e. memory per node) or the --mem-per-gpu flag. On the other hand, you shouldn’t request too much either, so it’s fine to start with the default and increase it in case your job terminates with a memory error.

Here is a working example job script:

#!/bin/bash

#SBATCH -p lindahl1
# name of job in queue
#SBATCH -J job-name
# request 2 GPUs on one node and 4 GB RAM
#SBATCH -G 2 -N 1 --mem=4G
# time limit for job
#SBATCH -t 12:00:00
# output file names for stdout and stderr
#SBATCH -e job-%J.err
#SBATCH -o job-%J.out
# Send an email in case the simulation fails
#SBATCH --mail-type=FAIL --mail-user=yourname@example.com
# the following line is commented out and will be ignored by Slurm
##SBATCH -C skylake

# Load modules
module unload gromacs
module load gromacs/2023.2

srun gmx mdrun -nt $SLURM_CPUS_ON_NODE

As you may have already figured out, Slurm looks for lines starting with #SBATCH to pick up the options you set. The shell (bash) treats lines starting with # as comments and ignores them, so the file runs as a regular script. Any of these flags can be specified either directly on the command line or in the script in a #SBATCH line.

Note that in order to run on new GPUs you may need to load a sufficiently recent version of CUDA. In the job script, just add for example:

module load cuda/11.8

before loading/sourcing GROMACS.

One rather important sbatch flag is -d (or --dependency=<dependency_list>) which defers the start of this job until the specified dependencies have been satisfied. The most common usage is:

#SBATCH -d singleton

which begins execution after any previously launched jobs sharing the same job name and user have terminated. This can be very useful to run a job over several days. By default Slurm limits the time for a job to 24h; to work around this limit, you can write a job script containing -d singleton and submit it multiple times, so that each job continues the computation of the previous one.
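
Submitting the same script several times can be scripted. The sketch below only prints the sbatch commands so that they can be inspected first; chain_commands, my-sim and run.sh are hypothetical names used for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the sbatch commands for a chain of jobs that
# share one job name, so that "-d singleton" runs them one after another.
# Usage: chain_commands <count> <job-name> <script>
chain_commands() {
    local count=$1 name=$2 script=$3 i
    for (( i = 1; i <= count; i++ )); do
        echo "sbatch -J $name -d singleton $script"
    done
}

chain_commands 3 my-sim run.sh
```

Piping the output to bash would actually submit the jobs. For this to be useful, the job script has to continue from where the previous job stopped; for GROMACS that typically means starting mdrun with the -cpi flag so that it reads the last checkpoint file.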

NB: sbatch exports your current environment (e.g. loaded modules and environment variables) to the node where your job is executed. For the GROMACS module, it is important that it is loaded in the sbatch script, i.e. on the node where your job will run, because it will pick the correct binary based on the SIMD capability of the CPUs in the node. So if you loaded it already on the login node to prepare your job files, either unload it before you submit your job with sbatch, or put module unload gromacs first in your sbatch script, followed by module load gromacs (as above).

Other useful Slurm commands to read about are: squeue, scancel, salloc, srun, …

Interactive node allocation

The Slurm command salloc takes the same arguments as sbatch, e.g.:

$ salloc -p lindahl1 -N 1 -G 4 --mem=16G

allocates a node for interactive usage. When that command returns, it has spawned a new shell, and tells you the numerical ID of the allocation that will show up in squeue. Note that a node has just been allocated for you, but you are still on the login node! You could view the allocated IDs with squeue to find out which node you have, or just execute srun hostname. Any command you execute with srun in this terminal session will be executed on the allocated node.

If necessary you can log in to your allocated node with ssh $SLURM_JOB_NODELIST. Note that this login will exit if your terminal closes, so only use it for genuinely interactive work that you’re prepared to lose when you close your laptop, lose network, etc. To get an interactive shell on a node right away (no need to salloc first) you could also run:

$ srun --pty -p lindahl1 -N 1 -G 2 --mem=4G -- bash -l

Using local scratch disks

Accessing files in your /nethome directory tends to be slow, and compute jobs can easily overload the system, causing problems for other users. If your analysis or simulation causes a lot of I/O, e.g. because it frequently accesses files, you should probably run your job in a temporary, local directory. To do this, create a directory in /scratch in your job script and make sure all I/O takes place there (e.g. by changing into it with cd, or by using the --chdir flag of srun). Below is an example of how to move data onto the local /scratch disk of a compute node, run a GROMACS simulation there, and then copy the data back:

# Directory from which this script was submitted to Slurm
SUBMITDIR="${SLURM_SUBMIT_DIR}"

# Working directory on /scratch on compute node
WORKDIR="/scratch/${SLURM_JOB_NAME}"
echo "Making dirs in $WORKDIR"
srun -n "$SLURM_JOB_NUM_NODES" mkdir -p "$WORKDIR"

# Copy data to /scratch on compute node
echo "Rsyncing data from $SUBMITDIR to $WORKDIR"
srun -n "$SLURM_JOB_NUM_NODES" rsync -au "${SUBMITDIR}"/* "$WORKDIR"

# Load modules
module unload gromacs
module load gromacs/2023.2

srun --chdir "$WORKDIR" gmx mdrun -nt "$SLURM_CPUS_ON_NODE"

# Copy data from /scratch on compute node
echo "Rsyncing data from $WORKDIR to $SUBMITDIR"
srun -n "$SLURM_JOB_NUM_NODES" rsync -au "${WORKDIR}"/* "${SUBMITDIR}/"

It is important to remember that the /scratch directory is local to each compute node and that data stored there is deleted automatically when your job terminates. This differs from, for example, the PDC compute resources, where the /scratch directory is a fast network file system that is shared among all nodes, is accessible from the login node, and may keep data for a limited time after the job has finished. This means that if you run the Slurm command scancel <jobid> on the login node, all data generated using the script above will be lost. To avoid this, first ssh to the compute node and then either stop the GROMACS process with kill -2 <pid> (the batch script will then do the transfer before exiting) or transfer the files back manually.
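
One possible way to reduce the risk of losing data is to let the job script itself react to a termination signal. The sketch below is untested on this cluster and rests on assumptions: run_and_stage_out is a hypothetical helper, it assumes that the batch shell receives SIGTERM when the job is cancelled and is given enough time before being killed (both depend on the Slurm configuration), it expects the payload to shut down cleanly on SIGTERM, and it uses cp -R instead of rsync to stay self-contained.

```shell
#!/bin/bash
# Hypothetical helper: run a command inside a work directory and always copy
# the directory contents back afterwards, even if SIGTERM arrives meanwhile.
# Usage: run_and_stage_out <workdir> <submitdir> <command> [args...]
run_and_stage_out() {
    local workdir=$1 submitdir=$2 pid status
    shift 2
    # Run the payload in the background so the shell can handle signals promptly.
    ( cd "$workdir" && exec "$@" ) &
    pid=$!
    # Forward SIGTERM to the payload instead of dying immediately.
    trap 'kill -TERM "$pid" 2>/dev/null' TERM
    status=0
    wait "$pid" || status=$?
    # If wait was interrupted by the signal, keep waiting until the payload is gone.
    while kill -0 "$pid" 2>/dev/null; do
        status=0
        wait "$pid" || status=$?
    done
    trap - TERM
    # Copy everything (including hidden files) back to the submit directory.
    cp -R "$workdir"/. "$submitdir"/
    return "$status"
}
```

In the job script above, the mdrun line and the final copy could then be replaced by something like run_and_stage_out "$WORKDIR" "$SUBMITDIR" gmx mdrun -nt "$SLURM_CPUS_ON_NODE" for a single-node job.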

It may also be useful to use a /scratch directory to build software on a compute node. For example, assume you want to run a simulation with your custom version of GROMACS; then add the following to your Slurm script before building:

builddir="/scratch/$USER/$SLURM_JOBID/build"
mkdir -p "$builddir"
cd "$builddir"
cmake /PATH_TO_SOURCE_CODE [BUILD_OPTIONS]

If you’re generating data on /scratch, you may have to copy input there and copy output back at the end of your job. If you want to exclude certain files, rsync --exclude may be more useful than simply using cp. Be careful not to overwrite your data (maybe keep a backup somewhere else) and remember that the scratch directory is lost once the job has terminated, either because it has completed or because it has failed!

Running on multiple nodes with multidir

You may find yourself having to run many parallel simulations that are either independent or require very limited communication. For molecular simulations, that may be the case of independent replicas, or replica exchange algorithms. In GROMACS, one can run a list of such simulations in one go by adding the -multidir flag to mdrun. However, it may happen that all the simulations do not fit on a single node. Here are a few tips and tricks to run with -multidir on multiple nodes:

  • When requesting CPUs/GPUs, remember that Slurm flags can refer either to global quantities (e.g. --ntasks) or to per-node quantities (e.g. --ntasks-per-node). It’s easy to get confused, so be careful or you may end up over- or under-using resources!
  • Remember that, on this cluster, scratch directories are local. Hence, it’s not a good idea to store files there that all processes on all nodes need to access. The same goes for building: make sure the binaries you are sourcing are in a global directory.
  • When using the -ntomp flag of mdrun, use SLURM_CPUS_ON_NODE instead of SLURM_JOB_CPUS_PER_NODE; the analogous variable for GPUs is SLURM_GPUS_ON_NODE. For example, suppose you have two nodes and eight GPUs allocated for your job, and want to run with one MPI rank per GPU. Then you can let bash do the math regarding the number of threads and write srun -n 8 gmx mdrun [...] -ntomp $((SLURM_CPUS_ON_NODE/SLURM_GPUS_ON_NODE)).
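
The thread arithmetic from the last tip can be made a little more defensive. This is a minimal sketch: threads_per_rank is a hypothetical helper, and falling back to 1 when the Slurm variables are unset is an illustrative choice so that the function also works outside a job.

```shell
#!/bin/bash
# Hypothetical helper: OpenMP threads per rank when running one rank per GPU.
# Falls back to 1 if the Slurm variables are unset (e.g. outside a job).
threads_per_rank() {
    local cpus=${SLURM_CPUS_ON_NODE:-1} gpus=${SLURM_GPUS_ON_NODE:-1}
    # Guard against a division by zero if no GPUs were allocated.
    (( gpus > 0 )) || gpus=1
    local n=$(( cpus / gpus ))
    (( n < 1 )) && n=1
    echo "$n"
}
```

Inside a job script it could be used as srun -n 8 gmx mdrun [...] -ntomp "$(threads_per_rank)".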

In any case, remember that this cluster does not have a fast interconnect, so running simulations that require a lot of communication between nodes is highly discouraged!