Installing and managing software

Installing software

Do I have to install the software?

No, if your software is available in the Ubuntu software repository. In that case, feel free to ask in the Workstations stream in Zulip whether it can be added to the default workstation setup.

No, if somebody else has already installed it and made it available as a module. Type module avail on the command line for a list of available packages. If you’d like to maintain one of them or add a new one, don’t hesitate to volunteer in the Cluster stream in Zulip.


Don’t use sudo!

Installation instructions that include sudo commands assume that you are installing on a single computer on which you have superuser permissions. On our workstations and the cluster, you don’t have these permissions for practical reasons (imagine 50 users installing software this way on a shared system). Besides, you’ll want to install software in our shared filesystem (/nethome) so that you can use it on all machines after installing it once. Your colleagues then just need to type module use <path_to_your_module_files> in the terminal to use the software packages you installed (provided you don’t change the default of granting read and execute permissions to other group members). If the software package is continuously used by several group members, please consider volunteering in the Cluster stream in Zulip to maintain an official module in /opt/tcbsys.

Here’s how to avoid sudo and install in a custom directory with the most common build systems.


Directory layout

Most software packages are distributed as source code that needs to be compiled and installed. To keep things organized, we recommend creating your own software directory with the subfolders sources, packages and modules. sources holds the source code you downloaded (only the archive); we recommend extracting the archive and building the software in /scratch, where the files can be removed after installing. packages contains only the installed executables and libraries, and modules collects the module files needed to load each software package into your environment.

For scientific software, reproducibility is important. Because of that, we recommend a directory layout that allows installing several versions of the same software, so one can verify that a new version does not produce different results. Always make notes and be clear about which version of the software you used to generate and analyze your data. A good layout could look like this:

software
├── modules
│   └── <NAME>
│       └── <VERSION>         <-- module file
├── packages
│   └── <NAME>
│       └── <VERSION>         <-- installation directory
└── sources
    └── <NAME>
        └── <VERSION>.tar.gz  <-- source code archive
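
The layout above can be created with a few mkdir calls. The following is only a sketch: the helper name make_layout is made up for illustration, and the root directory, package name and version are passed as arguments.

```shell
# Create the directory layout shown above for one version of one package.
# make_layout is a hypothetical helper, not a command provided on the cluster.
# Usage: make_layout <software_root> <name> <version>
make_layout() {
    # sources/<NAME>             will hold the archive <VERSION>.tar.gz
    # packages/<NAME>/<VERSION>  is the installation directory
    # modules/<NAME>             will hold the module file named <VERSION>
    mkdir -p "$1/sources/$2" \
             "$1/packages/$2/$3" \
             "$1/modules/$2"
}
```

For the GNU Awk example used further down, the call would be make_layout "$HOME/software" gawk 5.2.0.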


Binary packages (tar.gz)

If you can download a precompiled binary package, that is usually the easiest solution (you just need to unpack it, no installation required). In that case, you should look for a download that is a good match for our environment (Ubuntu 22.04, x64/amd64 at the time of writing). If it was compiled for a different operating system, it might not run due to incompatibilities regarding C library version or dependencies. A tool to check which libraries a binary requires is ldd. The following example shows a missing library:

$ ldd /opt/tcbsys/awk/gawk-4.1.4/bin/awk
linux-vdso.so.1 (0x00007ffc7e39c000)
libreadline.so.6 => not found
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff99ddb4000)

If the dependencies cannot be resolved in a simple way (see Running software for tips and tricks), compiling from source is the next option.
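
For binaries with many dependencies, the ldd output can get long. A minimal sketch of a filter that prints only the unresolved libraries is shown here; missing_libs is a hypothetical helper name, and it relies on the "not found" marker that ldd prints in the example above.

```shell
# Print the names of the shared libraries that could not be resolved.
# missing_libs is a hypothetical helper name.
# Usage: missing_libs <path_to_binary>
missing_libs() {
    # Keep only the lines that ldd marks as "not found" and
    # print their first column, i.e. the library name
    ldd "$1" | awk '/not found/ { print $1 }'
}
```

Empty output means that all libraries could be located.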


Binary packages (deb)

This is a special case of a binary package. The .deb package format is the native format for Debian and Ubuntu. These packages usually get installed using apt, dpkg or the software center app, but this requires special permissions and alters the system. However, just like a .tar archive, the files in a .deb package can be unpacked and copied to your home directory. Here is an example:

$ dpkg-deb -x Pencil_3.0.4_amd64.deb /tmp/pencil
$ mkdir -p $HOME/software/packages/pencil
$ mv /tmp/pencil $HOME/software/packages/pencil/3.0.4

You will have to check the directory structure inside the extracted directory to locate the libraries and executable files and possibly fix some links.
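
As a starting point for that check, the sketch below lists the files with the executable bit set and the symbolic links whose target does not exist. inspect_unpacked is a hypothetical helper name; note that shared libraries often carry the executable bit as well, so the first list may contain more than just programs.

```shell
# List executables and dangling symbolic links below an unpacked package.
# inspect_unpacked is a hypothetical helper name.
# Usage: inspect_unpacked <directory>
inspect_unpacked() {
    echo "Executable files:"
    # Regular files with the owner's execute permission set
    find "$1" -type f -perm -100
    echo "Broken symbolic links:"
    # "test -e" follows the link, so it fails when the target is missing
    find "$1" -type l ! -exec test -e {} \; -print
}
```

For the example above, you would run inspect_unpacked "$HOME/software/packages/pencil/3.0.4".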


Configure script
(./configure && make && sudo make install)

This is one of the simplest cases when building from source. Usually the configure script supports the --prefix flag that sets the path to the installation directory. You can run ./configure --help to check. So, for example, installing GNU Awk version 5.2.0 in your home directory would look something like this:

$ mkdir -p /scratch/$USER/build
$ cd /scratch/$USER/build
$ tar xzf $HOME/software/sources/gawk/5.2.0.tar.gz
$ cd gawk-5.2.0
$ ./configure --prefix=$HOME/software/packages/gawk/5.2.0
$ make -j8
$ make install

See section Running software below for how to run your newly installed program.
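
Before starting a long build, it can be worth confirming that the configure script really documents --prefix. This is a small sketch; supports_prefix is a hypothetical helper name, and it assumes that the script prints its options when called with --help.

```shell
# Succeed if the configure script in <source_dir> mentions --prefix in its help.
# supports_prefix is a hypothetical helper name.
# Usage: supports_prefix <source_dir>
supports_prefix() {
    "$1/configure" --help 2>/dev/null | grep -q -e '--prefix'
}
```

In the extracted source directory, supports_prefix . && echo "prefix supported" would print the message only if the flag is listed.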


Python / Conda / pip / …

Python is prominent in scientific programming. It has become a popular choice in scientific communities for various reasons, including its rich scientific libraries, e.g. NumPy and SciPy for data analysis, Matplotlib for visualization, and PyTorch and TensorFlow for machine learning. However, Python is also known for its “dependency hell”, where managing and resolving dependencies can become complex. Virtual environments mitigate this problem, and we recommend using them in day-to-day work rather than working with a global Python environment. In this section, we will illustrate:
1) How to create a virtual environment,
2) How to install Python packages from the official Python Package Index (PyPI, also known as the Cheese Shop) or Conda,
3) How to compile and install a Python package from source.


Virtual environment with venv

The easiest way to create a virtual environment is to use the default Python binary installed on our cluster:

# create a new virtual environment
$ python3 -m venv $HOME/software/<env_name>
# activate the virtual environment
$ source $HOME/software/<env_name>/bin/activate

After this, you should notice that (env_name) appears in front of your command prompt. To verify that you are indeed using the right Python binary, type which python3; it should print /nethome/<your_user_name>/software/<env_name>/bin/python3. Deactivate the virtual environment with deactivate. For more information, visit the venv webpage.


Virtual environment with a package manager

If you are dealing with more complex dependencies, e.g. with non-Python packages, or want to use a different version of Python, Conda (or Anaconda) is an alternative way to create virtual environments. However, you need to first download Anaconda from its website. You can then run the downloaded Anaconda3-xxx-linux-x86_64.sh with bash and follow the prompts to complete the installation. Make sure to select yes when asked if you want to initialize Anaconda and add it to your shell’s PATH.

After that, you can

# create a virtual environment
$ conda create -n <env_name> python=<desired_python_version>
# activate the virtual environment
$ conda activate <env_name>
# and deactivate it
$ conda deactivate

Similarly, you should see a (env_name) in front of your prompt. For more information, visit the Conda webpage.


Package installation

Installing (simple) packages is quite straightforward in Python: first activate a virtual environment, then install the package with either pip or conda.

$ pip install <package_name>

or 

$ conda install <package_name>
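
Both commands install into whichever environment is currently active, so a script can check for an active environment first. The sketch below relies on the VIRTUAL_ENV variable set by venv's activate script and on the CONDA_DEFAULT_ENV variable set by conda activate; env_is_active is a hypothetical helper name, and treating Conda's base environment as "not active" is a choice made for this example.

```shell
# Succeed only if a venv or a non-base Conda environment is active.
# env_is_active is a hypothetical helper name.
env_is_active() {
    # venv: the activate script exports VIRTUAL_ENV
    [ -n "${VIRTUAL_ENV:-}" ] && return 0
    # Conda: CONDA_DEFAULT_ENV holds the name of the active environment
    [ -n "${CONDA_DEFAULT_ENV:-}" ] && [ "$CONDA_DEFAULT_ENV" != "base" ]
}
```

A guarded installation then reads env_is_active && pip install <package_name>.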

If you prefer not to work with virtual environments you can install programs (executable applications and scripts) by using pipx. Behind the scenes this installs the program in an isolated virtual environment, meaning that no dependencies will clash with your other tools:

$ pipx install <program_name>

It is also possible to install packages into a specific directory, which makes them available to scripts placed in that same directory (type pip3 to be sure you use the pip installation belonging to our python3):

$ pip3 install --target=<path_to_installation_directory> <package_name>


Compile packages from source

Sometimes, you might need to compile a Python package yourself, either because it is not indexed on PyPI and is only available as a git repository, or because you want to install a developer version of the package (or even patch it yourself 🙂 ). Most installable Python packages come with a setup.py and/or a pyproject.toml file that tells pip how to build, configure, and install the package and its dependencies. We recommend following the detailed documentation of each specific package (e.g. for MDAnalysis). Generally, this is not that different from installing with pip from a remote source: you download the package, activate your virtual environment, change into the package directory and then run

$ pip install .

If you want to install in developer mode (so you can modify the source code actively), use

$ pip install -e .

You can verify it is installed correctly by

$ python3
import <package>
print(<package>.__file__)

As above, pipx can be similarly used to install downloaded programs and scripts without using a virtual environment by doing

$ pipx install . 

For packages that are not installable, i.e. basically a random folder with a lot of messy scripts, you might need to either copy the folder to the same directory as your scripts or import it at runtime:

$ python3
import sys
sys.path.insert(1, '/path/to/software')

Alternatively, you can just add the respective folder to your PYTHONPATH environment variable, which Python searches for modules (PATH is only used to locate executables).
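
For Python code, the variable that the interpreter reads at startup to extend its module search path is PYTHONPATH. The following sketch sets it from the shell instead of calling sys.path.insert in every script; pythonpath_prepend is a hypothetical helper name.

```shell
# Prepend a directory to PYTHONPATH without leaving a stray ":" behind
# when the variable was empty or unset before.
# pythonpath_prepend is a hypothetical helper name.
# Usage: pythonpath_prepend <directory>
pythonpath_prepend() {
    # ${VAR:+text} expands to text only if VAR is set and non-empty
    PYTHONPATH="$1${PYTHONPATH:+:$PYTHONPATH}"
    export PYTHONPATH
}
```

After pythonpath_prepend /path/to/software, a plain import of a module located in that folder works in any Python started from the same shell.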


PyTorch and TensorFlow

As AI/ML becomes more and more important, we present an example of how to install two of the most popular Python packages in the field: PyTorch and TensorFlow.

# PyTorch
$ conda create -n pt-gpu pytorch torchvision torchaudio python=x.x \
  pytorch-cuda=x.x -c pytorch -c nvidia
# TensorFlow
$ conda create -n tf-gpu tensorflow-gpu python=x.x cudatoolkit=x.x

Not every combination of Python and CUDA versions works. However, python=3.11 and pytorch-cuda=11.8 could be successfully combined for PyTorch, and python=3.7 and cudatoolkit=11.8 were successfully used for TensorFlow. Make sure that the chosen version of CUDA is compatible with the Nvidia drivers installed on the cluster. To check, run nvidia-smi on a GPU node; since the setup is consistent across all GPU nodes, any of them will do (to log in to a node, see Partitions->Slurm workload manager->Interactive node allocation).

Here is an example script showing how to run your PyTorch/TensorFlow code with CUDA on our cluster:

#!/bin/bash
#SBATCH -p lindahl1,lindahl2,lindahl3,lindahl4
#SBATCH -G 1 -N 1 --mem=4G
#SBATCH -t 12:00:00
#SBATCH -e job-%J.err
#SBATCH -o job-%J.out
#SBATCH --mail-type=ALL --mail-user=yourname@example.com

# Source (mini)conda/(micro)mamba wherever it is installed
# For example:
source "${HOME}/conda/etc/profile.d/conda.sh"

# If you want to run a PyTorch script:
conda activate pt-gpu
srun python3 <my-pytorch-code>.py 
conda deactivate

# If you want to run a TensorFlow script:
conda activate tf-gpu
srun python3 <my-tensorflow-code>.py
conda deactivate


CMake

CMake is a build system generator that many complex software packages use to configure their compilation and installation. Similarly to configure scripts, you run in your build directory:

$ cmake <path_to_source_code> -DCMAKE_INSTALL_PREFIX=<install_dir> \
                              <other_CMake_options>
$ make -j 8 
# if your software package provides a test suite, run it
# (the name of the target varies between packages, e.g. check or test)
$ make check
$ make install

In the above example, -DCMAKE_INSTALL_PREFIX specifies the directory into which you’d like to install your software package. Variables starting with CMAKE_ are defined by CMake itself, so this flag is most likely shared among all software packages you wish to install.

However, software developers usually only make the effort to offer installation via CMake if there are many software-specific flags to customize your installation. For that reason, we don’t provide (more) typical CMake flags here but rather recommend you have a look at the documentation of the respective software package you’d like to install.

Hint: on the command line, CMake variables are always set with -D. If the documentation page of your software lists CMake variables without the leading -D, prepend it!

CMake is very verbose: be prepared to see a lot of output in your terminal while configuring, building and installing! As a nice side effect, it usually issues very precise and clear error messages if something goes wrong, but you might need to scroll up a bit.
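
Instead of scrolling, the output can be written to a log file and searched. The helper below is a sketch, not a CMake feature: run_logged is a hypothetical name, and it assumes that the relevant messages contain the word "error" in some capitalization.

```shell
# Run a command, keep its complete output in a log file and, if it fails,
# print the first few log lines that mention an error.
# run_logged is a hypothetical helper name.
# Usage: run_logged <logfile> <command> [arguments...]
run_logged() (
    log="$1"
    shift
    status=0
    # Capture both stdout and stderr; remember the exit status on failure
    "$@" > "$log" 2>&1 || status=$?
    if [ "$status" -ne 0 ]; then
        grep -n -i -m 5 'error' "$log" \
            || echo "no line mentioning an error found in $log"
    fi
    exit "$status"
)
```

A configuration run would then be started as run_logged configure.log cmake <path_to_source_code> -DCMAKE_INSTALL_PREFIX=<install_dir>. The price is that nothing is shown while the command runs; tail -f configure.log in a second terminal brings the live output back.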


GROMACS

The following instructions may be helpful to whoever would like to compile an unofficial/modified version of GROMACS currently under development, which would be impractical to build as a module. If you just want to run simulations using one of the official GROMACS releases, just run module load gromacs/<release>.

GROMACS is highly optimized in terms of performance. To make most efficient use of CPUs, GROMACS can adapt itself to the instruction set the CPU uses for Single Instruction, Multiple Data (SIMD) parallelism (see more on https://manual.gromacs.org/current/install-guide/index.html#simd-support and below in this section). If you’re really into it, you can find the SIMD instruction sets supported by your CPU in /proc/cpuinfo in the category flags. In general, it’s sufficient, however, to let GROMACS detect the optimal instruction set itself (as it’s done in the script below). Even if the choice made by GROMACS isn’t ideal, most of your simulations on our cluster run mainly on the GPU such that the CPUs don’t determine the overall performance.
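
If you do want to look at the flags, the following sketch extracts the SIMD-related entries from the first flags line. simd_flags is a hypothetical helper name, the selection of prefixes (sse, ssse, avx, fma) is an assumption about which flags are of interest, and the flags field is specific to x86 CPUs.

```shell
# Print the SIMD-related CPU flags, one per line.
# simd_flags is a hypothetical helper name.
# Usage: simd_flags [cpuinfo_file]   (defaults to /proc/cpuinfo)
simd_flags() (
    cpuinfo="${1:-/proc/cpuinfo}"
    # All cores of one CPU report the same flags, so the first line suffices
    grep -m 1 '^flags' "$cpuinfo" \
        | tr ' ' '\n' \
        | grep -E '^(sse|ssse|avx|fma)' \
        | sort -u
)
```

Run it on a compute node of the partition you are interested in, since the login node may have a different CPU.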

The script below (build_gmx.sh) can be used to build GROMACS from the source code contained in $HOME/<my_gromacs>.

#!/bin/bash
#SBATCH --time 2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH -G 1
# Run one compilation job at a time to use as little bandwidth as possible
#SBATCH -J build-gmx
#SBATCH -d singleton

PARTITION=$SLURM_JOB_PARTITION

# Target directory where to put GROMACS binaries for the selected partition
# (-p also creates $HOME/gmx-builds when building for the first time and
# does not fail if the directory already exists)
bindir="$HOME/gmx-builds/$PARTITION"
mkdir -p "$bindir"

module load cmake/3.24.2
module load gcc/11.2 # or use default gcc of our cluster (11.3.0)
module load cuda/11.8.0
module load openmpi/4.1.5
module load hwloc/2.7.1
module load fftw/3.3.10-sse2-avx-avx2-avx128fma-avx512

# Put the build directory on /scratch to speed-up I/O
builddir="/scratch/$USER/$SLURM_JOBID/build"
mkdir -p "$builddir"
cd "$builddir"

export CC=gcc-11
export CXX=g++-11

# Building with MPI
cmake $HOME/<my_gromacs> -DCMAKE_INSTALL_PREFIX=$bindir \
                         -DGMX_MPI=ON \
                         -DGMX_BUILD_OWN_FFTW=OFF \
                         -DGMX_GPU=CUDA \
                         -DGMX_HWLOC=ON
make -j 8
make install

cd ..
rm -r "$builddir"
mkdir -p "$builddir"
cd "$builddir"

# Building without MPI
cmake $HOME/<my_gromacs> -DCMAKE_INSTALL_PREFIX=$bindir \
                         -DGMX_MPI=OFF \
                         -DGMX_BUILD_OWN_FFTW=OFF \
                         -DGMX_GPU=CUDA \
                         -DGMX_HWLOC=ON
make -j 8
make install

cd ..
rm -r "$builddir"

The script is called via:

$ sbatch -p <partition_name> build_gmx.sh

After the job is completed, the directory $HOME/gmx-builds/<partition_name> will contain a subdirectory bin with the GROMACS binaries in single precision with MPI (gmx_mpi) and without MPI (gmx). The script is agnostic of the type of hardware (CPU and GPU) on the partition’s nodes: the optimal SIMD instruction set is detected automatically (we trust GROMACS’s automatic hardware detection). If the node does not have any GPU, just remove #SBATCH -G 1, module load cuda/11.8.0 and -DGMX_GPU=CUDA from the script above. Note that it is important to clean the build directory after each compilation to avoid conflicts between the two configurations (MPI ON/OFF), because CMake reuses and only partially overrides the files from the previous configuration and compilation. When running a job, the desired GROMACS build can be sourced as:

[...]
#SBATCH --partition=<partition_name>
[...]
PARTITION=$SLURM_JOB_PARTITION
source $HOME/gmx-builds/$PARTITION/bin/GMXRC
[...]

Note that it is not possible to compile on more than one partition with a single Slurm batch script.

Alternatively, one could use a similar script to build for a few different SIMD instruction sets. Currently, the optimal SIMD for the CPUs on lindahl[1,2,4] is AVX2_256, while for the ones on lindahl3 it is AVX2_128. The desired SIMD can be specified with the CMake option -DGMX_SIMD=<simd>. Assuming each build is installed with the prefix $HOME/gmx-builds/<SIMD>, so that its binaries end up in $HOME/gmx-builds/<SIMD>/bin, the correct binaries can be sourced at run time via:

gmx_dir="$HOME/gmx-builds"
SIMD=$(/opt/tcbsys/modulefiles/get_simd.tcl "$gmx_dir")
GMXRC="$gmx_dir/$SIMD/bin/GMXRC"
source "$GMXRC" || { echo "Failed to source GMXRC file $GMXRC"; exit 1; }

A similar setup is used by the GROMACS modules you can already load on the cluster. This way, the modules can detect the best SIMD instruction set and choose the corresponding GROMACS installation behind the scenes without you noticing, while you have to take care of selecting the most suitable SIMD instruction set for the partition when installing yourself.


Running software

So now that the software is (compiled and) installed, how can you run it without having to type the full path of the executable every time, and how does it locate all the necessary libraries?

We first guide you through the relevant environment variables PATH and LD_LIBRARY_PATH. These variables can be modified directly in the terminal (as explained in the next two sections), but in your day-to-day work, you’d rather encode this information in a module file, so that PATH and LD_LIBRARY_PATH are updated when the module is loaded, than set them on the command line or in your .bashrc.


PATH variable

The PATH variable contains a list of directories that are searched for executable files whenever you enter a command without a path. In one of the examples above, you ran ./configure. In this case, configure is the name of the command, and ./ is the path that points to the current directory. Without that path, the shell would search all the directories in the PATH variable for a command named configure and, not finding one there, would show you a command not found error. To check your current PATH variable, you can simply print it with echo:

$ echo "$PATH"
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

What you see above is just the default on the system; each of the paths is separated by “:”. Whenever you modify the PATH variable in your terminal session, you need to make sure to not remove these parts. Otherwise, simple commands like ls will not be found anymore. So usually, you do something like this to expand the variable:

$ export PATH="$HOME/software/packages/gawk/5.2.0/bin:$PATH"

or to have your own directory searched last:

$ export PATH="$PATH:$HOME/software/packages/gawk/5.2.0/bin"

If you run this command, the variable will only be altered in your current terminal session. If you want this change to persist, you have to add it to your bash shell initialization file $HOME/.bashrc. But be careful, if you introduce errors in that file, you might not be able to log in anymore. So it is a good idea to edit in one terminal session and test in a second one by running bash -l or source ~/.bashrc.
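
One side effect of a plain export line in .bashrc is that every additional source ~/.bashrc adds the same directory once more. A small guard avoids this; the sketch below uses a hypothetical helper name, path_prepend.

```shell
# Prepend a directory to PATH unless it is already listed there.
# path_prepend is a hypothetical helper name.
# Usage: path_prepend <directory>
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}
```

After calling it, command -v gawk shows which executable the shell will actually pick up.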


LD_LIBRARY_PATH variable

Just like the PATH variable is used to locate executable files, the LD_LIBRARY_PATH variable is used to locate library files. The main difference is that LD_LIBRARY_PATH is not set by default. Instead, each system has a cache that is searched (see “man ld.so” if you’re interested in all the gory details).

One pitfall is that once you set the variable, the directories in it will always be searched before any system location (for the PATH variable, you can decide the order as shown above). That means if you add a directory of, for example, an old GCC installation, you might all of a sudden get errors when running programs that rely on the newer GCC library installed on the system.

However, there is a way around this problem. Executable files can have an embedded search path (again, see “man ld.so” for details if interested). Here is an example for a GROMACS executable:

$ patchelf --print-rpath /opt/tcbsys/gromacs/2023/gmx/AVX2_256/bin/gmx
$ORIGIN/../lib

$ORIGIN is replaced at runtime with the directory that contains the executable file. So this instructs the dynamic loader to look in the directory /opt/tcbsys/gromacs/2023/gmx/AVX2_256/lib for the libraries needed to run the program.
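
When changing the embedded search path is not an option, another way to limit the side effects described above is to set LD_LIBRARY_PATH for a single command only, so that all other programs in the session keep using the system libraries. This is a sketch with a hypothetical helper name, with_libdir.

```shell
# Run one command with an extra library directory, leaving the
# LD_LIBRARY_PATH of the calling shell untouched.
# with_libdir is a hypothetical helper name.
# Usage: with_libdir <library_directory> <command> [arguments...]
with_libdir() (
    libdir="$1"
    shift
    # The function body is a subshell, so this export ends with the command
    LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
    "$@"
)
```

For example, with_libdir <path_to_old_gcc>/lib64 ./my_program affects only my_program (both the directory and the program name are placeholders).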


Modules

Modules are a convenient way to automatically update your environment variables with the paths required to execute an installed software package. Usually, a module file is stored in a folder named after the software package (e.g. gromacs), is itself named after the version of the software package (e.g. 2023.2) and looks roughly like this:

#%Module

proc ModulesHelp { } {
     puts stderr "<short_description_of_the_software>"
}

set version <the_name_of_this_module_file>
conflict <the_name_of_the_software/folder>

set          root            <path_to_your_installation_directory>
prepend-path PATH            $root/bin
prepend-path LD_LIBRARY_PATH $root/lib

If your respective software is a Python package, it should be added to the environment variable PYTHONPATH, too, such that your Python interpreter can find it.
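
Since the module file differs between packages only in a few names and paths, it can be generated. The sketch below writes the template shown above into modules/<NAME>/<VERSION>; write_modulefile is a hypothetical helper name, and it assumes that none of the arguments contain spaces.

```shell
# Write a minimal module file following the template above.
# write_modulefile is a hypothetical helper name.
# Usage: write_modulefile <name> <version> <install_dir> <modules_dir>
write_modulefile() (
    name="$1"; version="$2"; prefix="$3"; moddir="$4"
    mkdir -p "$moddir/$name"
    # \$root is escaped so that it ends up literally in the module file
    cat > "$moddir/$name/$version" <<EOF
#%Module

proc ModulesHelp { } {
     puts stderr "$name $version installed in $prefix"
}

set version $version
conflict $name

set          root            $prefix
prepend-path PATH            \$root/bin
prepend-path LD_LIBRARY_PATH \$root/lib
EOF
)
```

For the GNU Awk example, the call would be write_modulefile gawk 5.2.0 "$HOME/software/packages/gawk/5.2.0" "$HOME/software/modules".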

To use your own modules, you type on the command line or add to your bash scripts or your .bashrc:

$ module use <path_to_your_module_files>

If you have, for example, a module file 2023.3 in /nethome/<your_user_name>/software/modules/gromacs, you should use

$ module use /nethome/<your_user_name>/software/modules/

To check which modules are available, you type

$ module avail

on the command line. For our above example, the command will return a list containing gromacs/2023.3. If you only want to see the available GROMACS modules, you can narrow down the list by typing

$ module avail gromacs

If you want to know more about a particular module, you can get this info from module help. In our example, you type

$ module help gromacs/2023.3

Once you’re confident about the module, you can load it with module load, e.g.

$ module load gromacs/2023.3

This command updates your environment variables as specified in the module file, i.e., in most cases, it prepends the correct paths to your PATH and LD_LIBRARY_PATH variables.

When you’re done using the respective software package, you can remove it from your environment with module unload, e.g.

$ module unload gromacs/2023.3

This command resets your environment back to the state it had before the module was loaded.

Note that modules are “automatically unloaded” when you close a terminal or a bash script finishes because environment variables are only propagated to child processes.
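
The same mechanism can be observed without any module. In this sketch, DEMO_ROOT is a made-up variable name and scope_demo a hypothetical helper: a child shell sets the variable, and the parent shell never sees the change.

```shell
# Show that a change made in a child process does not reach the parent.
# scope_demo and DEMO_ROOT are hypothetical names used for illustration.
scope_demo() {
    unset DEMO_ROOT
    # The child shell sets the variable and can see it ...
    sh -c 'DEMO_ROOT=/opt/demo; export DEMO_ROOT; echo "child:  $DEMO_ROOT"'
    # ... but the change is gone once the child has exited
    echo "parent: ${DEMO_ROOT:-<unset>}"
}
```

This is also why a module loaded inside a script that you execute is not loaded in your terminal afterwards.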