File systems and networks

Introduction

There are two main file systems available for users in the Molecular Biophysics Stockholm (MBS) group. They are the “nethome” NFS server (fs.biophysics.kth.se) and the CephFS distributed file storage. The first one stores all the users’ home directories and common software installations (modules), the second one data belonging to individual or collaborative projects. In addition to that each machine has a local scratch space that is meant for temporary files to lighten the load on the file servers and optimize performance of I/O intensive compute jobs.

We handle these storage systems carefully, but accidents happen, so make sure that your important data is backed up somewhere else. As a mental exercise, imagine that your primary storage fails a month before your thesis defense and your laptop breaks, gets stolen, or is lost. Do you have a plan for that?

The home directories are used both at the workstations and at the cluster, CephFS is only available at the cluster, and the scratch space is only locally available on each individual machine. It is important to understand the underlying principles of each filesystem before you start organizing your data or running jobs.

You should always keep in mind that reducing the number of files, for example by creating archives using tar, helps to use storage more efficiently. One important number in this context is IOPS (input/output operations per second) that the storage system can provide. Let’s say the storage system can provide 1000 IOPS and you want to process 1 million small files, then this would take at least 16 minutes (assuming one I/O operation per file for simplicity). But since you’re not the only user accessing the storage, effectively you may only get 100 IOPS and your job now takes close to 3 hours instead. That is the reason why we have limits set not only for the amount of data but also for the number of files.
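The arithmetic above can be reproduced with a small shell function. This is only a sketch: the helper name io_time is made up for illustration, and it keeps the same simplification of one I/O operation per file.

```shell
# Hypothetical helper: io_time <number of I/O operations> <IOPS>
# Prints the minimum time needed, in minutes and in hours.
io_time() {
    awk -v ops="$1" -v iops="$2" 'BEGIN {
        seconds = ops / iops
        printf "%.1f min (%.1f h)\n", seconds / 60, seconds / 3600
    }'
}

io_time 1000000 1000   # 16.7 min (0.3 h)
io_time 1000000 100    # 166.7 min (2.8 h)
```

Packing the same data into a single archive reduces the number of operations by orders of magnitude, which is why the file count matters as much as the data volume.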

Whether you do or do not need to write a data management plan for your project, it is a good idea to estimate how much data you will generate, where you will store it, and how you are planning to process it. How much data you generate can often depend on the settings you use. For example make sure that your GROMACS jobs write data to disk at a reasonable rate (i.e. don’t write trajectory output every 100 steps unless you know that you absolutely need it).

Further below you will find some examples for how to copy data efficiently to our storage resources. The following table shows a short summary about these filesystems. The subsections below describe each file system in detail.

| storage | path | purpose | availability | limits | backups |
|---|---|---|---|---|---|
| Nethome NFS server | /nethome/<USERNAME>/ | Linux home directories, shared software installations | cluster and workstations | 250 GB, 2 million files | nightly with history |
| CephFS storage | /mnt/cephfs/ | Project volumes, databases, other resources | cluster | 12 TB, 1 million files (per project volume) | no (not yet) |
| local scratch | /scratch/ | I/O intensive jobs | cluster | 0.5 to 2 TB | no; cleaned after each job |
| local scratch | /scratch/ | I/O intensive jobs | workstations | 0.5 to 1 TB | no; cleaned manually |

Table: overview of available storage options

Networks

Since we are describing network file systems here it can be beneficial to know how the network is set up at SciLifeLab and how our cluster and the NFS server are connected. At the end of this section is a graph illustrating the setup that we describe here. In general there are two types of networks, public and private. Public networks are accessible from anywhere (but firewalls may restrict access) while private networks are only accessible from within the particular network.

SciLifeLab domain

The general network at SciLifeLab that the workstations are connected to is a public network, and it is managed by SciLifeLab/KTH IT. If there are any issues with this network we rely on them to solve it. You can find information on the SciLifeLab intranet (VPN needed for access) regarding how to report errors. But coordinate with others in the group to try and pin down the issue before filing a ticket to make sure that it is not a problem on our end.

The domains used on this network are scilifelab.se and dyn.scilifelab.se. If the workstation is in the dyn subdomain it has a dynamic IP address (i.e. the address can change over time), otherwise it has a static IP address. If your workstation has a dynamic IP address and has been switched off, wait five or ten minutes after switching it on before trying to log in. The file server needs that delay to detect the new IP address of the machine and allow access.

To log in to your workstation remotely install a VPN client and set it up according to the instructions found on the intranet. Because these pages can only be accessed from SciLifeLab or using the VPN, we put a copy in this topic on Zulip for convenience.

Biophysics domain

KTH IT allocated a public network range for us with the domain name biophysics.kth.se. They trust us to handle this, on the condition that we only have well maintained servers there (i.e. no desktops or workstations allowed!). Here we have for example the cluster login nodes, the Zulip server and our LDAP servers.

To keep hostnames short we use a couple of other domains (biophysics.se, cryoem.se, tcblab.org). Hostnames there usually point to a host in the domain biophysics.kth.se.

If there are any issues with services on this network we have to fix them ourselves. While KTH IT has physical access to our servers they do not have accounts on them. So they can only help if the issue is with the underlying network.

Cluster domain

In addition to the public biophysics network we have a private network for our cluster. Here all the compute nodes, the storage servers and the login nodes are connected with a 100GbE network. Note that Ethernet networks have a higher latency than for example Infiniband networks (common in HPC clusters), more on that on the page Partitions.

Some nodes, like the login nodes, have connections to both the public and the private network in order to allow access to the cluster. This is a quite common setup in order to limit the number of nodes that are exposed to the Internet (and therefore a large number of potential attackers). Nodes that are not on the public network (like compute nodes) can still access the Internet through our NAT gateway, but they cannot be reached directly from the Internet.

The bigger picture

As you can see in the graph below our own network is available in two different server rooms (at AlbaNova and SciLifeLab) that have a fast connection between them. If possible avoid creating traffic through the parts marked with www, since that involves several routers and in general the bandwidth is quite limited and shared by many. For example if you want to download a larger dataset (say 50 GB) to your home directory (stored on the nethome NFS server), it might be intuitive to just click a download button in the web browser at your workstation, or open a terminal there and fire off wget. But since every single file in your home directory is stored on the NFS server, what happens here is that you download the data over the www link to the workstation only for the workstation to send it back over the same link through several routers to the NFS server. If you instead start the download on the login node you can benefit both from the higher bandwidth that the login node has and even more important the internal connection that it has to the file server. For practical examples have a look at the Data transfer subsection further below.

Figure: overview of network connections

Nethome NFS server

Your home directory (/nethome/<USERNAME>/) is always the starting point no matter if you log in at a workstation, the cluster login node, or run a job on one of the compute nodes. The main advantage of having this directory stored on a network file server is that every single one of these machines accesses the same directory. So no matter where you log in your files and settings are already there, and changes are immediately available everywhere. The downside is that as soon as there are problems with the network connection or the file server your desktop session might freeze and the workstation won’t be usable until connectivity is restored. In that case it is best to wait and check with others in the office and on Zulip if someone is working on fixing the underlying issue. Do not reset the workstation, that does not help at all in these cases and might actually corrupt some of your files (web browser profile for example).

On a similar note, never remove the network cable from a workstation. That would lock up running processes and also make remote administration impossible. If you need to set up a third-party VPN (one particular HPC center comes to mind), please bring it up in the Workstation stream on Zulip. It is important to ensure that the VPN does not interfere with traffic to the file server, otherwise your session will lock up as soon as you activate the VPN connection.

The file server stores the home directories of all users, so it is important to not run I/O intensive tasks on data stored there, because it puts a high load on the server and affects everyone else (e.g. terminal or desktop sessions start to lag). Instead use CephFS and local scratch space for data processing.
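One way to keep I/O away from the file server is to stage data on local scratch, work there, and copy the results back in one go. The following is a minimal sketch, not an official workflow: the function name run_in_scratch and the SCRATCH_ROOT variable are assumptions made for this example, with /scratch taken from the table above as the default.

```shell
# Hypothetical helper: run_in_scratch <input dir> <result dir> <command...>
# SCRATCH_ROOT is an assumed override for the scratch location.
run_in_scratch() {
    local input_dir=$1 result_dir=$2 work_dir status
    shift 2
    # Private working directory on the local disk of this machine.
    work_dir=$(mktemp -d "${SCRATCH_ROOT:-/scratch}/job.XXXXXX") || return 1
    # Stage the input once, then run the command inside the scratch directory.
    cp -a "$input_dir"/. "$work_dir"/ && (cd "$work_dir" && "$@")
    status=$?
    # Copy everything back (inputs included) and clean up,
    # even if the command failed.
    mkdir -p "$result_dir" && cp -a "$work_dir"/. "$result_dir"/
    rm -rf "$work_dir"
    return "$status"
}
```

A call could look like `run_in_scratch /mnt/cephfs/projects/<PROJECTNAME>/input /mnt/cephfs/projects/<PROJECTNAME>/output ./analysis.sh`, where analysis.sh stands for whatever I/O intensive command you need to run.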

The size limit of 250 GB is a soft limit, so you can actually store more than that. A weekly status email with usernames that are above the quota is sent out to the mailing list to help keep track of the total capacity. This email is also sent to each individual user above the quota just in case they are not on the mailing list or do not pay attention to it. If your username shows up in that email it is time to clean up your home directory. If you cannot free up space right away, communicate your plan for doing so. The hard limit is 400 GB. At that point the server will simply reject storing more data which usually leads to login problems and possibly other collateral damage. To see your current usage you can use the command df, e.g.

$ df -h $HOME
Filesystem                                   Size  Used Avail Use% Mounted on
fs.biophysics.kth.se:/export/nethome/user1  400G   39G  362G  10% /nethome/user1
$ df -i $HOME
Filesystem                                     Inodes  IUsed     IFree IUse% Mounted on
fs.biophysics.kth.se:/export/nethome/user1 758196338 346882 757849456    1% /nethome/user1

The first command shows the space used on the storage server, the second command shows the number of inodes (files and directories) used. You should pay attention to the Used and IUsed columns, and stay below 250 GB and 2 million inodes, respectively. Your own software installations, especially conda environments can contain many files, so you should do some spring cleaning there every now and then.
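To find out where the inodes are going, you can count them per top-level entry. This is a sketch that assumes GNU coreutils (the --inodes option of du); the function name inode_hogs is made up for this example.

```shell
# Hypothetical helper: inode_hogs [directory] [number of lines]
# Lists the entries directly below a directory by inode count, largest first.
inode_hogs() {
    local dir=${1:-$HOME} top=${2:-10}
    # -s: one summary line per entry, -x: stay on the same file system.
    find "$dir" -mindepth 1 -maxdepth 1 -exec du --inodes -s -x -- {} + |
        sort -rn | head -n "$top"
}
```

Running `inode_hogs` without arguments prints the ten largest entries in your home directory, hidden directories included.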

Remember that data here is backed up each night and stays on the backup server for several months after deletion (that’s the whole point of backups, isn’t it?). So avoid (ab)using your home directory as temporary storage where you frequently dump and delete large amounts of data, because it will bloat the size of backups for months to come.

Note that the file server compresses data on the fly (compression before writing to disk, decompression before sending to the client). So the numbers you see above regarding used space refer to the size on the storage media. If you copy the data elsewhere (e.g. local scratch, CephFS, etc.) it may require more space than that, depending on the type of data. Here is an example with two tar archives (one compressed with the gzip algorithm, the other one uncompressed):

$ du -h gromacs-2024.2.tar*
59M gromacs-2024.2.tar
41M gromacs-2024.2.tar.gz

The uncompressed archive appears to be only 43% larger than the compressed one, which is odd for a source code archive (i.e. plain text files). The reason is that what you see here is how much space the file takes up on the storage server disks. If you want to get an idea of the real file size, you can add the flag --apparent-size:

$ du -h --apparent-size gromacs-2024.2.tar*
146M    gromacs-2024.2.tar
41M gromacs-2024.2.tar.gz

So the uncompressed archive is actually 256% larger than the one compressed with gzip. That makes more sense.
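Before copying a file or directory off the file server it can be useful to compare both numbers directly. A minimal sketch, assuming GNU du (for --apparent-size); the function name disk_vs_apparent is made up for this example.

```shell
# Hypothetical helper: disk_vs_apparent <file or directory>
# Prints the size on disk and the apparent size, both in KiB.
disk_vs_apparent() {
    local on_disk apparent
    on_disk=$(du -sk -- "$1" | cut -f1)
    apparent=$(du -sk --apparent-size -- "$1" | cut -f1)
    echo "$on_disk $apparent"
}
```

The second number is the better estimate of the space the data will need on a file system without compression.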

CephFS storage

Ceph is a distributed fault-tolerant network storage system. The file system part is accessible on all cluster nodes at the path /mnt/cephfs/. Since the data is stored on multiple servers and there is redundancy, a single server failure does not affect data availability. You can store large amounts of data here, but make sure to discuss this with your PI, since they are charged for the storage space their group uses.

Projects (/mnt/cephfs/projects/)

This directory contains the project volumes. The idea is that you should create as many volumes as you need in order to keep data well organized and separated by project. For Cryo-EM data for example you could create one volume for the raw data and then a second one for data processing. Make sure to enter a description when creating the volume so that someone else can make sense of the data. You may also want to document your data processing with additional files, and add for example the DOI of the paper the data was used for once it is published. When your paper is published, the volume(s) containing the data should be archived, therefore avoid putting data for different projects into the same volume. Nobody wants to sit down and reorganize data when it is time for archival.

To create a new project volume or update an existing one you use the command manage-volumes on the login node. Please read the information displayed by the program carefully. To get an overview of your projects (size, file count, expiration date) you can use the command lsvol. You can also limit the output by providing simple search terms, e.g. “lsvol glic ph” will only show projects that have both terms in the name (the search terms are case-insensitive). To make navigation between project directories easier you can use the command cdvol.

The project volumes are a time-limited storage option. Make sure to check the expiration dates of your volumes regularly (lsvol), and extend if necessary (manage-volumes). Expired volumes will eventually be transferred to tape and removed from CephFS.

You can allow others to store data in your volume by adding them in the manage-volumes dialog. This will however not give them write access to your existing data. If you want to do that you can use the command setfacl. You may be more familiar with the command chmod that sets POSIX permissions (user, group and other). ACLs (access control lists) allow more fine-grained control, so you can actually give access to individual users instead of whole user groups. The following command would set ACLs recursively on all files in a directory, to give a user full access to it (including permissions to delete existing files!):

$ setfacl -R -m user:<USERNAME>:rwx /mnt/cephfs/projects/<PROJECTNAME>/<DIRECTORY>
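A slightly more careful variant is sketched below. It is an illustration rather than the group's recommended procedure: the helper name share_dir and the DRY_RUN switch are made up. It uses rwX instead of rwx, so that execute permission is only granted on directories and on files that are already executable, and it optionally adds a default ACL (-d) so that files created later inherit the same access.

```shell
# Hypothetical helper: share_dir <USERNAME> <DIRECTORY>
# Set DRY_RUN=1 to print the commands instead of running them.
share_dir() {
    local user=$1 dir=$2 run=
    if [ -n "${DRY_RUN:-}" ]; then
        run=echo
    fi
    # Access ACL for everything that already exists below the directory.
    $run setfacl -R -m "user:$user:rwX" "$dir" || return 1
    # Default ACL, inherited by files and directories created later.
    $run setfacl -R -d -m "user:$user:rwX" "$dir"
}
```

With `DRY_RUN=1 share_dir <USERNAME> <DIRECTORY>` you can review the two commands first; `getfacl <DIRECTORY>` shows the resulting ACLs afterwards.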

Databases (/mnt/cephfs/databases/)

Here we can store standardized databases (e.g. UniRef) to be used by several users, instead of having each user keeping their own copy. The data should be stored in a directory named after the release version of the database, e.g. /mnt/cephfs/databases/uniref/uniref30/2023-02/.

Resources (/mnt/cephfs/resources/)

Here we can store things not directly tied to a specific short-term research application. Either related to teaching, generally useful material, or infrastructure such as the cluster. Have a look at the file /mnt/cephfs/resources/README.txt for an overview.

Data transfer

Introduction

As briefly explained in the networks subsection, think about the network traffic and the most efficient way to transfer data before you start. If you sit in front of your workstation and want to get data from a remote location, it might be most intuitive to just fire off rsync and get the data. However, unless you copy to the local scratch directory, this will be rather slow because you will be using a shared low-bandwidth (1 Gbit/s) connection to receive the data only to send it back to the file server over the same connection through several routers. Instead you could ssh to the cluster login node and run the rsync command there. The login node has more bandwidth to external locations and on top of that a dedicated internal connection to the file server. And since data stored in your home directory is available everywhere, you will still be able to look at the data from your workstation if that’s what you want.

rsync

The main advantages of rsync are that transfers can be resumed without losing progress and that it has a large number of options to control aspects of the copying process. A typical command to transfer data from a remote location (accessible via ssh) to a directory inside a project volume could look like this:

$ cd /mnt/cephfs/projects/<PROJECTNAME>
$ rsync -rlpt --info=progress2 user123@remote.example.org:/path/to/data/dir/  ./data_from_remote.example.com/

NB: a trailing slash on the source affects the directory level at the destination (see man rsync)
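The effect of the trailing slash is easiest to see with a small local experiment. The sketch below uses only temporary directories; the function name demo_trailing_slash is made up for this example.

```shell
# Hypothetical demo: copy the same source directory with and without a
# trailing slash and list what ends up at the destination.
demo_trailing_slash() {
    local tmp
    tmp=$(mktemp -d) || return 1
    mkdir -p "$tmp/src/dir" "$tmp/with_slash" "$tmp/without_slash"
    echo data > "$tmp/src/dir/file.txt"
    # With trailing slash: the contents of dir are copied.
    rsync -rlpt "$tmp/src/dir/" "$tmp/with_slash/"
    # Without trailing slash: dir itself is created at the destination.
    rsync -rlpt "$tmp/src/dir" "$tmp/without_slash/"
    (cd "$tmp" && find with_slash without_slash -type f)
    rm -rf "$tmp"
}

# demo_trailing_slash prints:
#   with_slash/file.txt
#   without_slash/dir/file.txt
```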

scp

This works similarly to rsync, but with fewer options, and transfers that have been interrupted have to be restarted from the beginning. When possible use rsync. A typical example could look like this:

$ cd /mnt/cephfs/projects/<PROJECTNAME>
$ scp -r user123@remote.example.org:/path/to/data/dir  ./data_from_remote.example.com

This would copy the contents of the directory dir from the remote system to the directory data_from_remote.example.com inside the current working directory. Note that if the destination directory already exists, scp places dir inside it instead.

sshfs

This should not be used for data transfer! Only use this for convenient access to remote data, e.g. to look at smaller files or to plot data. A useful example might be to mount [1] a CephFS volume at your workstation to edit job scripts or look at results:

$ mkdir -p /mnt/$USER/<DIRNAME>
$ sshfs login.tcblab.org:/mnt/cephfs/projects/<PROJECTNAME> /mnt/$USER/<DIRNAME>

[1] mount = attach a file system at a chosen directory (mount point) in the file system hierarchy. Instead of using a drive letter as in Windows, Linux uses a path to access the mounted file system.

NB: Do not create the mount point in your home directory, as that might make it more difficult to recover from NFS outages.

When you are done, unmount [2] the directory again:

$ fusermount -u /mnt/$USER/<DIRNAME>

[2] unmount = detach a file system from the file system hierarchy. The mount point becomes a regular directory again after unmounting the file system.

Why should this not be used to transfer data? As previously mentioned, the workstations have a relatively slow network connection and no direct connection to the file servers. This would create a lot of unnecessary network traffic and negatively affect others in the office.

If you are trying to use sshfs on your own laptop or home computer you will have to provide your username and set the option for user mapping:

$ sshfs <USERNAME>@login.tcblab.org:/mnt/cephfs/projects/<PROJECTNAME> /path/to/mount/point -o idmap=user

NB: the option is literally “idmap=user”, do not replace user with your username.

tar

For data that you either don’t need to access again or that you can expect to access infrequently you could use tar to pack files into a compressed archive:

$ tar cfz output.tar.gz all.txt your.txt tiny.txt files.txt

One could also use this to copy data from a remote location to an archive, for example like this:

$ ssh user123@remote.example.org tar czf - /path/to/data/directory > remote_data_archive.tar.gz

By setting the filename to - we instruct tar to send the output to stdout instead of storing it in a file. That allows us to receive it on our machine and redirect it to a file there instead of creating an archive on the remote machine.
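Before deleting the original files it is worth checking that the archive is complete. The following sketch packs a directory and compares the number of entries on disk with the number of entries in the archive; the function name pack_and_verify is made up for this example, and the archive should be written outside the directory that is being packed.

```shell
# Hypothetical helper: pack_and_verify <directory> <archive.tar.gz>
# Succeeds only if the archive lists as many entries as the directory holds.
pack_and_verify() {
    local src_dir=$1 archive=$2 n_src n_tar
    # -C keeps the paths in the archive relative to the parent directory.
    tar czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1
    # Entries on disk, counted via NUL separators (safe for odd file names).
    n_src=$(find "$src_dir" -print0 | tr -dc '\0' | wc -c)
    # Entries in the archive; tar prints one line per entry.
    n_tar=$(tar tzf "$archive" | wc -l)
    [ "$n_src" -eq "$n_tar" ]
}
```

A matching count does not prove that the contents are identical, but it catches truncated or interrupted archives.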

globusconnect

HPC centers may require using the Globus connect client for data transfers. By default the client only allows transfers to your home directory. Do not copy data to your home directory only to move it to CephFS (see nethome subsection)! The proper way to do this is to tell the client to use a different directory:

$ ./globusconnectpersonal -start -restrict-paths /mnt/cephfs/projects/<PROJECTNAME>

nice and ionice

When running longer data transfers you may want to do your best to not disturb others. You can do that by adding the following prefix to your command:

$ nice ionice -c 3 <DATA_TRANSFER_COMMAND>

nice sets low priority for CPU scheduling, while ionice -c 3 sets low priority for I/O.
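If you use this prefix often, a small wrapper saves typing. This is a convenience sketch only: the function name polite is made up, and the fallback to plain nice is an assumption for machines where ionice is missing or not permitted.

```shell
# Hypothetical wrapper: polite <command...>
# Runs a command with low CPU priority and, where possible, idle I/O priority.
polite() {
    if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
        nice ionice -c 3 "$@"
    else
        # ionice is unavailable here, so only lower the CPU priority.
        nice "$@"
    fi
}
```

For example, `polite rsync -rlpt --info=progress2 <SOURCE> <DESTINATION>` runs the transfer from the rsync section at low priority.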

Check your network usage with bmon

Sometimes, for practical reasons, you may want to perform analysis from your local workstation on mounted remote file systems. If you do, check that the process you are running locally does not put too much load on the network, since you are accessing non-local data. A practical way to do this is to use bmon:

$ bmon

This tool opens an interface in the console that shows your receive bandwidth (RX) and your transmit bandwidth (TX):

Note that there are different interfaces available. By default bmon will show you the bandwidth used for the loopback interface (lo), which is probably not the one you want to check. To change this, use the up and down arrow keys to select the external interface (in our case enp4s0). Once you have selected the correct interface, you will be able to see your bandwidth consumption during any process that you may run. For example, here is the bandwidth consumption during the copy of a large file:

As you can see, it rises to about 100 MiB/s. This is close to the maximum bandwidth of a 1 Gbit/s connection (1000 Mbit/s divided by 8 is 125 MB/s, or about 119 MiB/s, before protocol overhead and other practical factors). One probably does not want to run processes with this consumption for a long time.
Finally, if you want a different scale for the graphs, you can cycle the timescale from seconds to minutes to hours by pressing <TAB>.
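To put these numbers into perspective, the following sketch converts a link speed to MiB/s and estimates how long a transfer keeps the link busy. Both function names are made up for this example, and the estimate ignores protocol overhead and other traffic.

```shell
# Hypothetical helper: link speed in Gbit/s -> theoretical maximum in MiB/s
link_limit_mib() {
    awk -v gbit="$1" 'BEGIN { printf "%.1f\n", gbit * 1e9 / 8 / 1048576 }'
}

# Hypothetical helper: size in GB and rate in MiB/s -> duration in minutes
transfer_minutes() {
    awk -v gb="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", gb * 1e9 / (rate * 1048576) / 60 }'
}

link_limit_mib 1          # 119.2
transfer_minutes 50 100   # 8
```

So a 50 GB dataset copied at 100 MiB/s saturates a workstation link for about eight minutes, during which everyone else sharing that connection is affected.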