Synopsis

Here we summarize important changes for existing Cartesius users who are starting to work on Snellius. It contains information on the data migration of user files, the software environment, the batch system, and more. This page lists the main differences between Snellius and Cartesius, and refers to the full documentation where applicable.

Known issues

Snellius is a new system, and you may encounter issues in the first few weeks of usage. Some of these issues are known to us and are in the process of being fixed. Please see the separate page Snellius known issues for the issues we are already aware of, which you do not need to report to the Service Desk.


Data migration: check if all your files and directories are present

As described in the Cartesius to Snellius migration page, SURF has migrated all relevant user data from Cartesius to Snellius for you.

Please check that all the files in your home directories and in the project spaces assigned to you (if any) have been migrated correctly. If you find that files are missing, please contact our Service Desk right away.
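For a quick first check you can, for example, compare file counts and total sizes with what you had on Cartesius. A minimal sketch (the project space path is purely illustrative; substitute your own):

# Count the files and report the total size of your home directory
find $HOME -type f | wc -l
du -sh $HOME

# The same check for a project space (example path; replace with your own project space)
find /projects/0/yourproject -type f | wc -l
du -sh /projects/0/yourproject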

Home directories

The home directory of a Cartesius login was migrated if the following two conditions both applied:

  • The login is associated with an SBU account that was active on, or after, the Cutoff Date of 1 June 2021 (i.e. did not expire before the Cutoff Date).
  • A valid Usage Agreement for the login exists and has been accepted (or duly renewed) by the person to whom the login was handed out. This agreement can be reviewed and accepted here.

Cartesius home backup available only until 31 December 2021

Note that a daily backup service is maintained for home directories, both on Cartesius and Snellius. Offline backups of Cartesius home directories, including those of directories that were not migrated, will be kept until 31 December 2021. Consequently, non-migrated home directories will become unavailable and non-restorable after 31 December 2021.

Because Snellius and Cartesius use different filesystem block sizes, the disk occupancy of the files in your home directory on Snellius may exceed the 200GB quota. In that case, trying to create new files will result in an error of the type:

error: file write error: Disk quota exceeded

Please try to free up space by removing unused files and/or creating compressed archives (you can use the scratch filesystem to temporarily store the archived files before cleaning up your home).
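A minimal sketch of such a clean-up (the directory name and scratch path are illustrative; adapt them to your own situation):

# Create a compressed archive of a home subdirectory on the scratch filesystem
mkdir -p /scratch-shared/$USER
tar -czf /scratch-shared/$USER/old-results.tar.gz -C $HOME old-results

# Verify the archive can be read back, then remove the original from your home
tar -tzf /scratch-shared/$USER/old-results.tar.gz > /dev/null && rm -rf $HOME/old-results

# Check the resulting disk usage of your home directory
du -sh $HOME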


Project spaces

A project space was migrated to Snellius if the following two conditions both applied to at least one member login of the group co-owning the project space:   

  • The login is associated with an SBU account that was active on, or after, the Cutoff Date (i.e. did not expire before the Cutoff Date).
  • A valid Usage Agreement for the login exists and has been accepted (or duly renewed) by the person to whom the login was handed out. This agreement can be reviewed and accepted here.

The group co-owning the project space is the group of logins that share the allocated disk quota and have read and write access to the project space root.

Note that for project spaces no backup service is in place, as project space is a user-managed scratch resource, not a data-preservation resource. Project space data that is not covered by the above was not migrated to Snellius and will become unavailable as soon as Cartesius is taken offline.


Scratch spaces

Scratch - Possible Data Loss

Files that resided on scratch filesystems of Cartesius were not migrated to Snellius.


Non-native file systems

Archive

The migration only pertained to data on native Cartesius filesystems. In particular, data associated with the same login, but residing on the SURF Data Archive facility, are not affected in any way. 

Porting your environment to Snellius: hidden files and directories

In the home directories of Cartesius users that have been migrated to Snellius, there are a number of files and subdirectories that play a crucial role in the setup of the login-specific environment and the configuration of particular applications. Obvious examples are files like ".bashrc", ".cshrc", and ".vimrc", and directories like ".ssh" or ".matlab". More generally, the files and directories that play such a constituting and customising role for the personal user environment have a name and location with the following characteristics:

  • They are located directly in the root of your home directory
  • Their name starts with a dot character ("."), e.g. ".bashrc". This makes them "hidden" directory entries

Since Snellius is different from Cartesius, a substantial part of the contents of those files will not work correctly on Snellius, or will lead to unexpected, erratic, or unwanted behaviour. As these files need to be adapted to the new Snellius environment, SURF applies the following modifications to your home directory contents after the final data migration synchronisation run:

  1. A subdirectory, named "CARTESIUS.hidden-directory-entries", is created in the root of your home directory.
  2. All files and directories directly in the root of your home directory that have a name starting with a "." are subsequently moved into the newly added subdirectory created in step 1.

    There is one exception to this rule: Since logins associated with a PRACE account can only log in by means of ssh key-based (passwordless) authentication, the ".ssh" subdirectory of their home directory will NOT be moved. For non-PRACE users the ".ssh" directory is moved into the "CARTESIUS.hidden-directory-entries" directory.

  3. For a small number of environment bootstrapping files (".bash_profile", ".bashrc", ".cshrc"), a standardized version with minimal contents, suited for Snellius, is created in your home directory for you.

Further customisation of hidden files and directories is up to the user, especially for the files mentioned in step 3. You can, of course, port (elements of) scripts from the CARTESIUS.hidden-directory-entries directory back into scripts in the root of your Snellius home directory.

Note also that if you have locally built modules on Cartesius with eblocalinstall, these will be located in the directory .local/easybuild within the CARTESIUS.hidden-directory-entries directory. See also the section on locally built modules below.

At first glance the CARTESIUS.hidden-directory-entries subdirectory may appear to be empty, because the entries that have been moved into it are by definition "hidden". To list them, you must use a command that shows hidden files, such as "ls -la".
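For example (the ".vimrc" file in the second command is just an illustration of porting a single entry back):

# List the hidden entries that were moved (the -a flag shows dotfiles)
ls -la ~/CARTESIUS.hidden-directory-entries

# Copy a single configuration file back into the root of your Snellius home directory
cp ~/CARTESIUS.hidden-directory-entries/.vimrc ~/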

In the highly unlikely case that the name "CARTESIUS.hidden-directory-entries" conflicts with a migrated, and therefore already existing, entry in your home directory, the name of the new subdirectory is slightly modified, extended with a short suffix, to resolve the name conflict.

Updated module environment

On Snellius the environment module system has been updated to Lmod. This brings changes in the way modules are loaded and used on Snellius. Full information can be found in our Environment Modules tutorial, or in the section Loading modules of our HPC User Guide on how to write job scripts.

No default module versions

One main difference with Cartesius is that default module versions are not set on Snellius, so you need to specify the full module name (including the version) in order to properly load a module. For example:

# Two different module versions are available for HDF5
[paulm@int3 ~]$ module avail HDF5

------------------------------------------------------------------------------------------- /sw/arch/Centos8/EB_production/2021/modulefiles/data --------------------------------------------------------------------------------------------
   HDF5/1.10.7-gompi-2021a    HDF5/1.10.7-iimpi-2021a

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

# Can't load a module without specifying the exact version you want to load
[paulm@int3 ~]$ module load HDF5
Lmod has detected the following error:  These module(s) or extension(s) exist but cannot be loaded as requested: "HDF5"
   Try: "module spider HDF5" to see how to load the module(s).

# Load module based on full version number
[paulm@int3 ~]$ module load HDF5/1.10.7-gompi-2021a
[paulm@int3 ~]$ 

Searching/listing modules

To search for a specific module by name, the module avail (abbreviated module av) and module spider commands are available. If you don't provide any options, the module avail command will list all available modules. In Lmod the search matches not just the module name (e.g. GCC), but the full module version string. This allows more flexible searching, but may also return a lot of results:

[paulm@int3 ~]$ module av Python

------------------------------------------------------------------------------------------- /sw/arch/Centos8/EB_production/2021/modulefiles/tools -------------------------------------------------------------------------------------------
   IPython/7.25.0-GCCcore-10.3.0

------------------------------------------------------------------------------------------- /sw/arch/Centos8/EB_production/2021/modulefiles/lang --------------------------------------------------------------------------------------------
   Python/2.7.18-GCCcore-10.3.0-bare    Python/3.9.5-GCCcore-10.3.0-bare    Python/3.9.5-GCCcore-10.3.0

------------------------------------------------------------------------------------------- /sw/arch/Centos8/EB_production/2021/modulefiles/devel -------------------------------------------------------------------------------------------
   pkgconfig/1.5.4-GCCcore-10.3.0-python


[paulm@int3 ~]$ module av 6.2.1

------------------------------------------------------------------------------------------- /sw/arch/Centos8/EB_production/2021/modulefiles/math --------------------------------------------------------------------------------------------
   GMP/6.2.1-GCCcore-10.3.0


# Many results, due to matching of "GCCcore"
[paulm@int3 ~]$ module av GCC

----------------------------------------------------------------------------------------- /home/paulm/.local/easybuild/Centos8/2021/modulefiles/all -----------------------------------------------------------------------------------------
   freeglut/3.2.1-GCCcore-10.3.0    glew/2.1.0-GCCcore-10.3.0    Mesa-demos/8.4.0-GCCcore-10.3.0

------------------------------------------------------------------------------------------- /sw/arch/Centos8/EB_production/2021/modulefiles/phys --------------------------------------------------------------------------------------------
   UDUNITS/2.2.28-GCCcore-10.3.0

...

You can use a slash ("/") in the search string to match only the package name part. For example, to list only the available versions of the GCC module:

[paulm@int3 ~]$ module av GCC/

----------------------------------------------------------------------------------------- /sw/arch/Centos8/EB_production/2021/modulefiles/compiler ------------------------------------------------------------------------------------------
   GCC/10.3.0 (L)

  Where:
   L:  Module is loaded

The module avail command does a case-insensitive search, while the module load and module unload commands are case-sensitive with respect to the module name provided.
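A short sketch of the difference, using the HDF5 module from the earlier example:

# Case-insensitive search: both of these find the HDF5 modules
module avail HDF5
module avail hdf5

# Loading is case-sensitive: the exact module name, including capitalisation, is required
module load HDF5/1.10.7-gompi-2021a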

Module spider

The module spider command is an alternative and can be used to get more detailed information on a specific module. For example, for the Python module we can show its description and list all available versions:

[paulm@int3 ~]$ module spider Python

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Python:
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
      Python is a programming language that lets you work more quickly and integrate your systems more effectively.

     Versions:
        Python/2.7.18-GCCcore-10.3.0-bare
        Python/3.9.5-GCCcore-10.3.0-bare
        Python/3.9.5-GCCcore-10.3.0
     Other possible modules matches:
        IPython

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  To find other possible module matches execute:

      $ module -r spider '.*Python.*'

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "Python" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider Python/3.9.5-GCCcore-10.3.0
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Module dependencies

Another difference is that dependencies between modules are tracked more accurately: when you unload a module, all its dependencies will (where possible) be unloaded as well. This means you no longer have to use module purge in certain cases.
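A sketch of this behaviour, again using the HDF5 module from the earlier example (the exact set of dependency modules shown by module list will differ per module):

# Loading HDF5 also loads the modules it depends on (compiler, MPI, etc.)
module load 2021
module load HDF5/1.10.7-gompi-2021a
module list

# Unloading HDF5 also unloads the dependencies that are no longer needed
module unload HDF5/1.10.7-gompi-2021a
module list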


The new software stack (2021)

The modules environment on Snellius (and Lisa) provides the new 2021 modules collection. It contains recent versions of many of the software packages that were available in the 2020 environment on Cartesius.
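The collection is loaded first, after which its packages can be loaded (the Python module name below is taken from the examples above; this is the same pattern used in the job scripts further down this page):

# Load the 2021 software collection, then load packages from it
module load 2021
module load Python/3.9.5-GCCcore-10.3.0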

Note that the 2020 and 2019 collections that were available on Cartesius are not available on Snellius. This is due to differences between Snellius and Cartesius that would make it very time-consuming to install and test all the software packages of the previous environments:

  • Snellius contains (mostly) nodes with AMD CPUs, versus Intel CPUs on Cartesius
  • Snellius uses a different operating system compared to Cartesius

Locally built modules

It is possible, through the EasyBuild system that is available on Snellius, to build your own modules, for example if you want a different version of a module than is available in the global modules environment, or if you want to add a package that we do not provide globally. Installing local modules was already possible on Cartesius, but modules you built on Cartesius will not work on Snellius due to the CPU and OS differences. Those locally installed modules therefore need to be rebuilt on Snellius using EasyBuild. See the EasyBuild tutorial for general instructions.

Existing locally installed modules you built on Cartesius will be located in your migrated home directory on Snellius in CARTESIUS.hidden-directory-entries/.local/easybuild/RedHatEnterpriseServer7 (as per the porting environment section above). It is recommended to start over fresh and locally build the modules on Snellius that you need with eblocalinstall (which is part of the eb module), instead of trying to reuse files from this directory.

After locally installing the modules you need, you can completely remove the RedHatEnterpriseServer7 directory.
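A minimal sketch of this workflow, assuming an illustrative easyconfig name (see the EasyBuild tutorial for the exact usage of eblocalinstall):

# Load the 2021 environment and the eb module that provides eblocalinstall
module load 2021
module load eb

# Rebuild a package locally for Snellius (the easyconfig name is only an example)
eblocalinstall SomePackage-1.0-GCCcore-10.3.0.eb

# Once everything you need has been rebuilt, the old Cartesius builds can be removed
rm -rf ~/CARTESIUS.hidden-directory-entries/.local/easybuild/RedHatEnterpriseServer7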

SURF(sara) compiler wrappers

These are no longer used on Snellius. When you load, say, GCC/10.3.0, the gcc command will simply refer to /sw/arch/Centos8/EB_production/2021/software/GCCcore/10.3.0/bin/gcc, instead of to a wrapper script.
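You can verify this yourself after loading the compiler module:

# gcc now points directly to the compiler provided by the module, not to a wrapper script
module load 2021
module load GCC/10.3.0
which gcc
# -> /sw/arch/Centos8/EB_production/2021/software/GCCcore/10.3.0/bin/gcc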

The batch system

On Snellius the same SLURM scheduling system as on Cartesius is used to submit and control user jobs. There are some differences in its setup, though.

Updated partitions

The partitions available on Snellius are described in the Snellius usage and accounting page. Some partition names have been changed to reflect the different types of compute nodes; please check the table there for more information on each partition's configuration.

On Cartesius each partition had a "..._short" version that was limited to jobs of up to 1 hour walltime, for example "gpu" and "gpu_short". On Snellius "short" jobs with a walltime of at most 1 hour no longer have to be submitted to a separate partition. In fact, "..._short" partitions are not available anymore. For the "thin", "fat" and "gpu" partitions the SLURM scheduler always keeps a number of nodes available that can only run short jobs. This provides the benefit of the "short" partitions of Cartesius without having to explicitly submit a job to a different partition.

Since the partitions are now homogeneous in terms of the hardware they contain, there is no longer a reason to use constraints to select specific CPU types, as was needed on Cartesius.
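For example, a short GPU job is now submitted directly to the regular "gpu" partition (the job script name is illustrative):

# On Cartesius a job like this would have gone to the "gpu_short" partition;
# on Snellius the regular "gpu" partition handles short jobs as well
sbatch -p gpu -t 00:30:00 my_gpu.job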

Shared jobs and single-node jobs

On Snellius, single-node jobs from different users can execute simultaneously on the same node; these are known as "shared jobs". This is partly needed to achieve efficient node usage, given the larger number of CPU cores in the Snellius nodes compared to Cartesius. From an accounting point of view, you will be charged only for the resources allocated to your job. Check the "Shared usage accounting" section of the Snellius usage and accounting page for more information.

Jobs that only use a single node (either by using "-N 1", or when not specifying the number of nodes) are now started as shared jobs by default. SLURM will also warn you about this:

# Single node (-N 1) is implicit, shared job
[paulm@int3 ~]$ sbatch -t 1:00 hostname.job
sbatch: Single node jobs run on a shared node by default. Add --exclusive if you want to use a node exclusively.
Submitted batch job 2070

# Single node exclusive job
[paulm@int3 ~]$ sbatch -t 1:00 --exclusive hostname.job
Submitted batch job 2071

If you want to make sure your job is the only one running on a node (known as "exclusive use"), you can use the --exclusive option with sbatch.

Resources on a shared node are "partitioned" by the system using cgroups, so that you can access and see only the resources allocated to your job (e.g. if you request 1/4 of a thin node, you will have access to 32 cores and 64 GiB of memory). This does not apply to memory bandwidth, which is not partitioned and is shared between jobs on the same node. If your code makes intensive use of memory, you may want to consider exclusive node usage (even though you are using only part of the cores on the node), since memory access could be the limiting factor for the performance of your application.

Note that the allocated resources (the full node, or part of it) will always be accounted and subtracted from your budget, regardless of the resources actually used by the job. For example, if your job only uses 32 out of the 128 CPU cores in a node but you submit it as an exclusive job, it will be accounted for the full 128 cores. So one of the benefits of shared jobs is that you can reduce your budget usage by requesting only part of a node.

Minimum shared job size

Even though a shared job only uses part of a node, there are limits in place on the minimum size of a shared job. For example, you cannot allocate a job that uses only a single CPU core; a job needs to allocate at least 1/4th of the CPU cores in a node. See the Snellius usage and accounting page for the limits per partition and details on the accounting of shared jobs.
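For example, a quarter of a 128-core thin node could be requested with a job script like the sketch below ("my_program" is a placeholder; the exact per-partition limits are listed on the usage and accounting page):

#!/bin/sh
#SBATCH -p thin
#SBATCH -N 1
#SBATCH --ntasks=32
#SBATCH -t 01:00:00

# This shared job gets a quarter of the node: 32 cores and the corresponding 64 GiB of memory
srun my_program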

Environment no longer exported into the job

On Cartesius, whenever you submitted a batch job with sbatch, the current shell environment available to the sbatch command was copied into the job. This could lead to unexpected results that were sometimes hard to debug, as the job script by itself did not fully specify the environment in which the job would run (the environment at the time sbatch was called also influenced the job).

Therefore, on Snellius this has changed. By default environment variables are no longer exported into the job.

  • This change only applies to exporting variables from the current shell. A job script will usually start a new shell instance (e.g. /bin/bash), which will also set variables during its startup, such as those from the .bashrc configuration file.
  • A second thing to note about the new behaviour is that currently loaded modules are also not exported into the job (as loading a module, for the most part, sets or updates a number of environment variables).

An example of the new behaviour:

# A simple job script that prints the current shell environment
snellius paulm@int3 16:23 ~$ cat env.job 
#!/bin/sh
#SBATCH -t 00:05:00
#SBATCH -p thin
env

# We set a variable MYVAR in the current shell and submit the job script
snellius paulm@int3 16:23 ~$ export MYVAR=1
snellius paulm@int3 16:23 ~$ sbatch env.job 
sbatch: Single node jobs run on a shared node by default. Add --exclusive if you want to use a node exclusively.
Submitted batch job 31744
# But the variable MYVAR is not available in the job's environment
snellius paulm@int3 16:23 ~$ grep MYVAR slurm-31744.out 
snellius paulm@int3 16:24 ~$

Best practice

In order to have an environment variable and/or module available in a job, it is best to define these in the job script itself. This isolates the job script from any current environment variables and loaded modules. An example:

snellius paulm@int3 17:00 ~$ cat var.job
#!/bin/sh
#SBATCH -t 00:05:00
#SBATCH -p thin
module load 2021
module load Python/3.9.5-GCCcore-10.3.0

export MYVAR=1

# Use the MYVAR variable
python -c 'import sys, os; print("MYVAR = %s" % os.environ["MYVAR"])'

snellius paulm@int3 17:01 ~$ export MYVAR=9999
snellius paulm@int3 17:01 ~$ sbatch var.job 
sbatch: Single node jobs run on a shared node by default. Add --exclusive if you want to use a node exclusively.
Submitted batch job 31921
snellius paulm@int3 17:01 ~$ grep MYVAR slurm-31921.out 
MYVAR = 1
snellius paulm@int3 17:01 ~$ 

Previous behaviour

If you want to get back the old behaviour (i.e. export all environment variables into the job) then you can use the --export=ALL option:

# We submit the exact same job script, with MYVAR still set
snellius paulm@int3 16:24 ~$ sbatch --export=ALL env.job 
sbatch: Single node jobs run on a shared node by default. Add --exclusive if you want to use a node exclusively.
Submitted batch job 31748
# Now, the variable MYVAR *is* available in the job script
snellius paulm@int3 16:24 ~$ grep MYVAR slurm-31748.out
MYVAR=1
snellius paulm@int3 16:24 ~$
