You can apply through our Service Desk, via the link "Small Compute applications (NWO)". These requests will only be assessed on technical feasibility by SURF and are usually handled within 2 weeks. Within this call, it is possible to request:

Service: Snellius
  • Maximum of 1,000,000 CPU/GPU SBUs
  • Maximum of 10 TiB of project space
  • Maximum of 50 TiB of offline tape storage

By default:

  • 200 GiB of home directory storage
  • Up to 8 TiB of shared scratch space
  • 4 hours of support
  • Project duration of 1 year

Each applicant may submit one application per calendar year.

The resources you request need to be justified. You must detail how many SBUs you require and how you plan to use them. The parameters above are only the maximum limits of an application and are not granted by default; the resources of each application are tailored to the project. To ensure a smooth and fast application process, we have provided a template that you can use when you apply for a small request on Snellius. Please edit the template to fit your project.


Small Application Template

  • Describe the scientific purpose/background of the compute time application. This leads to the next point.
  • Provide a justification of the amount of requested SBUs in the form of a simple estimation. Something like "I'll need to do X (runs) using Y (node hours per run), at a cost of Z (SBUs per node hour), i.e. totalling X*Y*Z = XXX SBUs" is sufficient. Please note: node hours = number of nodes (can be a fraction) * wall time (in hours). If you do different types of runs, please specify X/Y/Z/XXX for each type of run, as well as the total number of SBUs. More information on the SBUs per node hour (Z) can be found under the Accounting sections here.
  • Will your computations have a high memory requirement? If yes, can you make an estimation? High-memory runs cost more SBUs per run, so please adjust the SBUs per node hour (Z) in the above point accordingly. For hardware information, refer to our pages for Snellius.
  • What are your storage requirements? What are the sizes of your input, intermediate and output datasets respectively? On Snellius, by default, you get 8 TiB of shared scratch space and 200 GiB of home directory storage per login. The shared scratch space is cleared every 14 days, so you will need to either move the relevant data to your home directory or download it to your local storage. Your home directory storage is backed up. We recommend running all your simulations from the shared scratch space and NOT from the home directory, because the scratch space is a parallel file system that supports much higher I/O throughput.
  • After considering the above defaults, please consider whether you really need online storage or project space. Project spaces are NOT meant to be used as long-term storage and are NOT backed up. It is therefore the user's responsibility to back up the data in a timely manner. SURF is not liable for data lost from project spaces, for example after a crash.
  • Can your software application/applications run in a parallel manner? What type/types of parallelism do they use? Can you run multiple applications concurrently if you are using less than 32 cores per run on Snellius?
  • Each job on Snellius can run for a maximum of 5 days. If your application needs more than 5 days, does it have a restart workflow?
  • If the default storage mentioned in the 4th point above does not satisfy your storage needs, you might need a project space. If you are requesting project space on Snellius: What are your needs in terms of storage? What is the typical input and output size of your computations? Do you require long-term storage? Please have a look at the Snellius configuration page to check the differences between the different file systems on Snellius, and explain why you need the requested project space (permanent, non-backed-up space on high-volume, high-throughput file systems that support parallel I/O). Do you have access to local storage where you will be able to copy your data back after the expiration of the project?
  • Do you have specific software needs? Please be very specific, down to the exact version numbers, and check whether the software is already present on our system. If it is not present, do you need extra assistance to install it? Please refer to our software policy for requests with consultancy hours.
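
The SBU estimation asked for in the second point can be sketched in a few lines of Python. This is only an illustration: the run types, run counts, node counts and the rate of 128 SBUs per node hour below are hypothetical placeholders, not actual Snellius rates. Take the real value of Z for your partition from the Accounting sections.

```python
# Minimal sketch of the X * Y * Z estimate.
# All numbers below are hypothetical placeholders, not actual Snellius rates.

def node_hours(nodes: float, wall_time_hours: float) -> float:
    """Node hours = number of nodes (can be a fraction) * wall time in hours."""
    return nodes * wall_time_hours


def sbus_for_run_type(runs: int, node_hours_per_run: float,
                      sbus_per_node_hour: float) -> float:
    """Total SBUs for one type of run: X * Y * Z."""
    return runs * node_hours_per_run * sbus_per_node_hour


# Hypothetical run types: (X runs, Y node hours per run, Z SBUs per node hour)
run_types = {
    "preprocessing": (20, node_hours(0.25, 2.0), 128.0),
    "simulation": (50, node_hours(2.0, 12.0), 128.0),
}

per_type = {name: sbus_for_run_type(*params) for name, params in run_types.items()}
total_sbus = sum(per_type.values())

for name, sbus in per_type.items():
    print(f"{name}: {sbus:,.0f} SBUs")
print(f"total: {total_sbus:,.0f} SBUs")
```

Listing each run type separately, as done here, gives the per-type X/Y/Z/XXX values as well as the total that the template asks for.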

Example application 1: Machine Learning

To make the above a bit more concrete, we include here an example application for an (imaginary) machine learning project. Note how the Project description lists the work to be done in the project. The Project requirements translate that description of work into an overview of technical requirements (how many jobs, how many resources each job requires, etc.), and this in turn justifies the total Resources that are requested.

Training a Neural Network to detect pathologies in X-ray images

Project description:

In this project, we aim to detect pathologies in chest X-rays using neural networks.
For this purpose, we explore two neural network architectures (a ResNeXt and EfficientNet architecture).

We train on the CheXpert dataset, which contains 224,316 images.

For each network, we will do hyperparameter optimization using a grid-search approach.
We will explore two optimizers (Adam and a traditional momentum optimizer), 5 learning rates (0.001, 0.002, 0.005, 0.01 and 0.02) and 3 batch sizes (8, 16 and 32), for a total of 2*5*3 = 30 hyperparameter settings per network architecture.
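
The grid described above can be enumerated as follows. This is a minimal sketch to show where the count of 30 settings per architecture comes from; the variable names and string labels are our own and do not refer to any particular training framework.

```python
from itertools import product

# Hyperparameter grid from the project description (labels are illustrative).
architectures = ["ResNeXt", "EfficientNet"]
optimizers = ["Adam", "momentum"]
learning_rates = [0.001, 0.002, 0.005, 0.01, 0.02]
batch_sizes = [8, 16, 32]

# Every combination of optimizer, learning rate and batch size.
settings = list(product(optimizers, learning_rates, batch_sizes))

# One training run per architecture and hyperparameter setting.
runs = [(arch, *setting) for arch, setting in product(architectures, settings)]

print(f"{len(settings)} settings per architecture, {len(runs)} runs in total")
```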

Project requirements:

Compute:
Each run will take an estimated 100 epochs to converge. We have run a small test on an NVIDIA GeForce 1080Ti GPU. For both ResNeXt and EfficientNet, it took 2 hours to run 1 epoch, i.e. 200 hours for 100 epochs. We estimate that the A100 GPUs in Snellius are approximately two times faster. Thus, a single run would take an estimated 100 hours to complete on a single A100 GPU. Since a single GPU in Snellius costs 128 SBU/hour, training a single network with a single hyperparameter setting costs an estimated 100h*128 SBU/h = 12,800 SBU. With 30 hyperparameter settings for each of the two neural networks, we need to do 60 runs in total. To allow for trial and error, we request an additional 5%. Thus, the total requested amount of compute is 60 * 12,800 * 1.05 = 806,400 SBU (on the GPU partition).
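
For reference, the compute estimate above can be reproduced step by step. The sketch below only restates the numbers from this example application; the variable names are our own, and the factor-two speedup is the applicant's estimate rather than a measured value.

```python
# Reproduces the compute estimate of the example application.
epochs_per_run = 100
hours_per_epoch_1080ti = 2        # measured in the small test
a100_speedup = 2                  # applicant's estimate, not a benchmark
sbu_per_gpu_hour = 128            # rate quoted in the example
runs = 2 * 30                     # 2 architectures * 30 hyperparameter settings
margin_percent = 5                # allowance for trial and error

hours_per_run = epochs_per_run * hours_per_epoch_1080ti / a100_speedup
sbu_per_run = hours_per_run * sbu_per_gpu_hour
total_sbu = runs * sbu_per_run * (100 + margin_percent) / 100

print(f"hours per run: {hours_per_run:.0f}")
print(f"SBU per run:   {sbu_per_run:,.0f}")
print(f"total SBU:     {total_sbu:,.0f}")
```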

Memory:
Test runs on a 1080Ti showed that a batch size of 8 fits in the 10 GB of GPU memory of the 1080Ti. Thus, we expect no problems running with a four times larger batch size of 32 on the A100 in Snellius, since the memory requirement will at most be four times larger, and will thus fit the 40 GB of GPU memory of an A100. We have no special requirements for CPU memory.
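
The memory argument can be written down as a simple upper bound. This sketch assumes, as the text does, that GPU memory use grows at most linearly with the batch size; the variable names are our own.

```python
# Upper bound on GPU memory for the largest batch size,
# assuming memory use grows at most linearly with batch size.
measured_batch_size = 8
measured_memory_gb = 10     # the test run fit within this amount
target_batch_size = 32
a100_memory_gb = 40

scale = target_batch_size / measured_batch_size
upper_bound_gb = measured_memory_gb * scale
fits_on_a100 = upper_bound_gb <= a100_memory_gb

print(f"upper bound: {upper_bound_gb:.0f} GB, fits on A100: {fits_on_a100}")
```

Because the model weights do not grow with the batch size, the actual requirement is likely to be below this bound.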

Storage:
We need to store three items:

  • The CheXpert dataset (440 GB, 224,316 files)
  • Intermediate files (checkpoints, logs) of the training runs (10 GB per run, 50 files per run)
  • Final checkpoint & logs of each training run (1 GB per run, 10 files per run)

The CheXpert dataset, final model checkpoints, and logs of each training run will need to be stored for the duration of the project. The intermediate files generated during a run are temporary and can be removed after some initial analysis. We expect that we will not need to store the intermediate files for more than 10 runs at any given time. The final checkpoints and logs of all 60 runs take 60 * 1 GB = 60 GB. Thus, we need a total of 440 GB + 10 * 10 GB + 60 GB = 600 GB of storage to store approximately 225,000 files. We therefore request 1 TB of Snellius project space (the minimum project space size).
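
The storage estimate, including the file count, can be checked in the same way. The sketch below only restates the numbers from this example; the variable names are our own.

```python
# Reproduces the storage and file-count estimate of the example application.
dataset_gb, dataset_files = 440, 224_316
intermediate_gb_per_run, intermediate_files_per_run = 10, 50
final_gb_per_run, final_files_per_run = 1, 10

total_runs = 60
max_runs_with_intermediates = 10   # intermediate files kept for at most 10 runs

total_gb = (dataset_gb
            + max_runs_with_intermediates * intermediate_gb_per_run
            + total_runs * final_gb_per_run)
total_files = (dataset_files
               + max_runs_with_intermediates * intermediate_files_per_run
               + total_runs * final_files_per_run)

print(f"storage: {total_gb} GB, files: {total_files:,}")
```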

Software:
We aim to use the PyTorch installation from the module environment to perform this training.

Resources:

  • Resources: Snellius
  • GPU Snellius: Yes
  • SBU Snellius: 806,400
  • Terabyte Project Space Snellius: 1TB