You can apply through our Service Desk, via the link "Small Applications". These requests are assessed by SURF on technical feasibility only, and are usually handled within 2 weeks. The scope of these applications is listed in the table below.
It is important to understand that the resources you request need to be justified: you need to detail how many SBUs you require and how you plan to use them. These parameters are only the maximum limits of an application and are not granted by default; the resources of each application are tailored per project. To ensure a smooth and fast application process, we have provided a template that you can use when you apply for a small request on either Snellius or Lisa. Please edit the template to fit your project.
Example application 1: training a neural network to detect pathologies in X-ray images
To make the above a bit more concrete, we include an example application for an (imaginary) machine learning project. Note how the Project description lists the work to be done in the project. The Project requirements translate that description of work into an overview of technical requirements (how many jobs, how many resources each job requires, etc.), and this in turn justifies the total Resources that are requested.
Project description:
In this project, we aim to detect pathologies in chest X-rays using neural networks. For this purpose, we explore two neural network architectures (a ResNeXt and EfficientNet architecture). We train on the CheXpert dataset, which contains 224,316 images. For each network, we will do hyperparameter optimization using a grid-search approach. We will explore two optimizers (ADAM and a traditional momentum optimizer), 5 learning rates (0.001, 0.002, 0.005, 0.01 and 0.02) and 3 batch sizes (8, 16, 32), for a total of 2*5*3 = 30 hyperparameter settings per network architecture.
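The size of the grid search above can be verified by enumerating it. The sketch below is illustrative (the parameter names are our own, not part of the application form), using only the values stated in the description:

```python
from itertools import product

# Hyperparameter grid from the project description above.
optimizers = ["adam", "momentum"]                    # 2 optimizers
learning_rates = [0.001, 0.002, 0.005, 0.01, 0.02]   # 5 learning rates
batch_sizes = [8, 16, 32]                            # 3 batch sizes

# One entry per hyperparameter setting, per network architecture.
grid = list(product(optimizers, learning_rates, batch_sizes))
print(len(grid))  # 2 * 5 * 3 = 30 settings per architecture
```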
Project requirements:
Compute:
Each run will take an estimated 100 epochs to converge. We have run a small test on an NVIDIA GeForce 1080Ti GPU. For both ResNeXt and EfficientNet, it took 2 hours to run 1 epoch. We estimate that the A100 GPUs in Snellius are approximately two times faster. Thus, a single run would take an estimated 100 hours to complete on a single A100 GPU. Since a single GPU in Snellius costs 128 SBU/hour, training a single network with a single hyperparameter setting costs an estimated 100h*128 SBU/h = 12,800 SBU. With 30 hyperparameter settings for each of the two neural networks, we need to do 60 runs in total. To allow for trial and error, we request an additional 5%. Thus, the total requested amount of compute is 60 * 12,800 * 1.05 = 806,400 SBU (on the GPU partition).
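The compute estimate above can be written out as a small calculation. This is a sketch of the arithmetic in the paragraph, not a SURF tool; all figures (epoch time, speedup factor, SBU rate, margin) are taken directly from the text:

```python
# SBU estimate for the GPU partition, using the figures stated above.
epochs = 100
hours_per_epoch_1080ti = 2      # measured in a small test run
a100_speedup = 2                # assumed: A100 ~2x faster than a 1080Ti
hours_per_run = epochs * hours_per_epoch_1080ti / a100_speedup  # 100 h

sbu_per_gpu_hour = 128          # cost of a single Snellius GPU
sbu_per_run = hours_per_run * sbu_per_gpu_hour                  # 12,800 SBU

runs = 2 * 30                   # 2 architectures x 30 hyperparameter settings
margin = 1.05                   # 5% extra for trial and error
total_sbu = runs * sbu_per_run * margin
print(int(total_sbu))           # 806400
```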
Memory:
Test runs on a 1080Ti showed that a batch size of 8 fits in the 10 GB of GPU memory of the 1080Ti. Thus, we expect no problems running with a four times larger batch size of 32 on the A100 in Snellius: the memory requirement will be at most four times larger, and will thus fit in the 40 GB of GPU memory of an A100. We have no special requirements for CPU memory.
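The worst-case memory reasoning above amounts to a linear scaling estimate. A minimal sketch, assuming memory use scales at most linearly with batch size (as the paragraph argues):

```python
# Worst-case GPU memory estimate when scaling up the batch size.
tested_batch_mem_gb = 10        # batch size 8 fit in 10 GB on the 1080Ti
batch_scale = 32 / 8            # largest batch is 4x the tested one
worst_case_gb = tested_batch_mem_gb * batch_scale  # at most 40 GB

a100_mem_gb = 40                # GPU memory of a Snellius A100
print(worst_case_gb <= a100_mem_gb)  # True: the largest batch should fit
```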
Storage:
We need to store three items:
- The CheXpert dataset (440 GB, 224,316 files)
- Intermediate files (checkpoints, logs) of the training runs (10 GB per run, 50 files per run)
- Final checkpoint & logs of each training (1 GB per run, 10 files per run)
The CheXpert dataset, final model checkpoints, and logs of each training will need to be stored for the duration of the project. The intermediate files generated during the run are temporary, and can be removed after some initial analysis. We expect we don't need to store the intermediate files for more than 10 runs at any given time. Thus, we need a total of 440 GB + 10 * 10 GB + 60 GB = 600 GB of storage to store approximately 225,000 files. We therefore request 1 TB of Snellius project space (the minimum project space size).
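The storage total above can likewise be checked with a short calculation. This sketch only restates the figures from the list and paragraph above (variable names are illustrative):

```python
# Storage estimate, using the per-item figures listed above.
dataset_gb = 440                 # CheXpert dataset
intermediate_gb_per_run = 10     # checkpoints + logs during a run
runs_kept_concurrently = 10      # intermediate files kept for at most 10 runs
final_gb_per_run = 1             # final checkpoint + logs
total_runs = 60

total_gb = (dataset_gb
            + runs_kept_concurrently * intermediate_gb_per_run
            + total_runs * final_gb_per_run)
print(total_gb)                  # 600 GB

# File count: dataset files + kept intermediate files + final files.
total_files = 224_316 + runs_kept_concurrently * 50 + total_runs * 10
print(total_files)               # 225416, i.e. approximately 225,000 files
```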
Software:
We aim to use the PyTorch installation from the module environment to perform this training.
Resources:
- Resources: Snellius
- GPU Snellius: yes
- SBU Snellius: 806,400
- Terabyte Project Space Snellius: 1
| Service | By default |
|---|---|
| Lisa | |
| Snellius | |