Project description:
In this project, we aim to detect pathologies in chest X-rays using neural networks.
For this purpose, we explore two neural network architectures: ResNeXt and EfficientNet.
We train on the CheXpert dataset, which contains 224,316 images.
For each network, we will do hyperparameter optimization using a grid-search approach.
We will explore two optimizers (Adam and SGD with momentum), five learning rates (0.001, 0.002, 0.005, 0.01, and 0.02), and three batch sizes (8, 16, and 32), for a total of 2 × 5 × 3 = 30 hyperparameter settings per network architecture.
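The full grid can be enumerated with a short sketch (the optimizer labels below are illustrative strings, not the exact identifiers used in our training code):

```python
from itertools import product

# Hyperparameter grid as described above
optimizers = ["adam", "sgd_momentum"]                 # 2 optimizers
learning_rates = [0.001, 0.002, 0.005, 0.01, 0.02]   # 5 learning rates
batch_sizes = [8, 16, 32]                             # 3 batch sizes

# Cartesian product over the three axes
grid = list(product(optimizers, learning_rates, batch_sizes))
print(len(grid))       # 30 settings per architecture
print(2 * len(grid))   # 60 runs over both architectures
```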
Project requirements:
Compute:
Each run will take an estimated 100 epochs to converge. In a small test on an NVIDIA GeForce GTX 1080 Ti GPU, one epoch took 2 hours for both ResNeXt and EfficientNet. We estimate that the A100 GPUs in Snellius are roughly twice as fast, so a single run would take an estimated 100 hours on a single A100 GPU. At 128 SBU/hour for a single Snellius GPU, training a single network with a single hyperparameter setting costs an estimated 100 h × 128 SBU/h = 12,800 SBU. With 30 hyperparameter settings for each of the two network architectures, we need 60 runs in total. To allow for trial and error, we request an additional 5%. Thus, the total requested amount of compute is 60 × 12,800 × 1.05 = 806,400 SBU (on the GPU partition).
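The estimate above can be reproduced as a back-of-the-envelope calculation (the 2× A100 speedup and the 5% margin are the assumptions stated in the text):

```python
# Back-of-the-envelope SBU estimate, mirroring the calculation above
epochs_per_run = 100
hours_per_epoch_a100 = 2 / 2     # 2 h/epoch measured on a 1080 Ti, A100 assumed ~2x faster
sbu_per_gpu_hour = 128           # Snellius rate for a single A100 GPU
runs = 2 * 30                    # 2 architectures x 30 hyperparameter settings
margin = 1.05                    # 5% extra for trial and error

sbu_per_run = epochs_per_run * hours_per_epoch_a100 * sbu_per_gpu_hour
total_sbu = runs * sbu_per_run * margin
print(int(sbu_per_run))   # 12800
print(int(total_sbu))     # 806400
```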
Memory:
Test runs on a 1080 Ti showed that a batch size of 8 fits within roughly 10 GB of GPU memory. Since only activation memory grows with batch size (model weights and optimizer state stay constant), memory usage will at most quadruple when we quadruple the batch size to 32, and will therefore still fit within the 40 GB of GPU memory of an A100. We have no special requirements for CPU memory.
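The worst-case bound is a one-line check (the ~10 GB figure is the observed footprint from our test runs; at-most-linear scaling with batch size is an assumption):

```python
# Conservative upper bound: assume GPU memory scales linearly with batch size
observed_gb_at_batch8 = 10   # approximate peak usage on the 1080 Ti at batch size 8
scale = 32 / 8               # largest batch size in the grid vs. test batch size
a100_memory_gb = 40

worst_case_gb = observed_gb_at_batch8 * scale
print(worst_case_gb <= a100_memory_gb)   # True
```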
Storage:
We need to store three items:
- The CheXpert dataset (440 GB, 224,316 files)
- Intermediate files (checkpoints, logs) of the training runs (10 GB per run, 50 files per run)
- Final checkpoint & logs of each training (1 GB per run, 10 files per run)
The CheXpert dataset, final model checkpoints, and logs of each training run will need to be stored for the duration of the project. The intermediate files generated during a run are temporary and can be removed after some initial analysis; we expect to keep intermediate files for at most 10 runs at any given time. Thus, we need a total of 440 GB + 10 × 10 GB + 60 × 1 GB = 600 GB of storage, holding approximately 225,000 files. We therefore request 1 TB of Snellius project space (the minimum project space size).
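The storage budget follows directly from the itemized list above:

```python
# Storage budget, mirroring the itemized list above
dataset_gb, dataset_files = 440, 224_316
concurrent_intermediate_runs = 10                      # runs kept on disk at once
intermediate_gb_per_run, intermediate_files_per_run = 10, 50
total_runs = 60
final_gb_per_run, final_files_per_run = 1, 10

total_gb = (dataset_gb
            + concurrent_intermediate_runs * intermediate_gb_per_run
            + total_runs * final_gb_per_run)
total_files = (dataset_files
               + concurrent_intermediate_runs * intermediate_files_per_run
               + total_runs * final_files_per_run)
print(total_gb)      # 600
print(total_files)   # 225416, i.e. approximately 225,000
```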
Software:
We aim to use the PyTorch installation from the module environment to perform this training.
Resources:
- Resources: Snellius
- GPU Snellius: Yes
- SBU Snellius: 806,400
- Terabyte Project Space Snellius: 1 TB