Snellius hosted datasets

On Snellius, we host a list of datasets that are frequently used to train or benchmark models, usually in the context of machine learning. Instead of using up your own storage space or waiting for a download to finish, you can freely use the copies in the dataset folders on Snellius.

Most dataset folders are located under /scratch-nvme/ml-datasets/ or /projects/2/managed_datasets/.
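
To see which folders are present under these roots, a short script can help. The following is a minimal sketch, not an official tool: the function name `list_hosted_datasets` is hypothetical, and it assumes both roots are visible from the node you run it on.

```python
from pathlib import Path

# Dataset roots named on this page; adjust if your system differs.
DATASET_ROOTS = ("/scratch-nvme/ml-datasets", "/projects/2/managed_datasets")


def list_hosted_datasets(roots=DATASET_ROOTS):
    """Map each existing root to the sorted names of its subdirectories."""
    found = {}
    for root in map(Path, roots):
        if not root.is_dir():
            continue  # root not mounted or not visible from this node
        found[str(root)] = sorted(p.name for p in root.iterdir() if p.is_dir())
    return found


if __name__ == "__main__":
    for root, names in list_hosted_datasets().items():
        print(root)
        for name in names:
            print("  " + name)
```

Roots that do not exist are skipped silently, so the same script can be run on nodes where only one of the two locations is available.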

We use Python for data storage and conversion.

License: CC BY-NC-SA 3.0 (https://creativecommons.org/licenses/by-nc-sa/3.0/)

This means that you must attribute the work in the manner specified by the authors, you may not use this work for commercial purposes and if you alter, transform, or build upon this work, you may distribute the resulting work only under the same license.

Dataset or model not listed?

If a dataset or model is missing, you can download or upload it to Snellius yourself. If you think other people would also use it, please contact us so that we can add a copy to the public model and dataset space. This avoids many duplicate copies of models and datasets on the system and saves users from downloading or uploading them from external sources. Of course, this does not apply if your dataset or model is proprietary or privacy-sensitive.

Getting access to restricted datasets and models

Some datasets and models are not accessible by default on Snellius because they require explicit acceptance of a license or agreement to terms of use on the website of the dataset or model provider.

If you would like to access these datasets or models on Snellius, please submit a ticket at https://servicedesk.surf.nl and include a screenshot showing that the dataset or model provider has granted you access to the data.

Even if access to a dataset or model is not restricted, it usually still has a license and terms of conduct.
By using the dataset or model, you agree to both the license and the terms of conduct.
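
Before submitting a job, it can help to verify that your account can actually read a dataset folder. The following is a minimal sketch; it assumes that restricted folders are protected with ordinary file permissions, which this page does not state explicitly. The helper name `can_read_dataset` is hypothetical, and the example paths are copied from the tables on this page.

```python
import os


def can_read_dataset(path):
    """Return True if `path` is a directory the current user can list and enter."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)


if __name__ == "__main__":
    # Paths copied from the tables on this page.
    for path in ("/scratch-nvme/ml-datasets/imagenet",
                 "/projects/2/managed_datasets/COCO"):
        status = "readable" if can_read_dataset(path) else "no access (or not present)"
        print(f"{path}: {status}")
```

A `False` result does not distinguish between a missing folder and a folder you have no permission for, so check the path against the tables before requesting access.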

How do I load a hf_cache_dir model or dataset?

Set the cache_dir option of the relevant Hugging Face functions, for example:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "gpt2" # or e.g. meta-llama/Llama-3.3-70B-Instruct

cache_dir = "/projects/2/managed_datasets/hf_cache_dir"
cache_kwargs = dict(cache_dir=cache_dir, local_files_only=True)

model = AutoModelForCausalLM.from_pretrained(model_id, **cache_kwargs)
tokenizer = AutoTokenizer.from_pretrained(model_id, **cache_kwargs)

# or using the pipeline syntax
pipe = pipeline('text-generation', model=model_id, model_kwargs=cache_kwargs)
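
Because local_files_only=True makes loading fail when a model is not in the cache, it can be useful to check for the model first. The following is a minimal sketch that assumes hf_cache_dir follows the standard Hugging Face Hub cache layout, in which a model "org/name" is stored in a folder "models--org--name" with a "snapshots" subfolder. The helper names are hypothetical and not part of any library.

```python
from pathlib import Path

HF_CACHE_DIR = "/projects/2/managed_datasets/hf_cache_dir"


def cached_model_dir(model_id, cache_dir=HF_CACHE_DIR):
    """Return the expected cache folder for `model_id` (standard Hub layout)."""
    # The Hub cache stores "org/name" as "models--org--name".
    return Path(cache_dir) / ("models--" + model_id.replace("/", "--"))


def is_model_cached(model_id, cache_dir=HF_CACHE_DIR):
    """True if at least one snapshot of `model_id` exists in the cache."""
    snapshots = cached_model_dir(model_id, cache_dir) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    for model_id in ("gpt2", "meta-llama/Llama-3.3-70B-Instruct"):
        print(model_id, "cached" if is_model_cached(model_id) else "not found")
```

For restricted models, "not found" may also mean that you do not yet have permission to read the folder.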

Available Models

Model name	Free access	Path on Snellius	Available versions	License	Description	Website	Size
Llama3.3	❌	`/projects/2/managed_datasets/hf_cache_dir`	70B-Instruct	Proprietary (community license)	-	https://llama.meta.com/	132GB
Llama3.2	❌	`/projects/2/managed_datasets/hf_cache_dir`	3B-Instruct	Proprietary (community license)	-	https://llama.meta.com/	6GB
Llama3.1	❌	`/projects/2/managed_datasets/llama3`	405B-MP16 405B-Instruct-MP16 70B 70B-Instruct 8B 8B-Instruct	Proprietary (community license)	-	https://llama.meta.com/	-
Llama3	❌	`/projects/2/managed_datasets/llama3`	8B 8B-Instruct 70B 70B-Instruct	Proprietary (community license)	-	https://llama.meta.com/	-
Llama2	❌	`/projects/2/managed_datasets/llama`	7B 7B-chat 13B 13B-chat 70B 70B-chat	Proprietary (community license)	-	https://llama.meta.com/	-
CodeLlama2	❌	`/projects/2/managed_datasets/codellama`	7B 7B-instruct 7B-python 13B 13B-instruct 13B-python 34B 34B-instruct 34B-python 70B 70B-instruct 70B-python	Proprietary (community license)	-	https://llama.meta.com/	-
Mistral	✅	`/projects/2/managed_datasets/hf_cache_dir`	7B-v0.1 7B-Instruct-v0.1 7B-Instruct-v0.2	Apache 2.0	-	https://huggingface.co/mistralai https://mistral.ai/	-
Mixtral	✅	`/projects/2/managed_datasets/hf_cache_dir`	8x7B-v0.1 8x7B-Instruct-v0.1 8x22B-v0.1 8x22B-Instruct-v0.1	Apache 2.0	-	https://mistral.ai/ https://huggingface.co/mistralai	-
Phi-4	✅	`/projects/2/managed_datasets/hf_cache_dir`	14B	MIT	-	https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft%E2%80%99s-newest-small-language-model-specializing-in-comple/4357090 https://huggingface.co/microsoft/phi-4	28G
Phi-3	✅	`/projects/2/managed_datasets/hf_cache_dir`	mini-4k-instruct mini-128k-instruct	MIT	-	https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3	-
Phi-2	✅	`/projects/2/managed_datasets/hf_cache_dir`	N/A	MIT		https://huggingface.co/microsoft/phi-2
Whisper	✅	`/projects/2/managed_datasets/hf_cache_dir`	large-v3	Apache 2.0		https://huggingface.co/openai/whisper-large-v3
GPT-2	✅	`/projects/2/managed_datasets/hf_cache_dir`	base medium large xl	MIT	-	https://huggingface.co/openai-community?sort_models=likes#models
AlphaFold 2	✅	`/projects/2/managed_datasets/AlphaFold/2.3.1/params`		Apache 2.0	The model can be used through these modules in the 2022 stack: AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0 and AlphaFold/2.3.1-foss-2022a. For more info, see `/projects/2/managed_datasets/AlphaFold/README.md`	https://github.com/google-deepmind/alphafold

Available Datasets

Dataset	Free access	Path on Snellius	Available versions	License	Description	Website	Size
ADE20K	✅	`/projects/2/managed_datasets/ADE20K`	23-02-2024	ADE20K License	ADE20K is composed of more than 27K images from the SUN and Places databases. Images are fully annotated with objects, spanning over 3K object categories. Many of the images also contain object parts, and parts of parts. The original annotated polygons are also provided, as well as object instances for amodal segmentation. Images are also anonymized, blurring faces and license plates.	ADE20K Website	2.3GB
AlphaFold 2	✅	`/projects/2/managed_datasets/AlphaFold/2.3.1`	2.3.1	Apache 2.0	AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment. The model can be used through the AlphaFold/2.3.1-foss-2022a-CUDA-11.7.0 and AlphaFold/2.3.1-foss-2022a modules in the 2022 stack. For more info, see `/projects/2/managed_datasets/AlphaFold/README.md`.	AlphaFold 2 Info	2.62TB
AlphaFold 3	✅ (Code and database)	`/projects/2/managed_datasets/AlphaFold/3.0.0`	3.0.0	Creative Commons Attribution-Non-Commercial ShareAlike International License, Version 4.0.	Compared to AlphaFold 2, AlphaFold 3 improves the prediction accuracy and extends its capabilities to predict structures of protein complexes, including interactions with DNA, RNA, ligands, and ions. Note: Only contains the data. The model weights are subject to restrictive terms and are not hosted on Snellius for the time being. To obtain them you can request access from Google DeepMind directly through their form. We have a temporary container module in the 2024 stack to run the pipeline: AlphaFold/3.0.0-foss-2024a-CUDA-12.6.0. For more information, see `/projects/2/managed_datasets/AlphaFold/AF3_README.md`	AlphaFold 3 Info	628GB
BDD100k (Berkeley Deep Drive 100k)	✅	`/scratch-nvme/ml-datasets/bdd100k`	-	BSD 3-Clause License	BDD100K is a diverse driving dataset for heterogeneous multitask learning.	BDD100K Website	2TB
CC3M ⚠️ INCOMPLETE ⚠️ See dataset description	✅	`/projects/2/managed_datasets/ConceptualCaptions`	-	-	CC3M is a dataset of 3 million image-caption pairs. Downloading the complete CC3M is not possible due to the way it is set up: it is provided as a list of URLs, many of which are broken links. Should you have a complete mirror or duplicate of this dataset, please contact us. An available subset of the dataset is provided as-is; please see `/projects/2/managed_datasets/ConceptualCaptions/downloaded_report.ipynb` for the status of this dataset. Alternatively, use CC12M, which is complete.	CC3M Website	2GB
CC12M	✅	`/projects/2/managed_datasets/CC12M`	-	-	Conceptual 12M (CC12M), a dataset with ~12 million image-text pairs meant to be used for vision-and-language pre-training. It is larger and covers a much more diverse set of visual concepts than CC3M.	CC12M info	2GB
CheXpert	✅	`/projects/2/managed_datasets/chexpert`	v1.0 v1.0 small	-	The CheXpert dataset is a large collection of chest X-ray images and corresponding radiology reports designed for automated interpretation of medical imaging. It includes over 224,000 X-rays labeled for common pathologies like pneumonia, cardiomegaly, and edema. The dataset provides uncertainty labels, enabling robust AI model training, and is widely used in medical research for advancing diagnostic accuracy.	https://stanfordmlgroup.github.io/competitions/chexpert/	450GB
CIFAR10	✅	`/scratch-nvme/ml-datasets/cifar-10`	-	-	CIFAR10 is an image database consisting of 60k 32x32 color images for image classification.	CIFAR10 Info	162MB
CIFAR100	✅	`/scratch-nvme/ml-datasets/cifar-100`	-	-	CIFAR100 is an image database consisting of 60k 32x32 color images in 100 classes for image classification.	CIFAR Info	162MB
Cityscapes	✅	`/scratch-nvme/ml-datasets/cityscapes`	-	Cityscapes License	Cityscapes is a large-scale dataset of stereo street video sequences with 5000 pixel-level annotations and 20k 'weak' annotations. Its primary purpose is to assess semantic segmentation on scene understanding (pixel-level, instance-level, and panoptic).	Cityscapes Info	1.9TB
co3d	✅	`/projects/2/managed_datasets/co3d`	-	License	co3d (Common Objects in 3D) is a dataset designed for 3D object recognition and reconstruction. It contains multi-view images of common objects with annotations for 3D object reconstruction tasks. The dataset includes 3D models, camera parameters, and segmentation masks.	co3d info	5.5TB
COCO (a.k.a. MSCOCO)	✅	`/projects/2/managed_datasets/COCO`	2017	-	MS Coco dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K color images. Most benchmarks are reported on the COCO 2017 images.	COCO Website	46GB
Common Voice	❌	`/projects/0/managed_datasets/common_voice`	21.0	CC-0	The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 26119 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines. NOTE: hosted dataset currently only contains English (en) and Dutch (nl) splits of the data	Common Voice website	90GB
fastMRI	✅	`/scratch-nvme/ml-datasets/fastmri`	07 Jun 2024 breast extension downloaded 04 Dec 2024	fastMRI dataset agreement	fastMRI is a collaborative research project from Facebook AI Research (FAIR) and NYU Langone Health to investigate the use of AI to make MRI scans faster. NYU Langone Health has released fully anonymized knee and brain MRI datasets that can be downloaded from the fastMRI dataset page.	fastMRI Website	6.7TB
FFHQ	✅	`/projects/2/managed_datasets/FFHQ`	2019	-	Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN). We only have the 1024x1024 images.	https://github.com/NVlabs/ffhq-dataset	90GB
FineWeb Edu	✅	(when using Huggingface): `/projects/2/managed_datasets/hf_cache_dir/`	-	-	FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from FineWeb dataset. This is the 1.3 trillion version.	FineWeb Edu Info	8TB
GigaCorpus	✅	`/projects/2/managed_datasets/GigaCorpus`	v1 March 2023	-	With 234GB of varied plaintext, as much as 40 billion tokens, this is at least the largest Dutch corpus. But in addition this corpus is also freely available and the quality is relatively high for its size, care has been taken to get the data as clean as possible. Also, the corpus contains 400 million forum posts in 10 million threads with their timestamp intact for linguistic research.	GigaCorpus Info	500GB
HowTo100M	✅	`/projects/2/managed_datasets/HowTo100M`	Downloaded as of 2024-11	Apache v2.0	HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos where content creators teach complex tasks with an explicit intention of explaining the visual content on screen. HowTo100M features a total of 136M video clips with captions sourced from 1.2M YouTube videos (15 years of video) and 23k activities from domains such as cooking, hand crafting, personal care, gardening or fitness. Each video is associated with a narration available as subtitles automatically downloaded from YouTube.	https://www.di.ens.fr/willow/research/howto100m/	800GB
HYPFLOWSCI6	✅	`/projects/2/managed_datasets/hypflowsci6_v1.0`	V1.0	GPLv3	The datapackage HYPFLOWSCI6 (HYdrological Projection of Future gLObal Water States with CMIP6) contains a simulation dataset of global hydrology and water resource conditions covering the historical/past years from 1960 to the future projected period until 2100. The dataset has 5 arc-minute spatial resolution (about 10 km at the equator) and monthly temporal resolution.	HYPFLOWSCI6 Info	80GB
ImageNet	❌	`/scratch-nvme/ml-datasets/imagenet` `/scratch-nvme/ml-datasets/imagenet21k`	1k 21k	ImageNet License	ImageNet is a famous image database of various resolutions for image classification collected from Flickr and other external websites.	ImageNet Info	1k 21k: 1.2TB
Kinetics	✅	`/scratch-nvme/ml-datasets/kinetics`	kinetics 700-2020	-	Kinetics is a collection of large-scale, high-quality datasets of URL links of up to 650,000 video clips that cover 400/600/700 human action classes, depending on the dataset version. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400/600/700 video clips. Each clip is human annotated with a single action class and lasts around 10 seconds.	Kinetics Info	875G
KITTI	✅	`/projects/2/managed_datasets/KITTI-360`	-	KITTI License	KITTI is an image/video dataset from traffic scenarios for computer vision tasks like stereo, optical flow, visual odometry, 3D object detection, 3D tracking, and semantic segmentation (without annotations).	-	-
LAION400M	✅	`/projects/2/managed_datasets/laion400m`	-	-	LAION-400M is a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search.	Laion 400M info	10TB
LAION High Resolution	✅	`/projects/2/managed_datasets/laion-high-resolution-output`	-	-	Laion high resolution is a >= 1024x1024 subset of laion5B.	laion-high-resolution	50TB
LLaVA-CC3M-Pretrain-595K	✅	Virtual path (when using Huggingface): `/projects/2/managed_datasets/hf_cache_dir/` Real path (raw images): `/projects/2/managed_datasets/hf_cache_dir/downloads/extracted/30814bc1b79e86b8e7ef21b088d25da3ba559b0b6a36848dfd9ff92e75a62604`	-	-	LLaVA Visual Instruct CC3M Pretrain 595K is a subset of CC-3M dataset, filtered with a more balanced concept coverage distribution. Captions are also associated with BLIP synthetic caption for reference. It is constructed for the pretraining stage for feature alignment in visual instruction tuning.	LLaVA-CC3M-Pretrain-595K Info	-
MNIST	✅	`/scratch-nvme/ml-datasets/MNIST`	-	CC BY-SA License	MNIST is an image database of 70k grayscale handwritten digits under 10 categories (0 to 9) with a fixed resolution 28x28.	MNIST Info	55MB
MSMARCO	✅	`/projects/2/managed_datasets/MSMARCO`		CC-4	MS MARCO is a large-scale dataset focused on machine reading comprehension, incl. question-answering, passage ranking, document ranking, keyphrase extraction, and conversational search.	MSMARCO	500GB
Multilingual Librispeech (Dutch only)	✅	`/projects/2/managed_datasets/multilingual_librispeech`		CC BY 4.0	Multilingual LibriSpeech (MLS) dataset is a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox. NB: this is currently only the Dutch partition of the dataset (full dataset also consists of English, German, Spanish, French, Italian, Portuguese, Polish).	MLS	88GB
Places2	✅	`/projects/2/managed_datasets/places2`			Places contains more than 10 million images comprising 400+ unique scene categories. The dataset features 5000 to 30,000 training images per class, consistent with real-world frequencies of occurrence.	Places2	110GB
QReCC	✅	`/projects/2/managed_datasets/QReCC`		Apache 2.0	QReCC (Question Rewriting in Conversational Context) is an end-to-end open-domain question answering dataset comprising 14K conversations with 81K question-answer pairs.	QReCC - Question Rewriting in Conversational Context	30GB
Segment Anything 1B	❌	`/projects/2/managed_datasets/sa-1b`		Custom license, see https://ai.meta.com/datasets/segment-anything-downloads/	The Segment-Anything 1B dataset (SA-1B) is a large-scale collection of over 1 billion image masks, designed for training and evaluating segmentation models. It provides diverse and high-quality segmentation data across a wide range of visual scenes, enabling advancements in computer vision tasks like object recognition and image understanding.	https://ai.meta.com/datasets/segment-anything/	11TB
STL10	✅	`/scratch-nvme/ml-datasets/stl10`	-	-	STL10 is an image database consisting of 60k 96x96 color images for image classification.	STL10 Info	2.5GB
TopiOCQA	✅	`/projects/2/managed_datasets/topiocqa`	-	CC-BY-CA license	information-seeking conversational dataset with challenging topic switching phenomena	topiocqa	7GB
Vox Populi	✅	`/projects/2/managed_datasets/voxpopuli`	v1	CC0 (data) / CC BY-CA 4.0 (code)	A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. The raw data is collected from 2009-2020 European Parliament event recordings. NOTE: hosted dataset currently only contains English (en) and Dutch (nl) splits of the transcribed data and the Dutch split of the raw data, as well as the raw unsegmented audio clips which can be segmented and split using the Voxpopuli code.	Vox Populi	150GB
Waymo - perception	❌	`/projects/2/managed_datasets/waymo`	1.4.3	Waymo Open terms	The Waymo Open Dataset is comprised of high resolution sensor data collected by autonomous vehicles operated by the Waymo Driver in a wide variety of conditions. We store the perception part of the dataset, which consists of high resolution sensor data and labels for 2,030 segments.	Waymo Open Dataset	1.1T
YFCC100M	✅	`/projects/2/managed_datasets/yfcc100m/tars`	-	Various forms of CC	The YFCC100M (Yahoo Flickr Creative Commons 100 Million) dataset is a large-scale collection of 100 million media objects, including photos and videos, sourced from Flickr under Creative Commons licenses. It contains rich metadata such as tags, timestamps, and geolocation, making it valuable for research in computer vision, multimedia analysis, and machine learning.	YFCC100M	12TB
YFCC15M	✅	`/projects/2/managed_datasets/yfcc15m/tars`	-	Various forms of CC	The YFCC15M dataset is a subset of the YFCC100M, consisting of 15 million images specifically curated for deep learning research. It retains the same diverse metadata, with a focus on high-quality, representative samples from the larger dataset, facilitating tasks like image classification and object detection. Due to the way the dataset is downloaded and filtered, some objects in the root folder are still called "yfcc100m", however these contain exclusively yfcc15m objects.	YFCC15M description Filtering code	1.8TB
