Dataset | Free access | Path on Snellius | Available versions | License | Description | Website | Size |
---|---|---|---|---|---|---|---|
ADE20K | ✅ | /projects/2/managed_datasets/ADE20K | 23-02-2024 | ADE20K License | ADE20K is composed of more than 27K images from the SUN and Places databases. Images are fully annotated with objects, spanning over 3K object categories. Many of the images also contain object parts, and parts of parts. The original annotated polygons are also provided, as well as object instances for amodal segmentation. Images are also anonymized, blurring faces and license plates. | ADE20K Website | 2.3GB |
AlphaFold | ✅ | /projects/2/managed_datasets/AlphaFold | 2.3.1 | Apache 2.0 | AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiment. | AlphaFold Info | 2.62TB |
BDD100k (Berkeley Deep Drive 100k) | ❌ | /scratch-nvme/ml-datasets/bdd100k | - | BSD 3-Clause License | BDD100K is a diverse driving dataset for heterogeneous multitask learning. | BDD100K Website | 2TB |
CC3M ⚠️ INCOMPLETE ⚠️ See dataset description | ✅ | /projects/2/managed_datasets/ConceptualCaptions | - | - | CC3M is a dataset of 3 million image-caption pairs. Downloading CC3M in full is not possible because of the way it is set up: it is provided as a list of URLs, many of which are broken links. Should you have a complete mirror or duplicate of this dataset, please contact us. The available subset of the dataset is provided as-is; see /projects/2/managed_datasets/ConceptualCaptions/downloaded_report.ipynb for the status of this dataset. Alternatively, use CC12M, which is complete. | CC3M Website | 2GB |
CC12M | ✅ | /projects/2/managed_datasets/CC12M | - | - | Conceptual 12M (CC12M) is a dataset of ~12 million image-text pairs meant to be used for vision-and-language pre-training. It is larger and covers a much more diverse set of visual concepts than CC3M. | CC12M info | 2GB |
CIFAR10 | ✅ | /scratch-nvme/ml-datasets/cifar-10 | - | - | CIFAR10 is an image database consisting of 60k 32x32 color images for image classification. | CIFAR10 Info | 162MB |
CIFAR100 | ✅ | /scratch-nvme/ml-datasets/cifar-100 | - | - | CIFAR100 is an image database consisting of 60k 32x32 color images in 100 classes for image classification. | CIFAR Info | 162MB |
Cityscapes | ✅ | /scratch-nvme/ml-datasets/cityscapes | - | Cityscapes License | Cityscapes is a large-scale dataset of stereo street video sequences with 5000 pixel-level annotations and 20k 'weak' annotations. Its primary purpose is to assess semantic segmentation on scene understanding (pixel-level, instance-level, and panoptic). | Cityscapes Info | 1.9TB |
co3d | ✅ | /projects/2/managed_datasets/co3d | - | License | co3d (Common Objects in 3D) is a dataset designed for 3D object recognition and reconstruction. It contains multi-view images of common objects with annotations for 3D object reconstruction tasks. The dataset includes 3D models, camera parameters, and segmentation masks. | co3d info | 5.5TB |
COCO (Microsoft Common Objects in Context) | ✅ | /projects/2/managed_datasets/COCO | 2017 | - | The MS COCO dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K color images. Most benchmarks are reported on the COCO 2017 images. | COCO Website | 46GB |
fastMRI | ✅ | /scratch-nvme/ml-datasets/fastmri | | fastMRI dataset agreement | fastMRI is a collaborative research project from Facebook AI Research (FAIR) and NYU Langone Health to investigate the use of AI to make MRI scans faster. NYU Langone Health has released fully anonymized knee and brain MRI datasets that can be downloaded from the fastMRI dataset page. | fastMRI Website | 4.2TB |
FFHQ | ✅ | /projects/2/managed_datasets/FFHQ | | - | Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces, originally created as a benchmark for generative adversarial networks (GAN). We only have the 1024x1024 images. | https://github.com/NVlabs/ffhq-dataset | 90GB |
FineWeb Edu | ✅ | (when using Huggingface): /projects/2/managed_datasets/hf_cache_dir/ | | - | The FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the FineWeb dataset. This is the 1.3 trillion token version. | FineWeb Edu Info | 8TB |
GigaCorpus | ✅ | /projects/2/managed_datasets/GigaCorpus | v1 March 2023 | - | With 234GB of varied plaintext (as much as 40 billion tokens), this is, at the very least, the largest Dutch corpus. In addition, the corpus is freely available and its quality is relatively high for its size; care has been taken to get the data as clean as possible. The corpus also contains 400 million forum posts in 10 million threads with their timestamps intact for linguistic research. | GigaCorpus Info | 500GB |
HYPFLOWSCI6 | ✅ | /projects/2/managed_datasets/hypflowsci6_v1.0 | V1.0 | GPLv3 | The data package HYPFLOWSCI6 (HYdrological Projection of Future gLObal Water States with CMIP6) contains a simulation dataset of global hydrology and water resource conditions, covering the historical period from 1960 through the projected future period until 2100. The dataset has 5 arc-minute spatial resolution (about 10 km at the equator) and monthly temporal resolution. | HYPFLOWSCI6 Info | 80GB |
ImageNet | ❌ | /scratch-nvme/ml-datasets/imagenet /scratch-nvme/ml-datasets/imagenet21k | | ImageNet License | ImageNet is a famous image database of various resolutions for image classification collected from Flickr and other external websites. | ImageNet Info | |
Kinetics | ✅ | /scratch-nvme/ml-datasets/kinetics | kinetics 700-2020 | - | Kinetics is a collection of large-scale, high-quality datasets of URL links of up to 650,000 video clips that cover 400/600/700 human action classes, depending on the dataset version. The videos include human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400/600/700 video clips. Each clip is human annotated with a single action class and lasts around 10 seconds. | Kinetics Info | |
KITTI | ✅ | - | - | KITTI License | KITTI is an image/video dataset from traffic scenarios for computer vision tasks like stereo, optical flow, visual odometry, 3D object detection, 3D tracking, and semantic segmentation (without annotations). | - | - |
LAION400M | ✅ | /projects/2/managed_datasets/laion400m | - | - | LAION-400M is a dataset with CLIP-filtered 400 million image-text pairs, their CLIP embeddings and kNN indices that allow efficient similarity search. | Laion 400M info | 10TB |
LLaVA-CC3M-Pretrain-595K | ✅ | Virtual path (when using Huggingface): /projects/2/managed_datasets/hf_cache_dir/; real path (raw images): /projects/2/managed_datasets/hf_cache_dir/downloads/extracted/30814bc1b79e86b8e7ef21b088d25da3ba559b0b6a36848dfd9ff92e75a62604 | - | - | LLaVA Visual Instruct CC3M Pretrain 595K is a subset of the CC-3M dataset, filtered for a more balanced concept coverage distribution. Captions are also associated with BLIP synthetic captions for reference. It is constructed for the pretraining stage for feature alignment in visual instruction tuning. | LLaVA-CC3M-Pretrain-595K Info | - |
MNIST | ✅ | /scratch-nvme/ml-datasets/MNIST | - | CC BY-SA License | MNIST is an image database of 70k grayscale handwritten digits under 10 categories (0 to 9) with a fixed resolution 28x28. | MNIST Info | 55MB |
STL10 | ✅ | /scratch-nvme/ml-datasets/stl10 | - | - | STL10 is an image database of 96x96 color images for image classification, with 13k labeled images in 10 classes and 100k unlabeled images. | STL10 Info | 2.5GB |
MSMARCO | ✅ | /projects/2/managed_datasets/MSMARCO | | CC-4 | MS MARCO is a large-scale dataset focused on machine reading comprehension, including question answering, passage ranking, document ranking, keyphrase extraction, and conversational search. | MSMARCO | 500GB |
QReCC | ✅ | /projects/2/managed_datasets/QReCC | | Apache 2.0 | QReCC (Question Rewriting in Conversational Context) is an end-to-end open-domain question answering dataset comprising 14K conversations with 81K question-answer pairs. | QReCC - Question Rewriting in Conversational Context | 30GB |
TopiOCQA | ✅ | /projects/2/managed_datasets/topiocqa | - | CC-BY-CA license | TopiOCQA is an information-seeking conversational dataset with challenging topic-switching phenomena. | topiocqa | 7GB |
YFCC100M | ✅ | /projects/2/managed_datasets/yfcc100m/tars | - | Various forms of CC | The YFCC100M (Yahoo Flickr Creative Commons 100 Million) dataset is a large-scale collection of 100 million media objects, including photos and videos, sourced from Flickr under Creative Commons licenses. It contains rich metadata such as tags, timestamps, and geolocation, making it valuable for research in computer vision, multimedia analysis, and machine learning. | YFCC100M | 12TB |
YFCC15M | ✅ | /projects/2/managed_datasets/yfcc15m/tars | - | Various forms of CC | The YFCC15M dataset is a subset of YFCC100M, consisting of 15 million images specifically curated for deep learning research. It retains the same diverse metadata, with a focus on high-quality, representative samples from the larger dataset, facilitating tasks like image classification and object detection. Due to the way the dataset is downloaded and filtered, some objects in the root folder are still called "yfcc100m"; however, these contain exclusively yfcc15m objects. | | 1.8TB |
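
The paths in the table can be checked from a job script before any training starts. The following is a minimal sketch, not an official Snellius utility: the helper names `resolve_dataset` and `use_shared_hf_cache` are hypothetical, and it assumes that the Hugging Face `datasets` library honors the `HF_DATASETS_CACHE` environment variable when it is set before the library is imported. Whether the shared cache can be used directly from your account is also an assumption to verify.

```python
import os
from pathlib import Path
from typing import Mapping

# A few paths copied from the table above; extend as needed.
DATASET_PATHS = {
    "CIFAR10": "/scratch-nvme/ml-datasets/cifar-10",
    "COCO": "/projects/2/managed_datasets/COCO",
    "MNIST": "/scratch-nvme/ml-datasets/MNIST",
}

# Shared cache listed for the Hugging Face hosted datasets in the table.
HF_CACHE_DIR = "/projects/2/managed_datasets/hf_cache_dir/"


def resolve_dataset(name: str, paths: Mapping[str, str] = DATASET_PATHS) -> Path:
    """Return the dataset directory, or fail early if it is not visible."""
    if name not in paths:
        raise KeyError(f"Unknown dataset: {name}")
    path = Path(paths[name])
    if not path.is_dir():
        # Raised, for example, when the file system is not mounted on this node.
        raise FileNotFoundError(f"{name} not found at {path}")
    return path


def use_shared_hf_cache(cache_dir: str = HF_CACHE_DIR) -> None:
    """Point Hugging Face `datasets` at the shared cache (hypothetical helper).

    Call this before `import datasets`; the variable is read at import time.
    """
    os.environ["HF_DATASETS_CACHE"] = cache_dir
```

Calling `resolve_dataset("CIFAR10")` at the top of a job script surfaces a missing or unmounted path immediately, rather than partway through a run.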