Open Data Datasets for Deep Learning and Machine Learning

Here you will find a curated list of datasets that you can use as you wish in your data science projects (neural networks, machine learning, etc.). If you are looking for others, I also suggest consulting the Wikipedia page on the subject, which lists many more.


Image datasets:

  • MNIST: handwritten digits: The most commonly used sanity check. Dataset of 28×28, centered, B&W handwritten digits. It is an easy task: just because something works on MNIST doesn’t mean it works on harder problems.
  • CIFAR10 / CIFAR100: 32×32 color images with 10 / 100 categories. Not commonly used anymore, though once again, can be an interesting sanity check.
  • Caltech 101: Pictures of objects belonging to 101 categories.
  • Caltech 256: Pictures of objects belonging to 256 categories.
  • STL-10 dataset: An image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Like CIFAR-10 with some modifications.
  • The Street View House Numbers (SVHN): House numbers from Google Street View. Think of this as recurrent MNIST in the wild.
  • NORB: Binocular images of toy figurines under various illumination and pose.
  • Pascal VOC: Generic image segmentation / classification. Not terribly useful for building real-world image annotation, but great for baselines.
  • Labelme: A large dataset of annotated images.
  • ImageNet: The de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000-category WordNet hierarchy from ImageNet.
  • LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.
  • MS COCO: Generic image understanding / captioning, with an associated competition.
  • COIL 20: Different objects imaged at every angle in a 360° rotation.
  • COIL 100: Different objects imaged at every angle in a 360° rotation.
  • Google’s Open Images: A collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories” under Creative Commons.
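
For readers who download the raw MNIST files instead of going through a library loader: the images are stored in the IDX binary format, which is a 16-byte big-endian header (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The following is a minimal sketch of a parser using only the standard library. The function name `parse_idx_images` is hypothetical, and the demo builds a small synthetic file rather than reading the real dataset.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # 0x00000803: unsigned-byte data with 3 dimensions


def parse_idx_images(raw: bytes) -> list:
    """Decode an IDX image file into a list of images (rows of pixel ints)."""
    if len(raw) < 16:
        raise ValueError("file too short for an IDX image header")
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number {magic}")
    pixels = raw[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel payload does not match the header")
    size = rows * cols
    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        # Split the flat pixel run into `rows` lists of `cols` values.
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


if __name__ == "__main__":
    # Synthetic stand-in for a real file: two all-black 28x28 images.
    fake = struct.pack(">IIII", IDX_IMAGE_MAGIC, 2, 28, 28) + bytes(2 * 28 * 28)
    images = parse_idx_images(fake)
    print(len(images), len(images[0]), len(images[0][0]))  # prints: 2 28 28
```

The distributed files are usually gzip-compressed, so they would need to be decompressed first (for example with `gzip.open`) before being passed to a parser like this one.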


Geospatial data:

  • OpenStreetMap: Vector data for the entire planet under a free license. It contains (an older version of) the US Census Bureau’s data.
  • Landsat8: Satellite shots of the entire Earth surface, updated every few weeks.
  • NEXRAD: Doppler radar scans of atmospheric conditions in the US.


Face datasets:

  • Labelled Faces in the Wild: 13,000 cropped facial regions (using Viola-Jones) that have been labeled with a name identifier. A subset of the people present have two images in the dataset; it’s quite common for people to train facial matching systems here.
  • UMD Faces: Annotated dataset of 367,920 faces of 8,501 subjects.
  • CASIA WebFace: Facial dataset of 453,453 images over 10,575 identities after face detection. Requires some filtering for quality.
  • MS-Celeb-1M: 1 million images of celebrities from around the world. Requires some filtering for best results on deep networks.
  • Olivetti: A few images of several different people.
  • Multi-Pie: The CMU Multi-PIE Face Database
  • Face-in-Action
  • JACFEE: Japanese and Caucasian Facial Expressions of Emotion
  • FERET: The Facial Recognition Technology Database
  • mmifacedb: MMI Facial Expression Database
  • IndianFaceDatabase
  • The Yale Face Database and The Yale Face Database B.
  • Mut1ny Face/Head segmentation dataset: Over 16k pixel-level segmented images of faces/heads.


Video datasets:

  • Youtube-8M: A large and diverse labeled video dataset for video understanding research.


Text datasets:

  • 20 newsgroups: Classification task, mapping word occurrences to newsgroup ID. One of the classic datasets for text classification, usually useful as a benchmark for pure classification or as a validation of any IR / indexing algorithm.
  • Reuters News dataset: (Older) purely classification-based dataset with text from the newswire. Commonly used in tutorials.
  • Penn Treebank: Used for next word prediction or next character prediction.
  • UCI’s Spambase: (Older) classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
  • Broadcast News: Large text dataset, classically used for next word prediction.
  • Text Classification Datasets: From Zhang et al., 2015. An extensive set of eight datasets for text classification. These are the benchmark for new text classification baselines. Sample size of 120K to 3.6M, ranging from binary to 14-class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo! and AG.
  • WikiText: A large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.
  • SQuAD: The Stanford Question Answering Dataset — broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a segment of text.
  • Billion Words dataset: A large general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec.
  • Common Crawl: Petabyte-scale crawl of the web, most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset, since it is a crawl of the WWW.
  • Google Books Ngrams: Successive words from Google books. Offers a simple method to explore when a word first entered wide usage.
  • Yelp Open Dataset: The Yelp dataset is a subset of Yelp businesses, reviews, and user data for use in NLP.
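
As an illustration of the Google Books Ngrams use case mentioned above (finding when a word first entered wide usage), here is a minimal sketch. It assumes each record is a tab-separated line of the form `ngram, year, match_count, volume_count`, which is an assumption about the file layout that should be checked against the release you download. The function name `first_year_in_wide_use` and the `min_volumes` threshold are hypothetical, and the sample rows are made up for the demo.

```python
from typing import Iterable, Optional


def first_year_in_wide_use(
    lines: Iterable[str], word: str, min_volumes: int = 100
) -> Optional[int]:
    """Return the earliest year in which `word` appears in at least
    `min_volumes` distinct volumes, or None if it never does."""
    years = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip rows that do not match the assumed layout
        ngram, year, _match_count, volume_count = fields
        if ngram == word and int(volume_count) >= min_volumes:
            years.append(int(year))
    return min(years) if years else None


if __name__ == "__main__":
    # Made-up rows in the assumed layout; not real Ngrams data.
    sample = [
        "dataset\t1960\t4\t3",
        "dataset\t1985\t900\t150",
        "dataset\t2000\t12000\t2500",
    ]
    print(first_year_in_wide_use(sample, "dataset"))  # prints: 1985
```

What counts as "wide usage" is a judgment call; a threshold on the number of volumes is only one possible definition, and a relative frequency per year would be another.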