Table of Contents
- 1. CelebFaces Attributes Dataset
- 2. DOTA
- 3. Google Facial Expression comparison dataset
- 4. Visual Genome
- 5. LibriSpeech
- 6. The Cityscapes
- 7. Kinetics Dataset
- 8. CelebAMask-HQ
- 9. Penn Treebank
- 10. VoxCeleb
- 11. SIXray
- 12. US Accidents
- 13. Ocular Disease Recognition
- 14. Heart Disease
- 15. CLEVR
- 16. Universal Dependencies
- 17. KITTI – 360
- 18. MOT (Multiple Object Tracking)
- 19. PASCAL 3D+
- 20. Facial Deformable Models of Animals
- 21. MPII Human Pose Dataset
- 22. UCF101
- 23. Audioset
- 24. Stanford Natural Language Inference
- 25. Visual Question Answering
- Conclusion
Nowadays, most of us focus on developing machine learning and AI models and on solving problems with existing datasets. But first, it helps to understand what a dataset is, why it matters, and what role it plays in building strong AI and ML solutions.
Today, we have a plethora of open-source datasets on which to conduct research or develop applications to tackle real-world issues in a variety of sectors.
However, high-quality quantitative datasets remain scarce, which is a source of worry, even though the overall volume of data has grown immensely and will continue to expand at a faster rate in the future.
In this post, we will cover freely available datasets that you can utilize to develop your next AI project.
1. CelebFaces Attributes Dataset
CelebFaces Attributes Dataset (CelebA) contains over 200K celebrity photos and 40 attribute annotations for each image, making it an excellent starting point for projects such as face recognition, face detection, landmark (or facial component) localization, and face editing and synthesis. Furthermore, the photos in this collection cover a wide range of pose variations and background clutter.
2. DOTA
DOTA (Dataset for Object Detection in Aerial Images) is a large-scale dataset for object detection that includes 15 common categories (e.g., ship, plane, car), 1,411 images for training, and 458 images for validation.
3. Google Facial Expression comparison dataset
The Google facial expression comparison dataset contains around 500,000 image triplets built from 156,000 face photos. It’s worth noting that each triplet in this dataset was annotated by at least six human raters.
This dataset is useful for projects involving facial expression analysis, such as expression-based image retrieval, emotion classification, and expression synthesis. To gain access to the dataset, a brief form must be completed.
4. Visual Genome
Visual Genome provides Visual Question Answering data in a multiple-choice setting. It is made up of 101,174 MSCOCO images with 1.7 million QA pairs, an average of 17 questions per image.
Compared with the Visual Question Answering dataset, Visual Genome has a more balanced distribution across six question types: What, Where, When, Who, Why, and How.
In addition, the Visual Genome dataset includes 108K images that have been densely annotated with objects, attributes, and relationships.
5. LibriSpeech
The LibriSpeech corpus is a collection of around 1,000 hours of audiobooks from the LibriVox project. The majority of the audiobooks originate from Project Gutenberg.
The training data is divided into three partitions of 100hr, 360hr, and 500hr sets, while the dev and test data are roughly 5hr in audio length.
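As a quick sanity check on the figures above, the three training partitions can be added up. This is only arithmetic on the numbers quoted in this post; the dictionary keys are labels chosen for this example.

```python
# Training partition sizes in hours, as quoted above.
# The keys are labels for this example only.
train_partitions_hours = {"100hr": 100, "360hr": 360, "500hr": 500}

total_train_hours = sum(train_partitions_hours.values())
print(total_train_hours)  # 960
```

The training partitions add up to 960 hours, which fits the "around 1,000 hours" figure once the smaller dev and test sets are included.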
6. The Cityscapes
One of the most well-known large-scale databases of stereo videos of urban street scenes is called The Cityscapes.
It includes recordings from 50 distinct cities, most of them in Germany, with pixel-accurate annotations. The recordings also come with GPS locations, the outdoor temperature, ego-motion data, and right stereo views.
7. Kinetics Dataset
The Kinetics dataset is one of the most well-known large-scale, high-quality video datasets for human activity recognition. It has at least 600 video clips for each of its 600 human activity classes, for a total of over 500,000 clips.
The clips were pulled from YouTube; each one is around 10 seconds long and is labeled with a single activity class.
8. CelebAMask-HQ
CelebAMask-HQ is a collection of 30,000 high-resolution face photos with carefully annotated masks covering 19 classes, including facial components and accessories such as skin, nose, eyes, eyebrows, ears, mouth, lip, hair, hat, eyeglass, earring, necklace, neck, and cloth.
The dataset can be used to train and test face recognition, face parsing, and GAN-based face generation and editing algorithms.
9. Penn Treebank
One of the most notable and often used corpora for the assessment of models for sequence tagging is the English Penn Treebank (PTB) corpus, in particular the portion of the corpus corresponding to Wall Street Journal articles.
The task is to tag each word with its part of speech. The corpus is also frequently used for character-level and word-level language modeling.
10. VoxCeleb
VoxCeleb is a large-scale speaker identification dataset generated automatically from open-source media. VoxCeleb has over a million utterances from over 6k speakers.
Because the dataset is audio-visual, it can be used for a variety of additional applications, including visual speech synthesis, speech separation, cross-modal transfer from face to voice or vice versa, and training face recognition from video to supplement current face recognition datasets.
11. SIXray
The SIXray dataset includes 1,059,231 X-ray pictures gathered from subway stations and annotated by human security inspectors to detect six main kinds of forbidden items: pistols, knives, wrenches, pliers, scissors, and hammers. Furthermore, bounding boxes for each disallowed item have been manually added to the testing sets in order to evaluate the performance of object localization.
12. US Accidents
The project’s substance is already revealed by the name of the dataset, US Accidents. This dataset on nationwide automobile accidents includes information from February 2016 to December 2021 and covers 49 states in the USA.
Approximately 1.5 million accident records are now present in this collection. It was gathered in real-time by utilizing several traffic APIs.
These APIs transmit traffic information gathered from a variety of sources, including traffic cameras, law enforcement organizations, and the US and state departments of transportation.
13. Ocular Disease Recognition
Ocular Disease Intelligent Recognition (ODIR) is a structured ophthalmic database containing information on 5,000 patients, including their age, color fundus photographs of their left and right eyes, and doctors’ diagnostic keywords.
This dataset is a real-life collection of patient data gathered by Shanggong Medical Technology Co., Ltd. from various hospitals and medical facilities in China. The annotations were labeled by trained human readers under quality control management.
14. Heart Disease
This heart disease dataset helps identify the presence of heart disease in a patient based on 76 attributes such as age, gender, chest pain type, resting blood pressure, and so on.
The database contains 303 cases, and the usual task is simply to distinguish the presence of disease (values 1, 2, 3, 4) from its absence (value 0).
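The presence-versus-absence split described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the dataset's authors: the function name and the sample values are made up for the example, and only the 0 versus 1 to 4 rule comes from the description above.

```python
def has_heart_disease(target: int) -> bool:
    """Collapse the 0-4 diagnosis value into presence (True) or absence (False)."""
    if target not in range(5):
        raise ValueError(f"expected a value from 0 to 4, got {target}")
    return target > 0


# Hypothetical target column, one value per patient record.
raw_targets = [0, 2, 1, 0, 4, 3]

# Turn every record into a binary label: 1 = disease present, 0 = absent.
binary_labels = [int(has_heart_disease(t)) for t in raw_targets]
print(binary_labels)  # [0, 1, 1, 0, 1, 1]
```

Collapsing the four "present" values into one class turns the problem into binary classification, which is the simpler setup the description refers to.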
15. CLEVR
The CLEVR dataset (Compositional Language and Elementary Visual Reasoning) is a synthetic Visual Question Answering dataset. It consists of images of 3D-rendered objects, with each image accompanied by a series of highly compositional questions divided into several categories.
The dataset comprises 70,000 images and 700,000 questions for training, 15,000 images and 150,000 questions for validation, and 15,000 images and 150,000 questions for testing. Answers, scene graphs, and functional programs are provided for all training and validation images and questions.
16. Universal Dependencies
The Universal Dependencies (UD) project aims to create cross-linguistically uniform morphology and syntax treebank annotation for many languages. Version 2.7, which was released in 2020, has 183 treebanks in 104 languages.
The annotation is made up of universal POS tags, dependency heads, and universal dependency labels.
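To make the three annotation layers concrete, here is a minimal sketch that reads them from a token line in CoNLL-U, the 10-column tab-separated format in which UD treebanks are distributed. The `Token` type, the `parse_token_line` helper, and the two-word sentence are made up for this example, and the sketch handles only plain word lines (not comments, multiword token ranges, or empty nodes).

```python
from typing import NamedTuple


class Token(NamedTuple):
    index: int   # ID column: 1-based position in the sentence
    form: str    # FORM column: the word as written
    upos: str    # UPOS column: universal part-of-speech tag
    head: int    # HEAD column: index of the governing word, 0 for the root
    deprel: str  # DEPREL column: universal dependency label


def parse_token_line(line: str) -> Token:
    """Parse one 10-column CoNLL-U word line into the fields named above."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    return Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7])


# Made-up two-word sentence, not taken from any treebank.
sample = [
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_",
]

tokens = [parse_token_line(line) for line in sample]
for tok in tokens:
    print(tok.form, tok.upos, tok.head, tok.deprel)
```

Here "Dogs" is tagged NOUN and attached to word 2 ("bark") as its subject, while "bark" has head 0, which marks it as the root of the sentence.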
17. KITTI – 360
One of the most often used datasets for mobile robots and autonomous driving is KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute).
It is made up of hours’ worth of traffic scenarios that were captured using a range of sensor modalities, such as high-resolution RGB cameras, grayscale stereo cameras, and a 3D laser scanner. The dataset has been improved over time by several researchers who manually annotated various portions of it to suit their needs.
18. MOT (Multiple Object Tracking)
MOT (Multiple Object Tracking) is a dataset for multiple object tracking that includes indoor and outdoor scenes of public places, with pedestrians as the objects of interest. Each scene’s video is split into two parts, one for training and the other for testing.
The dataset includes object detections in video frames using three detectors: SDP, Faster-RCNN, and DPM.
19. PASCAL 3D+
The Pascal3D+ multi-view dataset is made up of images collected in the wild, i.e., images of object categories with high variability, captured in uncontrolled conditions, in cluttered scenes, and in a variety of poses. Pascal3D+ includes 12 rigid object categories drawn from the PASCAL VOC 2012 dataset.
These objects are annotated with pose information (azimuth, elevation, and distance to the camera). Pascal3D+ additionally includes pose-annotated images from the ImageNet collection in these 12 categories.
20. Facial Deformable Models of Animals
The goal of the Facial Deformable Models of Animals (FDMA) project is to challenge current methodologies in human facial landmark identification and tracking and to develop new algorithms that can deal with the considerably bigger variability that is characteristic of animal facial characteristics.
The project’s algorithms demonstrated the ability to recognize and track landmarks on human faces while dealing with variances induced by changes in facial emotions or positions, partial occlusions, and lighting.
21. MPII Human Pose Dataset
The MPII Human Pose Dataset contains around 25K images, 15K of which are training samples, 3K of which are validation samples, and 7K of which are testing samples.
The poses are manually labeled with up to 16 body joints, and the images are taken from YouTube videos covering 410 different human activities.
22. UCF101
The UCF101 dataset contains 13,320 video clips organized into 101 categories. These 101 categories are grouped into five types: body motion, human-human interactions, human-object interactions, playing musical instruments, and sports.
The videos are from YouTube and total 27 hours in duration.
23. Audioset
Audioset is an audio event dataset made up of over 2 million human-annotated 10-second video segments. The data is annotated using a hierarchical ontology comprising 632 event types, which means that the same sound can be labeled in more than one way.
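One way to read the point about a hierarchical ontology is that a sound can be described by a specific label or by any of that label's broader ancestors. The sketch below illustrates the idea with a toy three-level hierarchy; the labels and the `PARENT` mapping are invented for this example and are not taken from the real ontology.

```python
from typing import Dict, List, Optional

# Toy child -> parent mapping. These labels are illustrative only.
PARENT: Dict[str, Optional[str]] = {
    "Dog bark": "Dog",
    "Dog": "Animal",
    "Animal": None,  # top of this toy hierarchy
}


def label_path(label: str) -> List[str]:
    """Return the label followed by all of its ancestors, most specific first."""
    path: List[str] = []
    current: Optional[str] = label
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


print(label_path("Dog bark"))  # ['Dog bark', 'Dog', 'Animal']
```

In this toy setup, a clip of a barking dog could reasonably carry any of the three labels, depending on how specific the annotation is.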
24. Stanford Natural Language Inference
The SNLI dataset (Stanford Natural Language Inference) contains 570k sentence pairs that have been manually labeled as entailment, contradiction, or neutral.
Premises are Flickr30k image descriptions, while hypotheses were written by crowd-sourced annotators who were given a premise and asked to write entailing, contradicting, and neutral statements.
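A premise, a hypothesis, and one of three labels is all an example needs, so the structure can be sketched as plain tuples. The sentences below are made up for illustration and are not taken from SNLI.

```python
from collections import Counter

LABELS = ("entailment", "contradiction", "neutral")

# Made-up (premise, hypothesis, label) examples: one premise, three hypotheses.
pairs = [
    ("A man plays a guitar on stage.", "A person is making music.", "entailment"),
    ("A man plays a guitar on stage.", "The man is asleep at home.", "contradiction"),
    ("A man plays a guitar on stage.", "The man is a famous musician.", "neutral"),
]

# Count how many examples carry each label.
counts = Counter(label for _, _, label in pairs)
for label in LABELS:
    print(label, counts[label])
```

Counting labels like this is a common first check that the three classes are reasonably balanced before training a model.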
25. Visual Question Answering
Visual Question Answering (VQA) is a dataset that contains open-ended questions regarding pictures. To answer these questions, you need to grasp vision, language, and common sense.
Conclusion
As machine learning and artificial intelligence (AI) become more prevalent in practically every business and in our daily lives, so does the number of resources and information available on the subject.
Ready-made public datasets provide a great starting point to develop AI models while also allowing seasoned ML programmers to save time and focus on other elements of their projects.