
Open Datasets
We provide various datasets owned by MMLab. The datasets, together with their annotations, are now easy to access here at OpenMMLab.
Datasets
face
DeeperForensics-1.0 Dataset
DeeperForensics-1.0, a new dataset for real-world face forgery detection, features three appealing properties: good quality, large scale, and high diversity. The full dataset includes 48,475 source videos and 11,000 manipulated videos, an order of magnitude larger than existing datasets. The source videos are carefully collected from 100 paid actors, with consent, from 26 countries, and the manipulated videos are generated by a newly proposed many-to-many end-to-end face swapping method, DF-VAE. In addition, 7 types of real-world perturbations at 5 intensity levels are applied to ensure a larger scale and higher diversity.
This dataset does not support anonymous download; please visit the project page to download it.
action
FineGym
On public benchmarks, current action recognition techniques have achieved great success. However, when used in real-world applications, e.g., sports analysis, which requires the capability of parsing an activity into phases and differentiating between subtly different actions, their performance remains far from satisfactory. To take action recognition to a new level, we develop FineGym, a new dataset built on top of gymnastics videos. Compared to existing action recognition datasets, FineGym is distinguished in richness, quality, and diversity. In particular, it provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy. For example, a "balance beam" event will be annotated as a sequence of elementary sub-actions derived from five sets: "leap-jump-hop", "beam-turns", "flight-salto", "flight-handspring", and "dismount", where the sub-action in each set is further annotated with finely defined class labels. This new level of granularity presents significant challenges for action recognition, e.g., how to parse the temporal structures of a coherent action, and how to distinguish between subtly different action classes. We systematically investigate representative methods on this dataset and obtain a number of interesting findings. We hope this dataset will advance research on action understanding.
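The three-level hierarchy is easiest to see as nested data. Below is a minimal sketch of one annotated event; the field names, element labels, and timestamps are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical three-level FineGym annotation: event -> sub-action set -> element.
annotation = {
    "event": "balance_beam",
    "segments": [
        {"set": "beam-turns",   "element": "pivot_turn_on_one_leg", "start": 12.4, "end": 14.1},
        {"set": "flight-salto", "element": "salto_backward_tucked", "start": 14.1, "end": 15.6},
        {"set": "dismount",     "element": "double_salto_backward", "start": 15.6, "end": 17.9},
    ],
}

# Recover the sub-action sequence for the event.
sequence = [seg["set"] for seg in annotation["segments"]]
print(sequence)  # ['beam-turns', 'flight-salto', 'dismount']
```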
This dataset does not support anonymous download; please visit the project page to download it.
person
language
place
action
MovieNet
MovieNet is a holistic dataset for comprehensive movie understanding. In MovieNet, we provide:
• Massive data, including 1,100 movies, 60K trailers, 375K pieces of metadata, etc.
• Various annotations, including character bounding boxes and IDs; cinematic styles, i.e. shot scale and shot movement; scene temporal boundaries; action and place tags; movie synopsis association, etc. (a record sketch follows this list)
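As a rough illustration of how these annotations fit together, here is a hypothetical per-shot record; all keys and values are assumptions for illustration, not MovieNet's real file format.

```python
# Hypothetical per-shot MovieNet record combining the annotation types above.
shot = {
    "movie_id": "tt0000000",           # placeholder IMDb-style ID
    "shot_scale": "medium",            # cinematic style: shot scale
    "shot_movement": "static",         # cinematic style: shot movement
    "characters": [
        {"id": "cast_01", "bbox": [120, 60, 310, 420]},  # [x1, y1, x2, y2]
    ],
    "scene_boundary": False,           # whether a scene ends at this shot
    "tags": {"action": ["talk"], "place": ["office"]},
}
```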
This dataset does not support anonymous download; please visit the project page to download it.
22.625GB
detection
MessyTable
MessyTable features a large number of scenes with messy tables captured from multiple camera views. Each scene in this dataset is highly complex, containing multiple object instances that can be identical, stacked, or occluded by other instances. The key challenge is to associate all instances given the RGB images of all views. The dataset challenges existing methods in mining subtle appearance differences, reasoning based on context, and fusing appearance with geometric cues to establish associations.
place
Placepedia
The Placepedia dataset contains 240K places with 35M images from all over the world. Each place is associated with its district, city/town/village, state/province, country, continent, and a large collection of diverse photos. Both administrative areas and places have rich side information, e.g., description, population, category, and function.
action
TAPOS
Current methods for action recognition primarily rely on deep convolutional networks to derive feature embeddings of visual and motion features. While these methods have demonstrated remarkable performance on standard benchmarks, we still need a better understanding of how videos, in particular their internal structures, relate to high-level semantics, which may bring benefits in multiple aspects, e.g., interpretable predictions and even new methods that take recognition performance to the next level. Towards this goal, we construct TAPOS, a new dataset developed on sports videos with manual annotations of sub-actions, and conduct a study of temporal action parsing on top of it. Our study shows that a sports activity usually consists of multiple sub-actions and that awareness of such temporal structures is beneficial to action recognition. We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels. On the constructed TAPOS, the proposed method is shown to reveal intra-action information, i.e. how action instances are composed of sub-actions, and inter-action information, i.e. that one specific sub-action may commonly appear in various actions.
This dataset does not support anonymous download; please visit the project page to download it.
42.449GB
segmentation
CULane
CULane is a large-scale, challenging dataset for academic research on traffic lane detection. It was collected by cameras mounted on six different vehicles driven by different drivers in Beijing. More than 55 hours of video were collected, and 133,235 frames were extracted. The dataset is divided into 88,880 frames for the training set, 9,675 for the validation set, and 34,680 for the test set.
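A minimal loading sketch for the splits described above, assuming each split is enumerated in a plain-text file of relative frame paths under the dataset root; the list-file names here are hypothetical.

```python
from pathlib import Path

# Load one split from a plain-text list of relative frame paths (assumed layout).
def load_split(root: str, list_file: str) -> list[Path]:
    base = Path(root)
    with open(base / list_file) as f:
        return [base / line.strip() for line in f if line.strip()]

train = load_split("CULane", "train_list.txt")  # expected: 88,880 frames
val = load_split("CULane", "val_list.txt")      # expected: 9,675 frames
test = load_split("CULane", "test_list.txt")    # expected: 34,680 frames
```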
person
DeepFashion Dataset
We contribute the DeepFashion database, a large-scale clothes database with several appealing properties: First, DeepFashion contains over 800,000 diverse fashion images ranging from well-posed shop images to unconstrained consumer photos. Second, DeepFashion is annotated with rich information about clothing items: each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, a bounding box, and clothing landmarks. Third, DeepFashion contains over 300,000 cross-pose/cross-domain image pairs.
This dataset does not support anonymous download; please visit the project page to download it.
others
FashionGAN Dataset
New annotations (language descriptions and segmentation maps) on a subset of the DeepFashion dataset. The data was used in our ICCV 2017 paper "Be Your Own Prada: Fashion Synthesis with Structural Coherence".
This dataset does not support anonymous download; please visit the project page to download it.
6.955GB
pose
kinetics-skeleton
This is a dataset for skeleton-based human understanding, including but not limited to pose estimation, action recognition, and skeleton sequence generation. Skeletons extracted from Kinetics videos are provided by this dataset.
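A hedged sketch of reading one skeleton sequence: the JSON layout below (per-frame lists of per-person keypoints plus a clip-level label) is an assumption for illustration, not a documented schema of the release.

```python
import json

# Read one skeleton-sequence file under an assumed JSON layout.
def load_skeleton(path: str):
    with open(path) as f:
        clip = json.load(f)
    frames = []
    for frame in clip["frames"]:                                  # hypothetical key
        frames.append([p["keypoints"] for p in frame["people"]])  # hypothetical keys
    return frames, clip.get("label")                              # hypothetical key
```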
3.417GB
others
Web Image Dataset for Event Recognition (WIDER)
WIDER is a dataset for complex event recognition from static images. As of v0.1, it contains 61 event categories and 50,574 images annotated with event class labels. We provide a split of 50% for training and 50% for testing.
173.408GB
face
re-id
person
WIDER 2019
The dataset centers around the problem of precise localization of human faces and bodies, and accurate association of identities. It comprises four parts:
• WIDER Face Detection aims at soliciting new approaches to advance the state of the art in face detection.
• WIDER Pedestrian Detection has the goal of gathering effective and efficient approaches to address the problem of pedestrian detection in unconstrained environments.
• WIDER Cast Search by Portrait presents an exciting challenge of searching for cast members across hundreds of movies.
• WIDER Person Search by Language aims to seek new approaches to searching for a person by natural language.
3.419GB
face
WIDER ATTRIBUTE Dataset
The WIDER ATTRIBUTE dataset is a human attribute recognition benchmark whose images are selected from the publicly available WIDER dataset. There are a total of 13,789 images. We annotate a bounding box for each person in these images, but no more than the 20 highest-resolution people in a crowded image, resulting in 57,524 boxes in total and 4+ boxes per image on average. For each bounding box, we label 14 distinct human attributes, resulting in a total of 805,336 labels.
3.424GB
face
WIDER FACE Dataset
The WIDER FACE dataset is a face detection benchmark whose images are selected from the publicly available WIDER dataset. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose, and occlusion. The WIDER FACE dataset is organized based on 61 event classes. For each event class, we randomly select 40%/10%/50% of the data as the training, validation, and testing sets. We adopt the same evaluation metric as the PASCAL VOC dataset. As with the MALF and Caltech datasets, we do not release bounding box ground truth for the test images; users are required to submit final prediction files, which we will then evaluate.
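The per-event-class 40%/10%/50% split can be sketched in a few lines. This is only an illustration of the procedure as described, not the official split script, so the resulting partition will differ from the released one; `images` is assumed to be a list of (image_id, event_class) pairs.

```python
import random
from collections import defaultdict

# Randomly split images 40%/10%/50% within each event class.
def split_by_event(images, seed=0):
    rng = random.Random(seed)
    by_event = defaultdict(list)
    for image_id, event in images:
        by_event[event].append(image_id)
    train, val, test = [], [], []
    for ids in by_event.values():
        rng.shuffle(ids)
        n_train, n_val = int(0.4 * len(ids)), int(0.1 * len(ids))
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]
    return train, val, test
```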
3.665GB
language
detection
WildLife Documentary (WLD) Dataset
The dataset contains 15 documentary films downloaded from YouTube, with durations varying from 9 minutes to as long as 50 minutes and a total of more than 747,000 frames. More than 4,000 object tracklets across 65 categories are annotated.
84.517MB
face
CUHK Face Sketch FERET Database (CUFSF)
The CUHK Face Sketch FERET Database (CUFSF) is for research on face sketch synthesis and face sketch recognition. It includes 1,194 persons from the FERET database. For each person, there is a face photo with lighting variation and a sketch with shape exaggeration drawn by an artist while viewing that photo.
76.816MB
low-level
CUHK Image Cropping Dataset
Image cropping is a common operation used to improve the visual quality of photographs. This dataset was used in our paper presenting an approach to automatic image cropping. The photos are of varying aesthetic quality and span a variety of image categories, including animals, architecture, humans, landscapes, night scenes, plants, and man-made objects. Each image is manually cropped by three expert photographers (graduate students in art whose primary medium is photography) to form three training sets.
7.56GB
face
Expression in-the-Wild (ExpW) Dataset
We built a new database named the Expression in-the-Wild (ExpW) dataset, which contains 91,793 faces manually labeled with expressions. Each face image was manually annotated as one of the seven basic expression categories: "angry", "disgust", "fear", "happy", "sad", "surprise", or "neutral". The number of images in ExpW is larger, and the face variations more diverse, than in many existing databases. The dataset is used in our paper "From Facial Expression Recognition to Interpersonal Relation Prediction".
100.551MB
low-level
General 100 Dataset
The General-100 dataset contains 100 uncompressed bmp-format images. We used this dataset in our FSRCNN ECCV 2016 paper. The size of these 100 images ranges from 710 x 704 (large) to 131 x 112 (small). They are all of good quality with clear edges but few smooth regions (e.g., sky and ocean), and are thus well suited for super-resolution training.
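For context, super-resolution training pairs are typically built by bicubic downsampling of the high-resolution originals. Below is a minimal sketch of that preparation step, assuming a placeholder file name; it is not the exact FSRCNN training pipeline.

```python
from PIL import Image

# Build an LR/HR pair by cropping to a multiple of the scale factor and
# bicubic-downsampling the HR image.
def make_pair(path: str, scale: int = 3):
    hr = Image.open(path).convert("RGB")
    w, h = (hr.width // scale) * scale, (hr.height // scale) * scale
    hr = hr.crop((0, 0, w, h))
    lr = hr.resize((w // scale, h // scale), Image.BICUBIC)
    return lr, hr

lr, hr = make_pair("im_001.bmp")  # placeholder file name
```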
3.035GB
person
re-id
LPW
The dataset is collected in three different crowded scenes. Three cameras are placed in the first scene and four in each of the other two. During collection, cameras with the same parameter settings were placed at two junctions of the street. Labeled Pedestrian in the Wild (LPW) consists of 2,731 different pedestrians, and we ensure that each annotated identity is captured by at least two cameras so that cross-camera search can be performed. A total of 7,694 image sequences are generated, with an average of 77 frames per sequence.
15.5MB
others
MIT Trajectory Dataset (Single Camera)
The MIT trajectory dataset is for research on activity analysis in a single camera view, using the trajectories of objects as features. Object tracking is based on background subtraction using an Adaptive Gaussian Mixture model. There are 40,453 trajectories in total, obtained from a parking lot scene over five days. Some short trajectories have been filtered out.
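The background-subtraction step mentioned above can be sketched with OpenCV's adaptive Gaussian mixture model (MOG2). The video path is a placeholder, and the dataset itself ships trajectories rather than this pipeline.

```python
import cv2

# Extract per-frame foreground masks with an adaptive Gaussian mixture model.
cap = cv2.VideoCapture("parking_lot.avi")  # placeholder path
subtractor = cv2.createBackgroundSubtractorMOG2()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)  # foreground pixels feed the tracker
cap.release()
```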
143.197MB
face
Multi-Task Facial Landmark (MTFL) Dataset
The dataset is used in our ECCV paper for training a multi-task deep model for facial landmark detection. It consists of 12,995 face images, each annotated with a bounding box and five landmarks, i.e. the centers of the eyes, the nose, and the corners of the mouth. In addition, it includes annotations for related tasks: 'smiling', 'wearing glasses', 'gender', and 'head pose'.
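A hedged parsing sketch for one annotation record (image path, five landmarks, four task labels); the field order is an assumption for illustration, not the dataset's documented format.

```python
# Parse one record assuming: path, then five interleaved (x, y) landmark
# coordinates, then four integer task labels (all assumed ordering).
def parse_record(line: str):
    parts = line.split()
    path = parts[0]
    coords = list(map(float, parts[1:11]))
    landmarks = list(zip(coords[0::2], coords[1::2]))  # five (x, y) pairs
    smiling, glasses, gender, pose = map(int, parts[11:15])
    return path, landmarks, {"smiling": smiling, "glasses": glasses,
                             "gender": gender, "pose": pose}
```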
52.523MB
person
Pedestrian Color Naming Dataset
To facilitate the learning and evaluation of pedestrian color naming, we build a new large-scale dataset, the Pedestrian Color Naming (PCN) dataset, which contains 14,213 images, each hand-labeled with a color label for every pixel. All images in the PCN dataset are obtained from the Market-1501 dataset.
720.383MB
others
Social Relation Dataset
We define the social relation traits based on the interpersonal circle proposed by Kiesler, in which human relations are divided into 16 segments. Each segment has its opposite side in the circle, such as 'friendly' and 'hostile'. To investigate the detectability of social relations from a pair of face images, we build a new dataset containing 8,306 images chosen from the web and movies. Each image is labeled with face bounding boxes and their pairwise relations. This is the first face dataset measuring social relation traits, and it is challenging because of large face variations, including poses, occlusions, and illumination.
628.139KB
re-id
The Comprehensive Cars (CompCars) dataset
The Comprehensive Cars (CompCars) dataset contains data from two scenarios: images from web-nature and surveillance-nature. The web-nature data contains 163 car makes with 1,716 car models. There are a total of 136,726 images capturing entire cars and 27,618 images capturing car parts. The full-car images are labeled with bounding boxes and viewpoints. Each car model is labeled with five attributes: maximum speed, displacement, number of doors, number of seats, and type of car. Please refer to our paper for details.
1.2MB
others
Visual Discriminative Question Generation (VDQG) Dataset
The dataset contains 11,202 ambiguous image pairs collected from Visual Genome. Each image pair is annotated with 4.6 discriminative questions and 5.9 non-discriminative questions on average. The dataset is used in our ICCV 2017 paper "Learning to Disambiguate by Asking Discriminative Questions".
39GB
person
re-id
WWW Crowd Dataset
The largest crowd dataset with crowd attribute annotations: we establish a large-scale crowd dataset with 10,000 videos from 8,257 scenes. 94 crowd-related attributes are designed and annotated to describe each video in the dataset. This is the first time such a large set of attributes for crowd understanding has been defined.