mmcls.datasets¶

The datasets package contains several commonly used datasets for image classification tasks, along with some dataset wrappers.

Custom Dataset¶

class mmcls.datasets.CustomDataset(data_prefix: str, pipeline: Sequence = (), classes: Optional[Union[str, Sequence[str]]] = None, ann_file: Optional[str] = None, extensions: Sequence[str] = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif'), test_mode: bool = False, file_client_args: Optional[dict] = None)[source]¶

Custom dataset for classification.

The dataset supports two annotation formats.

First way: an annotation file is provided, and each line describes one sample:

The sample files:

data_prefix/
├── folder_1
│   ├── xxx.png
│   ├── xxy.png
│   └── ...
└── folder_2
    ├── 123.png
    ├── nsdf3.png
    └── ...

The annotation file (the first column is the image path and the second column is the category index):

folder_1/xxx.png 0
folder_1/xxy.png 1
folder_2/123.png 5
folder_2/nsdf3.png 3
...

Specify the category names with the classes argument.

Second way: the samples are arranged so that each sub-folder of data_prefix holds the images of one class:

data_prefix/
├── class_x
│   ├── xxx.png
│   ├── xxy.png
│   └── ...
│       └── xxz.png
└── class_y
    ├── 123.png
    ├── nsdf3.png
    ├── ...
    └── asd932_.png

If ann_file is specified, the dataset is built the first way. Otherwise, it is built the second way.
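The two layouts can be illustrated with a small, self-contained sketch. The helpers `parse_ann_lines` and `scan_class_folders` are hypothetical names used only for illustration; they are not part of mmcls and merely mimic the behavior described above, assuming labels are assigned to sub-folders in sorted order.

```python
import os
import tempfile


def parse_ann_lines(lines):
    """First way: parse 'relative/path label' lines into (path, label) pairs."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace run so the label is always the final column.
        path, label = line.rsplit(maxsplit=1)
        samples.append((path, int(label)))
    return samples


def scan_class_folders(data_prefix, extensions=('.jpg', '.jpeg', '.png')):
    """Second way: use sub-folder names as classes and collect their images."""
    classes = sorted(
        d for d in os.listdir(data_prefix)
        if os.path.isdir(os.path.join(data_prefix, d)))
    samples = []
    for label, name in enumerate(classes):
        # Walk recursively so images in nested folders are found as well.
        for root, _, files in os.walk(os.path.join(data_prefix, name)):
            for fname in files:
                if fname.lower().endswith(tuple(extensions)):
                    full = os.path.join(root, fname)
                    samples.append((os.path.relpath(full, data_prefix), label))
    samples.sort()
    return classes, samples


# Demo on a temporary folder laid out like the second format.
with tempfile.TemporaryDirectory() as data_prefix:
    for rel in ('class_x/xxx.png', 'class_y/123.png', 'class_y/notes.txt'):
        path = os.path.join(data_prefix, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'w').close()
    classes, samples = scan_class_folders(data_prefix)
```

In this sketch, files whose extension is not in `extensions` (such as `notes.txt`) are skipped, which mirrors the role of the extensions argument.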

Parameters

data_prefix (str) – The path of data directory.
pipeline (Sequence[dict]) – A list of dicts, where each element represents an operation defined in mmcls.datasets.pipelines. Defaults to an empty tuple.
classes (str | Sequence[str], optional) –
Specify names of classes.
- If it is a string, it should be a file path, and every line of the file is the name of a class.
- If it is a sequence of strings, every item is the name of a class.
- If it is None, use cls.CLASSES or the names of the sub-folders (when the samples are arranged the second way).
Defaults to None.
ann_file (str, optional) – The annotation file. If it is a string, sample paths are read from ann_file. If it is None, samples are searched for in data_prefix. Defaults to None.
extensions (Sequence[str]) – A sequence of allowed extensions. Defaults to (‘.jpg’, ‘.jpeg’, ‘.png’, ‘.ppm’, ‘.bmp’, ‘.pgm’, ‘.tif’).
test_mode (bool) – In train mode or test mode. It’s only a mark and won’t be used in this class. Defaults to False.
file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmcv.fileio.FileClient for details. If None, they are inferred automatically from the specified path. Defaults to None.
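A configuration for this class might look like the sketch below. All paths and class names are hypothetical placeholders; such a dict could presumably be passed to build_dataset in the same way as the MultiTaskDataset examples later on this page.

```python
# Hypothetical paths and class names, for illustration only.
train_cfg = dict(
    type='CustomDataset',
    data_prefix='data/my_dataset/train',
    # Leave out ann_file (None) to fall back to the sub-folder layout.
    ann_file='data/my_dataset/meta/train.txt',
    classes=['cat', 'dog'],
    # Fill in with operations defined in mmcls.datasets.pipelines.
    pipeline=[],
)
```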

ImageNet¶

class mmcls.datasets.ImageNet(data_prefix: str, pipeline: Sequence = (), classes: Optional[Union[str, Sequence[str]]] = None, ann_file: Optional[str] = None, test_mode: bool = False, file_client_args: Optional[dict] = None)[source]¶

ImageNet Dataset.

The dataset supports two kinds of annotation format. More details can be found in CustomDataset.

Parameters

data_prefix (str) – The path of data directory.
pipeline (Sequence[dict]) – A list of dicts, where each element represents an operation defined in mmcls.datasets.pipelines. Defaults to an empty tuple.
classes (str | Sequence[str], optional) –
Specify names of classes.
- If it is a string, it should be a file path, and every line of the file is the name of a class.
- If it is a sequence of strings, every item is the name of a class.
- If it is None, use the default ImageNet-1k class names.
Defaults to None.
ann_file (str, optional) – The annotation file. If it is a string, sample paths are read from ann_file. If it is None, samples are searched for in data_prefix. Defaults to None.
extensions (Sequence[str]) – A sequence of allowed extensions. Defaults to (‘.jpg’, ‘.jpeg’, ‘.png’, ‘.ppm’, ‘.bmp’, ‘.pgm’, ‘.tif’).
test_mode (bool) – In train mode or test mode. It’s only a mark and won’t be used in this class. Defaults to False.
file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmcv.fileio.FileClient for details. If None, they are inferred automatically from the specified path. Defaults to None.

class mmcls.datasets.ImageNet21k(data_prefix: str, pipeline: Sequence = (), classes: Optional[Union[str, Sequence[str]]] = None, ann_file: Optional[str] = None, serialize_data: bool = True, multi_label: bool = False, recursion_subdir: bool = True, test_mode=False, file_client_args: Optional[dict] = None)[source]¶

ImageNet21k Dataset.

The ImageNet21k dataset is extremely large: it contains 21k+ classes and about 14 million files. To save memory, this class extends ImageNet by enabling the serialize_data option by default. With this option, the annotations are not stored in the list data_infos but are serialized as an array.
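The idea behind serialize_data can be sketched as follows. This is not the mmcls implementation: the helper names `serialize` and `get_item` are hypothetical, and a plain bytes buffer stands in for whatever array type the library actually uses. The point is only that many small Python objects are replaced by one flat buffer plus a list of offsets.

```python
import pickle
from itertools import accumulate


def serialize(data_infos):
    """Pack a list of annotation dicts into one buffer plus end offsets."""
    chunks = [pickle.dumps(info, protocol=pickle.HIGHEST_PROTOCOL)
              for info in data_infos]
    # offsets[i] is the end position of item i inside the buffer.
    offsets = list(accumulate(len(chunk) for chunk in chunks))
    return b''.join(chunks), offsets


def get_item(buffer, offsets, idx):
    """Recover the idx-th annotation from the packed buffer."""
    start = 0 if idx == 0 else offsets[idx - 1]
    return pickle.loads(buffer[start:offsets[idx]])


# Hypothetical annotation entries, for illustration only.
infos = [{'img_path': 'a.jpg', 'gt_label': 0},
         {'img_path': 'b.jpg', 'gt_label': 3}]
buffer, offsets = serialize(infos)
restored = get_item(buffer, offsets, 1)
```

Each lookup pays a small deserialization cost in exchange for holding a single large object in memory.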

Parameters

data_prefix (str) – The path of data directory.
pipeline (Sequence[dict]) – A list of dicts, where each element represents an operation defined in mmcls.datasets.pipelines. Defaults to an empty tuple.
classes (str | Sequence[str], optional) –
Specify names of classes.
- If it is a string, it should be a file path, and every line of the file is the name of a class.
- If it is a sequence of strings, every item is the name of a class.
- If is None, the object won’t have category information. (Not recommended)
Defaults to None.
ann_file (str, optional) – The annotation file. If it is a string, sample paths are read from ann_file. If it is None, samples are searched for in data_prefix. Defaults to None.
serialize_data (bool) – Whether to hold memory using serialized objects. When enabled, data loader workers can use shared RAM from the master process instead of making a copy. Defaults to True.
multi_label (bool) – Whether to use multi-label annotations. Not implemented yet. Defaults to False.
recursion_subdir (bool) – Deprecated. The dataset now always collects images recursively.
test_mode (bool) – In train mode or test mode. It’s only a mark and won’t be used in this class. Defaults to False.
file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmcv.fileio.FileClient for details. If None, they are inferred automatically from the specified path. Defaults to None.

CIFAR¶

class mmcls.datasets.CIFAR10(data_prefix, pipeline, classes=None, ann_file=None, test_mode=False)[source]¶

CIFAR10 Dataset.

This implementation is modified from https://github.com/pytorch/vision/blob/master/torchvision/datasets/cifar.py

class mmcls.datasets.CIFAR100(data_prefix, pipeline, classes=None, ann_file=None, test_mode=False)[source]¶

CIFAR100 Dataset.

MNIST¶

class mmcls.datasets.MNIST(data_prefix, pipeline, classes=None, ann_file=None, test_mode=False)[source]¶

MNIST Dataset.

This implementation is modified from https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py

class mmcls.datasets.FashionMNIST(data_prefix, pipeline, classes=None, ann_file=None, test_mode=False)[source]¶

Fashion-MNIST Dataset.

VOC¶

class mmcls.datasets.VOC(**kwargs)[source]¶

Pascal VOC Dataset.

Base classes¶

class mmcls.datasets.BaseDataset(data_prefix, pipeline, classes=None, ann_file=None, test_mode=False)[source]¶

Base dataset.

Parameters

data_prefix (str) – the prefix of data path
pipeline (list) – a list of dicts, where each element represents an operation defined in mmcls.datasets.pipelines
ann_file (str | None) – the annotation file. When ann_file is str, the subclass is expected to read from the ann_file. When ann_file is None, the subclass is expected to read according to data_prefix
test_mode (bool) – in train mode or test mode

class mmcls.datasets.MultiLabelDataset(data_prefix, pipeline, classes=None, ann_file=None, test_mode=False)[source]¶

Multi-label Dataset.

Multi-Task Dataset¶

class mmcls.datasets.MultiTaskDataset(ann_file: str, metainfo: Optional[dict] = None, data_root: Optional[str] = None, data_prefix: Optional[str] = None, pipeline: Sequence = (), test_mode: bool = False, file_client_args: Optional[dict] = None)[source]¶

Custom dataset for multi-task dataset.

To use the dataset, generate and provide an annotation file in the following format:

{
  "metainfo": {
    "tasks":
      [
        {"name": "gender",
         "type": "single-label",
         "categories": ["male", "female"]},
        {"name": "wear",
         "type": "multi-label",
         "categories": ["shirt", "coat", "jeans", "pants"]}
      ]
  },
  "data_list": [
    {
      "img_path": "a.jpg",
      "gender_img_label": 0,
      "wear_img_label": [1, 0, 1, 0]
    },
    {
      "img_path": "b.jpg",
      "gender_img_label": 1,
      "wear_img_label": [0, 1, 0, 1]
    }
  ]
}
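As a rough sanity check for such an annotation file, a script along the following lines could be used. The helper `check_annotation` is hypothetical and not part of mmcls; it assumes, based only on the example above, that each label key is named `<task name>_img_label`, that a single-label task stores one category index, and that a multi-label task stores one 0/1 flag per category.

```python
import json


def check_annotation(annotation):
    """Check every sample against the task definitions in metainfo."""
    tasks = annotation["metainfo"]["tasks"]
    for sample in annotation["data_list"]:
        if "img_path" not in sample:
            raise ValueError("sample is missing 'img_path'")
        for task in tasks:
            label = sample[f"{task['name']}_img_label"]
            num_classes = len(task["categories"])
            if task["type"] == "single-label":
                # One category index per sample.
                if not 0 <= label < num_classes:
                    raise ValueError(f"bad label for task {task['name']}")
            elif len(label) != num_classes:
                # One 0/1 flag per category.
                raise ValueError(f"bad label length for task {task['name']}")
    return True


annotation = {
    "metainfo": {"tasks": [
        {"name": "gender", "type": "single-label",
         "categories": ["male", "female"]},
        {"name": "wear", "type": "multi-label",
         "categories": ["shirt", "coat", "jeans", "pants"]},
    ]},
    "data_list": [
        {"img_path": "a.jpg", "gender_img_label": 0,
         "wear_img_label": [1, 0, 1, 0]},
        {"img_path": "b.jpg", "gender_img_label": 1,
         "wear_img_label": [0, 1, 0, 1]},
    ],
}
is_valid = check_annotation(annotation)
# The text that would be written to a file such as train.json.
text = json.dumps(annotation, indent=2)
```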

Assume we put our dataset in the data/mydataset folder in the repository and organize it as follows:

mmclassification/
└── data
    └── mydataset
        ├── annotation
        │   ├── train.json
        │   ├── test.json
        │   └── val.json
        ├── train
        │   ├── a.jpg
        │   └── ...
        ├── test
        │   ├── b.jpg
        │   └── ...
        └── val
            ├── c.jpg
            └── ...

We can use the following config to build datasets:

>>> from mmcls.datasets import build_dataset
>>> train_cfg = dict(
...     type="MultiTaskDataset",
...     ann_file="annotation/train.json",
...     data_root="data/mydataset",
...     # The `img_path` field in the train annotation file is relative
...     # to the `train` folder.
...     data_prefix='train',
... )
>>> train_dataset = build_dataset(train_cfg)

Or we can put all files in the same folder:

mmclassification/
└── data
    └── mydataset
         ├── train.json
         ├── test.json
         ├── val.json
         ├── a.jpg
         ├── b.jpg
         ├── c.jpg
         └── ...

And we can use the following config to build datasets:

>>> from mmcls.datasets import build_dataset
>>> train_cfg = dict(
...     type="MultiTaskDataset",
...     ann_file="train.json",
...     data_root="data/mydataset",
...     # the `data_prefix` is not required since all paths are
...     # relative to the `data_root`.
... )
>>> train_dataset = build_dataset(train_cfg)

Parameters

ann_file (str) – The annotation file path. It can be either an absolute path or a path relative to data_root.
metainfo (dict, optional) – The extra meta information. It should be a dict with the same format as the "metainfo" field in the annotation file. Defaults to None.
data_root (str, optional) – The root path of the data directory. It is the prefix of both data_prefix and ann_file, and it can be a remote path like “s3://openmmlab/xxx/”. Defaults to None.
data_prefix (str, optional) – The base folder relative to the data_root for the "img_path" field in the annotation file. Defaults to None.
pipeline (Sequence[dict]) – A list of dicts, where each element represents an operation defined in mmcls.datasets.pipelines. Defaults to an empty tuple.
test_mode (bool) – in train mode or test mode. Defaults to False.
file_client_args (dict, optional) – Arguments to instantiate a FileClient. See mmcv.fileio.FileClient for details. If None, they are inferred automatically from data_root. Defaults to None.
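How data_root, data_prefix, ann_file and the "img_path" field combine can be sketched as below. The helper `resolve_paths` is a hypothetical illustration of the rules stated in the parameter descriptions, not the library's code; it uses POSIX-style joining and does not cover every case (for instance, an absolute data_prefix).

```python
import posixpath


def resolve_paths(ann_file, img_path, data_root=None, data_prefix=None):
    """Return the full annotation path and the full image path."""
    # A relative ann_file is interpreted relative to data_root.
    if data_root is not None and not posixpath.isabs(ann_file):
        ann_file = posixpath.join(data_root, ann_file)
    # img_path is relative to data_prefix, which is relative to data_root.
    parts = [p for p in (data_root, data_prefix, img_path) if p]
    return ann_file, posixpath.join(*parts)


# First layout: annotations and images live in separate sub-folders.
split_layout = resolve_paths(
    "annotation/train.json", "a.jpg",
    data_root="data/mydataset", data_prefix="train")

# Second layout: everything sits directly in data_root.
flat_layout = resolve_paths("train.json", "a.jpg", data_root="data/mydataset")
```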

Dataset Wrappers¶

class mmcls.datasets.ConcatDataset(datasets, separate_eval=True)[source]¶

A wrapper of concatenated dataset.

Same as torch.utils.data.dataset.ConcatDataset, but adds a get_cat_ids function.

Parameters

datasets (list[BaseDataset]) – A list of datasets.
separate_eval (bool) – Whether to evaluate the results separately if it is used as validation dataset. Defaults to True.
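A minimal sketch of how get_cat_ids could work on a concatenation is shown below. `ToyDataset` and `ToyConcat` are hypothetical stand-ins written for this example; the real wrapper builds on the PyTorch class and may differ in detail. A global index is mapped to a sub-dataset and a local index through cumulative sizes.

```python
import bisect
from itertools import accumulate


class ToyDataset:
    """Hypothetical stand-in for a dataset with one category id per sample."""

    def __init__(self, labels):
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def get_cat_ids(self, idx):
        return [self.labels[idx]]


class ToyConcat:
    """Concatenation that forwards get_cat_ids to the right sub-dataset."""

    def __init__(self, datasets):
        self.datasets = datasets
        # cumulative_sizes[i] is the total length of datasets[0..i].
        self.cumulative_sizes = list(accumulate(len(d) for d in datasets))

    def __len__(self):
        return self.cumulative_sizes[-1]

    def get_cat_ids(self, idx):
        dataset_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        offset = 0 if dataset_idx == 0 else self.cumulative_sizes[dataset_idx - 1]
        return self.datasets[dataset_idx].get_cat_ids(idx - offset)


concat = ToyConcat([ToyDataset([0, 1]), ToyDataset([2, 3, 4])])
```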

class mmcls.datasets.RepeatDataset(dataset, times)[source]¶

A wrapper of repeated dataset.

The length of the repeated dataset is the length of the original dataset multiplied by times. This is useful when the data loading time is long but the dataset is small. Using RepeatDataset can reduce the data loading time between epochs.

Parameters

dataset (BaseDataset) – The dataset to be repeated.
times (int) – Repeat times.
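The indexing behind such a wrapper can be sketched in a few lines. `ToyRepeat` is a hypothetical class written for this example and only illustrates the idea of wrapping indices around the original length.

```python
class ToyRepeat:
    """Present a dataset as if it were repeated `times` times."""

    def __init__(self, dataset, times):
        self.dataset = dataset
        self.times = times
        self._ori_len = len(dataset)

    def __len__(self):
        return self.times * self._ori_len

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Wrap around so that every pass maps back to the original samples.
        return self.dataset[idx % self._ori_len]


repeated = ToyRepeat(['a', 'b', 'c'], times=2)
items = list(repeated)
```

No sample is copied: the wrapper only stores the original dataset and the repeat count.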

class mmcls.datasets.ClassBalancedDataset(dataset, oversample_thr)[source]¶

A wrapper of repeated dataset with repeat factor.

Suitable for training on class imbalanced datasets like LVIS. Following the sampling strategy in this paper, in each epoch, an image may appear multiple times based on its “repeat factor”.

The repeat factor for an image is a function of the frequency of the rarest category labeled in that image. The “frequency of category c” in [0, 1] is defined by the fraction of images in the training set (without repeats) in which category c appears.

The dataset needs to implement self.get_cat_ids() to support ClassBalancedDataset.

The repeat factor is computed as follows.

1. For each category c, compute the fraction \(f(c)\) of images that contain it.
2. For each category c, compute the category-level repeat factor, where \(t\) is oversample_thr:

\[r(c) = \max(1, \sqrt{\frac{t}{f(c)}})\]

3. For each image I and its labels \(L(I)\), compute the image-level repeat factor

\[r(I) = \max_{c \in L(I)} r(c)\]
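The steps above can be worked through in a short sketch. The function `repeat_factors` is a hypothetical helper, not the wrapper's own code; it assumes every image has at least one label and takes the labels as plain lists of category ids. How a fractional factor is turned into a whole number of repeats is not specified here and is not shown.

```python
import math
from collections import Counter


def repeat_factors(image_labels, oversample_thr):
    """Compute the image-level repeat factor r(I) for every image."""
    num_images = len(image_labels)
    # Step 1: f(c), the fraction of images that contain category c.
    counts = Counter(c for labels in image_labels for c in set(labels))
    freq = {c: n / num_images for c, n in counts.items()}
    # Step 2: r(c) = max(1, sqrt(t / f(c))).
    cat_repeat = {c: max(1.0, math.sqrt(oversample_thr / f))
                  for c, f in freq.items()}
    # Step 3: r(I) = max over the categories labeled in image I.
    return [max(cat_repeat[c] for c in labels) for labels in image_labels]


# Category 0 appears in all four images, category 1 in only one of them.
factors = repeat_factors([[0], [0], [0], [0, 1]], oversample_thr=0.5)
```

Here f(0) = 1.0 and f(1) = 0.25, so only the image that contains category 1 gets a factor above 1, namely sqrt(0.5 / 0.25) = sqrt(2).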

Parameters

dataset (BaseDataset) – The dataset to be repeated.
oversample_thr (float) – Frequency threshold below which data is repeated. For categories with f(c) >= oversample_thr, there is no oversampling. For categories with f(c) < oversample_thr, the degree of oversampling follows the square-root inverse frequency heuristic above.