Custom Iterable Dataset Class for Large-Scale Data Loading


Language: 🇺🇸 🇨🇳

«MPDataset» implements a new iterable-style dataset class for large-scale data loading.

The directory mp/ provides a simple implementation; the tests are under tests/.

The complete MPDataset implementation has been integrated into the zcls repository. You can view it at mp_dataset.py and MPDataset.

The following are the test results based on CIFAR100:

| arch      | dataset        | shuffle | gpu | top1 (%) | top5 (%) |
|-----------|----------------|---------|-----|----------|----------|
| sfv1_3g1x | CIFAR100       | no      | 1   | 69.470   | 91.350   |
| sfv1_3g1x | MPDataset      | no      | 1   | 67.340   | 89.560   |
| sfv1_3g1x | GeneralDataset | no      | 1   | 1.010    | 4.960    |
| sfv1_3g1x | CIFAR100       | yes     | 1   | 70.350   | 91.040   |
| sfv1_3g1x | MPDataset      | yes     | 1   | 68.000   | 90.030   |
| sfv1_3g1x | GeneralDataset | yes     | 1   | 68.680   | 90.660   |
| sfv1_3g1x | CIFAR100       | no      | 3   | 69.716   | 91.112   |
| sfv1_3g1x | MPDataset      | no      | 3   | 67.367   | 89.652   |
| sfv1_3g1x | GeneralDataset | no      | 3   | 1.420    | 5.879    |
| sfv1_3g1x | CIFAR100       | yes     | 3   | 70.756   | 91.972   |
| sfv1_3g1x | MPDataset      | yes     | 3   | 68.806   | 90.252   |
| sfv1_3g1x | GeneralDataset | yes     | 3   | 68.656   | 90.472   |
  • For the dataset column, refer to Dataset:
    • CIFAR100: the dataset class provided by PyTorch
    • MPDataset: the custom iterable dataset class implemented here
    • GeneralDataset: a wrapper class based on ImageFolder
  • The complete configuration files are located under configs/.

There is no obvious difference in accuracy between MPDataset and GeneralDataset; MPDataset can even perform slightly better (the data file was created following the original data loading order, so shuffling the data first yields better results).

There is one strange phenomenon: training with the official CIFAR files always achieves better results.

Table of Contents

  • Background
  • Maintainers
  • Thanks
  • Contributing
  • License

Background

Given current large-scale training needs (tens of millions or even hundreds of millions of samples), the data loading pipeline needs further optimization. In PyTorch, data can be loaded and preprocessed in parallel across multiple worker processes. However, each process keeps a full copy of the dataset object, even though it only needs part of the data.
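As an illustration of the duplication problem, here is a minimal sketch (not code from this repository): with a map-style dataset and num_workers > 0, every DataLoader worker process receives its own copy of the dataset object, even though the main-process sampler only ever sends each worker a subset of the indices.

```python
from torch.utils.data import Dataset, DataLoader

class MapStyleDataset(Dataset):
    """Conventional map-style dataset: every worker process holds a full copy."""

    def __init__(self, num_items=1000):
        # Stands in for a large in-memory structure (e.g., a huge list of
        # file paths or labels); it is replicated into each worker process.
        self.data = list(range(num_items))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

dataset = MapStyleDataset()
# The main-process sampler picks indices and distributes them to the
# 4 workers, but each worker still carries the whole `dataset` object.
loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)
```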

In conventional map-style dataset usage, the sampler runs in the main process and distributes indices to the worker processes. Since v1.2, PyTorch provides a new iterable-style dataset class, IterableDataset, which allows a sampler to be defined and used inside every worker. This repository defines an iterable-style dataset class for loading large-scale data, which ensures that each process retains only the part of the data it needs.
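The sketch below (hypothetical names; not the actual MPDataset implementation) shows the core technique: an IterableDataset whose __iter__ calls torch.utils.data.get_worker_info() and yields only the shard belonging to the current worker, so no worker ever materializes the full dataset.

```python
import math
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedIterableDataset(IterableDataset):
    """Iterable-style dataset: each worker iterates only over its own shard."""

    def __init__(self, sample_paths):
        self.sample_paths = sample_paths

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: iterate over everything.
            start, end = 0, len(self.sample_paths)
        else:
            # Per-worker sampling: split the sample list evenly so each
            # worker loads only its own slice into memory.
            per_worker = math.ceil(len(self.sample_paths) / info.num_workers)
            start = info.id * per_worker
            end = min(start + per_worker, len(self.sample_paths))
        for path in self.sample_paths[start:end]:
            yield path  # in practice: read and preprocess the sample here

paths = [f"data/sample_{i:06d}.bin" for i in range(100)]
loader = DataLoader(ShardedIterableDataset(paths), batch_size=8, num_workers=4)
```

For multi-GPU training, the same idea extends naturally: shard first by distributed rank, then by worker id within each rank.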

Maintainers

  • zhujian - Initial work - zjykzj

Thanks

Contributing

Anyone's participation is welcome! Open an issue or submit PRs.

Small note: If editing the README, please conform to the standard-readme specification.

License

Apache License 2.0 © 2021 zjykzj
