Custom Iterable Dataset Class for Large-Scale Data Loading


Language: 🇺🇸 🇨🇳

«MPDataset» implements a new iterable-style dataset class for large-scale data loading.

The directory mp/ provides a simple implementation; the tests are under tests/.

The complete MPDataset implementation has been integrated into the zcls repository. You can view it at mp_dataset.py and MPDataset.

The following are the test results based on CIFAR100:

| arch      | dataset        | shuffle | gpu | top1 (%) | top5 (%) |
|-----------|----------------|---------|-----|----------|----------|
| sfv1_3g1x | CIFAR100       | no      | 1   | 69.470   | 91.350   |
| sfv1_3g1x | MPDataset      | no      | 1   | 67.340   | 89.560   |
| sfv1_3g1x | GeneralDataset | no      | 1   | 1.010    | 4.960    |
| sfv1_3g1x | CIFAR100       | yes     | 1   | 70.350   | 91.040   |
| sfv1_3g1x | MPDataset      | yes     | 1   | 68.000   | 90.030   |
| sfv1_3g1x | GeneralDataset | yes     | 1   | 68.680   | 90.660   |
| sfv1_3g1x | CIFAR100       | no      | 3   | 69.716   | 91.112   |
| sfv1_3g1x | MPDataset      | no      | 3   | 67.367   | 89.652   |
| sfv1_3g1x | GeneralDataset | no      | 3   | 1.420    | 5.879    |
| sfv1_3g1x | CIFAR100       | yes     | 3   | 70.756   | 91.972   |
| sfv1_3g1x | MPDataset      | yes     | 3   | 68.806   | 90.252   |
| sfv1_3g1x | GeneralDataset | yes     | 3   | 68.656   | 90.472   |
  • For the dataset column, refer to Dataset:
    • CIFAR100: the dataset class provided by PyTorch
    • MPDataset: the custom iterable dataset class implemented here
    • GeneralDataset: a wrapper class based on ImageFolder
  • The complete configuration files are located under configs/.

There is no obvious difference in accuracy between MPDataset and GeneralDataset; MPDataset can even perform slightly better (the data file was created following the original data loading order, so shuffling the data first yields better results).

There is one strange phenomenon: training with the official CIFAR files always achieves better results.

Table of Contents

  • Background
  • Maintainers
  • Thanks
  • Contributing
  • License

Background

Given current large-scale training needs (tens of millions or even hundreds of millions of samples), the data loading pipeline needs further optimization. In PyTorch, data can be loaded and preprocessed in parallel across multiple worker processes. However, each process keeps a full copy of the dataset object, even though it only needs part of the data.
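As an illustration of the duplication problem, here is a minimal sketch (not code from this repository): with a map-style dataset and num_workers > 0, every DataLoader worker process receives its own copy of the dataset object, even though the main-process sampler only ever sends each worker a subset of the indices.

```python
from torch.utils.data import Dataset, DataLoader

class MapStyleDataset(Dataset):
    """Conventional map-style dataset: every worker process holds a full copy."""

    def __init__(self, num_items=1000):
        # Stands in for a large in-memory structure (e.g., a huge list of
        # file paths or labels); it is replicated into each worker process.
        self.data = list(range(num_items))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

dataset = MapStyleDataset()
# The main-process sampler picks indices and distributes them to the
# 4 workers, but each worker still carries the whole `dataset` object.
loader = DataLoader(dataset, batch_size=32, num_workers=4, shuffle=True)
```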

In conventional map-style dataset usage, the sampler runs in the main process and distributes indices to the worker processes. Since v1.2, PyTorch provides a new iterable-style dataset class, IterableDataset, which allows a sampler to be defined and used inside every worker. This repository defines an iterable-style dataset class for loading large-scale data, which ensures that each process retains only the part of the data it needs.
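The sketch below (hypothetical names; not the actual MPDataset implementation) shows the core technique: an IterableDataset whose __iter__ calls torch.utils.data.get_worker_info() and yields only the shard belonging to the current worker, so no worker ever materializes the full dataset.

```python
import math
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class ShardedIterableDataset(IterableDataset):
    """Iterable-style dataset: each worker iterates only over its own shard."""

    def __init__(self, sample_paths):
        self.sample_paths = sample_paths

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: iterate over everything.
            start, end = 0, len(self.sample_paths)
        else:
            # Per-worker sampling: split the sample list evenly so each
            # worker loads only its own slice into memory.
            per_worker = math.ceil(len(self.sample_paths) / info.num_workers)
            start = info.id * per_worker
            end = min(start + per_worker, len(self.sample_paths))
        for path in self.sample_paths[start:end]:
            yield path  # in practice: read and preprocess the sample here

paths = [f"data/sample_{i:06d}.bin" for i in range(100)]
loader = DataLoader(ShardedIterableDataset(paths), batch_size=8, num_workers=4)
```

For multi-GPU training, the same idea extends naturally: shard first by distributed rank, then by worker id within each rank.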

Maintainers

  • zhujian - Initial work - zjykzj

Thanks

Contributing

Anyone's participation is welcome! Open an issue or submit PRs.

Small note: If editing the README, please conform to the standard-readme specification.

License

Apache License 2.0 © 2021 zjykzj
