Skip to content

snu-mllab/Efficient-Dataset-Condensation

Repository files navigation

Efficient-Dataset-Condensation

Official PyTorch implementation of "Dataset Condensation via Efficient Synthetic-Data Parameterization", published at ICML'22

image samples

Abstract The great success of machine learning with massive amounts of data comes at a price of huge computation costs and storage for training and tuning. Recent studies on dataset condensation attempt to reduce the dependence on such massive data by synthesizing a compact training dataset. However, the existing approaches have fundamental limitations in optimization due to the limited representability of synthetic datasets without considering any data regularity characteristics. To this end, we propose a novel condensation framework that generates multiple synthetic data with a limited storage budget via efficient parameterization considering data regularity. We further analyze the shortcomings of the existing gradient matching-based condensation methods and develop an effective optimization technique for improving the condensation of training data information. We propose a unified algorithm that drastically improves the quality of condensed data against the current state-of-the-art on CIFAR-10, ImageNet, and Speech Commands.

Basic results (Data Drive)

Top-1 test accuracies with ConvNet-3 (ResNetAP-10 for ImageNet)

  • 1 images/class
Method CIFAR-10 SVHN MNIST FashionMNIST
IDC-I 36.7 46.7 88.9 70.7
IDC 50.6 68.5 94.2 81.0
  • 10 images/class
Method CIFAR-10 CIFAR-100 SVHN MNIST FashionMNIST ImageNet-10 ImageNet-100
IDC-I 58.3 36.6 77.0 98.0 85.3 61.4 29.2
IDC 67.5 45.1 87.5 98.4 86.0 72.8 46.7
  • 20 images/class
Method CIFAR-100 ImageNet-10 ImageNet-100
IDC-I 41.5 65.5 34.5
IDC 49.0 76.6 53.7
  • 50 images/class
Method CIFAR-10 SVHN MNIST FashionMNIST
IDC-I 69.5 87.9 98.8 89.1
IDC 74.5 90.1 99.1 86.2

Requirements

  • The code has been tested with PyTorch 1.11.0.
  • To run the codes, install efficientnet package pip install efficientnet_pytorch

Updates

  • (2022.06.28) We uploaded CIFAR-100 condensed data with 10 and 20 images per class (link). Note, all of our data are unnormalized, so normalization is required for training.
  • (2022.06.28) We uploaded codes for speech data (see ./speech).

Test Condensed Data

Download data

You can download condensed data evaluated in our paper from Here.

  • The possible datasets are CIFAR-10, MNIST, SVHN, FashionMNIST, and ImageNet (10, 100 subclasses).
  • To test data, download the entire dataset folder (e.g., cifar10) and locate the folder at ./results.

Training neural networks on condensed data

  • Set --data_dir and --imagenet_dir in argument.py to point the folder containing the original dataset (required for measuring test accuracy).

Then run the following codes:

python test.py -d [dataset] -n [network] -f [factor] --ipc [image/class] --repeat [#repetition]
  • To evaluate IDC-I, set -f 1. To evaluate IDC, set -f 3 for ImageNet and -f 2 for others.
  • For detailed explanation for arguments, please refer to argument.py

As an example,

  • To evaluate IDC (10 images/class) on CIFAR-10 and ConvNet-3 for 3 times, run
    python test.py -d cifar10 -n convnet -f 2 --ipc 10 --repeat 3
    
  • To evaluate IDC (20 images/class) on ImageNet with 10 classes and ResNetAP-10 for 3 times, run
    python test.py -d imagenet --nclass 10 -n resnet_ap -f 3 --ipc 20 --repeat 3
    

With 10 images/class condensed data, the top-1 test accuracies of ConvNet-3 (ResNetAP-10 for ImageNet) are about

Method CIFAR-10 CIFAR-100 SVHN MNIST FashionMNIST ImageNet-10 ImageNet-100
IDC-I 58.3 36.6 77.0 98.0 85.3 61.4 29.2
IDC 67.5 45.1 87.5 98.4 86.0 72.8 46.7

You can also test other condensed methods by setting -s [dsa, kip, random, herding]

  • We provide DSA and KIP datasets in the case of CIFAR-10.
  • To evaluate Herding, download the pretrained networks (link) at ./results. You can modify the location of the pretrained networks at coreset.py (load_pretrained_herding fn).

Optimize Condensed Data

To reproduce our condensed data (except for ImageNet-100), simply run

python condense.py --reproduce  -d [dataset] -f [factor] --ipc [image/class]
  • Set --data_dir and --imagenet_dir in argument.py to point the folder containing the original dataset.
  • The results will be saved at ./results/[dataset]/[expname].
  • We provide specific argument settings for each dataset at ./misc/reproduce.py.
  • In the case of ImageNet-100, we use the tricks below for faster optimization.

Faster optimization

  1. Utilizing pretrained networks

    • To train pretrained networks (which were used in condensation stage), run
    python pretrain.py -d imagenet --nclass 100 -n resnet_ap --pt_from [pretrain epochs] --seed [seed]
    
    • In our ImageNet-100 experiments, we used --pt_from 5 and train networks with 10 random seeds.
    • For ImageNet-10, --pt_from 10 will be good.
  2. Multi-processing

    • We partition the classes and do condensation with multiple processors (condense_mp.py).
    • --nclass_sub means the number of classes per partition and --phase indicates the partition number.

To sum up, after saving the pretrained models, run

python condense_mp.py --reproduce  -d imagenet --nclass 100 --pt_from 5 -f [factor] --ipc [image/class] --nclass_sub 20 --phase [0,1,2,3,4]
  • You need to assign a different phase number to each processor.
  • In the test code, we aggregate the resulted condensed data.
  • To reduce the memory requirement, use smaller --nclass_sub (#phase = #class/nclass_sub).

Train Networks on Original Training Set

python train.py -d [dataset] -n [network]
  • Our code load data on memory at the beginning. If you don't want this, set -l False.
  • For ImageNet, you can choose the number of subclasses by --nclass [#class].
  • To save checkpoints, set --save_ckpt.

Citation

@inproceedings{kimICML22,
title = {Dataset Condensation via Efficient Synthetic-Data Parameterization},
author = {Kim, Jang-Hyun and Kim, Jinuk and Oh, Seong Joon and Yun, Sangdoo and Song, Hwanjun and Jeong, Joonhyun and Ha, Jung-Woo and Song, Hyun Oh},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2022}
}

About

Official PyTorch implementation of "Dataset Condensation via Efficient Synthetic-Data Parameterization" (ICML'22)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages