Papers and resources on data augmentation using large models

A collection of papers and resources related to large model-based data augmentation methods.

A Survey on Data Augmentation in Large Model Era

Yue Zhou*¹, Chenlu Guo*¹, Xu Wang¹, Yi Chang¹, Yuan Wu#¹

¹ Jilin University
(*: Co-first authors, #: Corresponding author)

The papers are organized according to our survey: A Survey on Data Augmentation in Large Model Era.

NOTE: Because the arXiv paper cannot be updated in real time, please refer to this repository for the latest information. We welcome pull requests and issue reports that improve the survey; contributions will be acknowledged in the Acknowledgments section.

Related projects:

Evaluation of large language models: [LLM-eval]

Table of Contents

News and Updates
Approaches
Applications
Data Post Processing
Contributing
Citation
Acknowledgments

News and updates

[03/04/2024] The second version of the paper is released on arXiv: A Survey on Data Augmentation in Large Model Era.
[01/27/2024] The first version of the paper is released on arXiv: A Survey on Data Augmentation in Large Model Era.

Approaches

Image Augmentation

Prompt-driven approaches

Text Prompt-driven

Camdiff: Camouflage image augmentation via diffusion model. Luo, X.-J. et al. arXiv 2023. [paper][code]
Diffedit: Diffusion-based semantic image editing with mask guidance. Couairon, G. et al. arXiv 2022. [paper]
Glide: Towards photorealistic image generation and editing with text-guided diffusion models. Nichol, A. et al. arXiv 2021. [paper][code]
It is all about where you start: Text-to-image generation with seed selection. Samuel, D. et al. arXiv 2023. [paper]
Plug-and-play diffusion features for text-driven image-to-image translation. Tumanyan, N. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper][code]
Prompt-to-prompt image editing with cross attention control. Hertz, A. et al. arXiv 2022. [paper][code]
Localizing Object-level Shape Variations with Text-to-Image Diffusion Models. Patashnik, O. et al. arXiv 2023. [paper][code]
Sine: Single image editing with text-to-image diffusion models. Zhang, Z. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper][code]
Text2live: Text-driven layered image and video editing. Bar-Tal, O. et al. European conference on computer vision. [paper][code]
Diffusionclip: Text-guided diffusion models for robust image manipulation. Kim, G. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper][code]
StyleGAN-NADA: CLIP-guided domain adaptation of image generators. Gal, R. et al. ACM Transactions on Graphics (TOG). [paper][code]
Diversify your vision datasets with automatic diffusion-based augmentation. Dunlap, L. et al. arXiv 2023. [paper][code]
Effective data augmentation with diffusion models. Trabucco, B. et al. arXiv 2023. [paper][code]
Imagic: Text-based real image editing with diffusion models. Kawar, B. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper]
TTIDA: Controllable Generative Data Augmentation via Text-to-Text and Text-to-Image Models. Yin, Y. et al. arXiv 2023. [paper][code]
Blended diffusion for text-driven editing of natural images. Avrahami, O. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper][code]
Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing. Wang, K. et al. arXiv 2023. [paper][code]
Instructpix2pix: Learning to follow image editing instructions. Brooks, T. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper][code]
Expressive text-to-image generation with rich text. Ge, S. et al. Proceedings of the IEEE/CVF International Conference on Computer Vision. [paper][code]
GeNIe: Generative Hard Negative Images Through Diffusion. Koohpayegani, S. et al. arXiv 2023. [paper][code]
Semantic Generative Augmentations for Few-Shot Counting. Doubinsky, P. et al. arXiv 2023. [paper]
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset. Feng, C. et al. arXiv 2024. [paper][code]
Cross domain generative augmentation: Domain generalization with latent diffusion models. Hemati, S. et al. arXiv 2023. [paper]

Visual Prompt-driven

ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation. Sun, Y. et al. arXiv 2023. [paper]
Diffusion-based data augmentation for nuclei image segmentation. Yu, X. et al. International Conference on Medical Image Computing and Computer-Assisted Intervention. [paper][code]
Image Augmentation with Controlled Diffusion for Weakly-Supervised Semantic Segmentation. Wu, W. et al. arXiv 2023. [paper]
More control for free! image synthesis with semantic diffusion guidance. Liu, X. et al. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. [paper][code]

Multimodal Prompt-driven

Visual instruction inversion: Image editing via visual prompting. Nguyen, T. et al. arXiv 2023. [paper][code]
In-context learning unlocked for diffusion models. Wang, Z. et al. arXiv 2023. [paper][code]
Smartbrush: Text and shape guided object inpainting with diffusion model. Xie, S. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper]
ReVersion: Diffusion-Based Relation Inversion from Images. Huang, Z. et al. arXiv 2023. [paper][code]
Gligen: Open-set grounded text-to-image generation. Li, Y. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper][code]
Adding conditional control to text-to-image diffusion models. Zhang, L. et al. Proceedings of the IEEE/CVF International Conference on Computer Vision. [paper][code]
Boosting Dermatoscopic Lesion Segmentation via Diffusion Models with Visual and Textual Prompts. Du, S. et al. arXiv 2023. [paper]
Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation. Schnell, J. et al. arXiv 2023. [paper]
Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities. Erfanian, M. et al. arXiv 2024. [paper][code]
Diffusion-based Data Augmentation for Object Counting Problems. Wang, Z. et al. arXiv 2024. [paper]

Subject-driven approaches

An image is worth one word: Personalizing text-to-image generation using textual inversion. Gal, R. et al. arXiv 2022. [paper][code]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. Ruiz, N. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper][code]
Instantbooth: Personalized text-to-image generation without test-time finetuning. Shi, J. et al. arXiv 2023. [paper]
Unified multi-modal latent diffusion for joint subject and text conditional image generation. Ma, Y. et al. arXiv 2023. [paper]
Multi-concept customization of text-to-image diffusion. Kumari, N. et al. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper][code]
Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Li, D. et al. arXiv 2023. [paper][code]
FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention. Xiao, G. et al. arXiv 2023. [paper][code]
Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. Wei, Y. et al. arXiv 2023. [paper][code]
Subject-driven text-to-image generation via apprenticeship learning. Chen, W. et al. arXiv 2023. [paper]

Text Augmentation

Label-based approaches

Augmented sbert: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. Thakur, N. et al. arXiv 2020. [paper][code]
Data augmentation using pre-trained transformer models. Kumar, V. et al. arXiv 2020. [paper][code]
Data augmentation for intent classification with off-the-shelf large language models. Sahu, G. et al. arXiv 2022. [paper][code]
GPT3Mix: Leveraging large-scale language models for text augmentation. Yoo, K. et al. arXiv 2021. [paper][code]
Augmenting text for spoken language understanding with Large Language Models. Sharma, R. et al. arXiv 2023. [paper]
Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges. Samuel, V. et al. arXiv 2023. [paper]
Can large language models aid in annotating speech emotional data? uncovering new frontiers. Latif, S. et al. arXiv 2023. [paper]
Generative Data Augmentation using LLMs improves Distributional Robustness in Question Answering. Chowdhury, A. et al. arXiv 2023. [paper]
MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering. Chen, X. et al. arXiv 2023. [paper]
Text Data Augmentation in Low-Resource Settings via Fine-Tuning of Large Language Models. Kaddour, J. et al. arXiv 2023. [paper]
Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation. Wu, S. et al. arXiv 2023. [paper][code]
LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition. Ye, J. et al. arXiv 2024. [paper]

Generated content-based approaches

Augesc: Dialogue augmentation with large language models for emotional support conversation. Zheng, C. et al. Findings of the Association for Computational Linguistics: ACL 2023. [paper][code]
Chataug: Leveraging chatgpt for text data augmentation. Dai, H. et al. arXiv 2023. [paper][code]
Coca: Contrastive captioners are image-text foundation models. Yu, J. et al. arXiv 2022. [paper][code]
DAGAM: Data Augmentation with Generation And Modification. Jo, B. et al. arXiv 2022. [paper][code]
Data augmentation for neural machine translation using generative language model. Oh, S. et al. arXiv 2023. [paper]
Deep Transformer based Data Augmentation with Subword Units for Morphologically Rich Online ASR. Tarján, B. et al. arXiv 2020. [paper]
Flipda: Effective and robust data augmentation for few-shot learning. Zhou, J. et al. arXiv 2021. [paper][code]
Genius: Sketch-based language model pre-training via extreme and selective masking for text generation and augmentation. Guo, B. et al. arXiv 2022. [paper][code]
Inpars: Data augmentation for information retrieval using large language models. Bonifacio, L. et al. arXiv 2022. [paper][code]
SkillBot: Towards Data Augmentation using Transformer language model and linguistic evaluation. Khatri, S. et al. 2022 Human-Centered Cognitive Systems (HCCS). [paper]
Textual data augmentation for efficient active learning on tiny datasets. Quteineh, H. et al. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). [paper]
Wanli: Worker and ai collaboration for natural language inference dataset creation. Liu, A. et al. arXiv 2022. [paper][code]
EPA: Easy Prompt Augmentation on Large Language Models via Multiple Sources and Multiple Targets. Lu, H. et al. arXiv 2023. [paper]
Tuning language models as training data generators for augmentation-enhanced few-shot learning. Meng, Y. et al. International Conference on Machine Learning. [paper][code]
Generating training data with language models: Towards zero-shot language understanding. Meng, Y. et al. Advances in Neural Information Processing Systems. [paper][code]
ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer. Saakyan, A. et al. arXiv 2023. [paper][code]
Natural Language Dataset Generation Framework for Visualizations Powered by Large Language Models. Ko, H. et al. arXiv 2023. [paper][code]
PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records. Schlegel, V. et al. arXiv 2023. [paper][code]
Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning. Gao, J. et al. The Eleventh International Conference on Learning Representations. [paper][code]
Resolving the Imbalance Issue in Hierarchical Disciplinary Topic Inference via LLM-based Data Augmentation. Cai, X. et al. arXiv 2023. [paper]
Just-in-Time Security Patch Detection--LLM At the Rescue for Data Augmentation. Tang, X. et al. arXiv 2023. [paper][code]
DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase. Li, D. et al. arXiv 2023. [paper]
ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT. Ubani, S. et al. arXiv 2023. [paper]
Large Language Models as Data Augmenters for Cold-Start Item Recommendation. Wang, J. et al. arXiv 2024. [paper]

Paired Augmentation

Mixgen: A new multi-modal data augmentation. Hao, X. et al. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. [paper][code]
PromptMix: Text-to-image diffusion models enhance the performance of lightweight networks. Bakhtiarnia, A. et al. arXiv 2023. [paper]
Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association. Wu, Q. et al. arXiv 2023. [paper]

Applications

Natural Language Processing

Text classification

Chataug: Leveraging chatgpt for text data augmentation. Dai, H. et al. arXiv 2023. [paper][code]
DAGAM: Data Augmentation with Generation And Modification. Jo, B. et al. arXiv 2022. [paper][code]
Data augmentation using pre-trained transformer models. Kumar, V. et al. arXiv 2020. [paper][code]
Self-Guided Noise-Free Data Generation for Efficient Zero-Shot Learning. Gao, J. et al. The Eleventh International Conference on Learning Representations. [paper][code]
Resolving the Imbalance Issue in Hierarchical Disciplinary Topic Inference via LLM-based Data Augmentation. Cai, X. et al. arXiv 2023. [paper]
Genius: Sketch-based language model pre-training via extreme and selective masking for text generation and augmentation. Guo, B. et al. arXiv 2022. [paper][code]
Tuning language models as training data generators for augmentation-enhanced few-shot learning. Meng, Y. et al. International Conference on Machine Learning. [paper][code]
ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer. Saakyan, A. et al. arXiv 2023. [paper][code]
DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase. Li, D. et al. arXiv 2023. [paper]

Question answering

MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering. Chen, X. et al. arXiv 2023. [paper]
Generative Data Augmentation using LLMs improves Distributional Robustness in Question Answering. Chowdhury, A. et al. arXiv 2023. [paper]
Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges. Samuel, V. et al. arXiv 2023. [paper]
CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration. Sachdeva, R. et al. arXiv 2023. [paper][code]

Machine translation

EPA: Easy Prompt Augmentation on Large Language Models via Multiple Sources and Multiple Targets. Lu, H. et al. arXiv 2023. [paper]
Data augmentation for neural machine translation using generative language model. Oh, S. et al. arXiv 2023. [paper]

Natural language inference

EPA: Easy Prompt Augmentation on Large Language Models via Multiple Sources and Multiple Targets. Lu, H. et al. arXiv 2023. [paper]
Wanli: Worker and ai collaboration for natural language inference dataset creation. Liu, A. et al. arXiv 2022. [paper][code]

Dialogue summarising

EPA: Easy Prompt Augmentation on Large Language Models via Multiple Sources and Multiple Targets. Lu, H. et al. arXiv 2023. [paper]
PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records. Schlegel, V. et al. arXiv 2023. [paper][code]

Others

Augesc: Dialogue augmentation with large language models for emotional support conversation. Zheng, C. et al. Findings of the Association for Computational Linguistics: ACL 2023. [paper][code]
Augmented sbert: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. Thakur, N. et al. arXiv 2020. [paper]
Inpars: Data augmentation for information retrieval using large language models. Bonifacio, L. et al. arXiv 2022. [paper][code]
EPA: Easy Prompt Augmentation on Large Language Models via Multiple Sources and Multiple Targets. Lu, H. et al. arXiv 2023. [paper]
Large Language Models as Data Augmenters for Cold-Start Item Recommendation. Wang, J. et al. arXiv 2024. [paper]
LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition. Ye, J. et al. arXiv 2024. [paper]

Computer Vision

Image classification

It is all about where you start: Text-to-image generation with seed selection. Samuel, D. et al. arXiv 2023. [paper]
Diversify your vision datasets with automatic diffusion-based augmentation. Dunlap, L. et al. arXiv 2023. [paper][code]
Effective data augmentation with diffusion models. Trabucco, B. et al. arXiv 2023. [paper][code]
TTIDA: Controllable Generative Data Augmentation via Text-to-Text and Text-to-Image Models. Yin, Y. et al. arXiv 2023. [paper][code]
Boosting Unsupervised Contrastive Learning Using Diffusion-Based Data Augmentation From Scratch. Zang, Z. et al. arXiv 2023. [paper][code]
GeNIe: Generative Hard Negative Images Through Diffusion. Koohpayegani, S. et al. arXiv 2023. [paper][code]
Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities. Erfanian, M. et al. arXiv 2024. [paper][code]
Cross domain generative augmentation: Domain generalization with latent diffusion models. Hemati, S. et al. arXiv 2023. [paper]

Semantic segmentation

EMIT-Diff: Enhancing Medical Image Segmentation via Text-Guided Diffusion Model. Zhang, Z. et al. arXiv 2023. [paper]
Boosting Dermatoscopic Lesion Segmentation via Diffusion Models with Visual and Textual Prompts. Du, S. et al. arXiv 2023. [paper]
Diffusion-based data augmentation for nuclei image segmentation. Yu, X. et al. International Conference on Medical Image Computing and Computer-Assisted Intervention. [paper][code]
Image Augmentation with Controlled Diffusion for Weakly-Supervised Semantic Segmentation. Wu, W. et al. arXiv 2023. [paper]
Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation. Schnell, J. et al. arXiv 2023. [paper]

Object detection

The Big Data Myth: Using Diffusion Models for Dataset Generation to Train Deep Detection Models. Voetman, R. et al. arXiv 2023. [paper]
WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation. Lu, J. et al. arXiv 2023. [paper][code]
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset. Feng, C. et al. arXiv 2024. [paper][code]
Diffusion-based Data Augmentation for Object Counting Problems. Wang, Z. et al. arXiv 2024. [paper]

Audio signal processing

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation. Wu, S. et al. arXiv 2023. [paper][code]
Augmenting text for spoken language understanding with Large Language Models. Sharma, R. et al. arXiv 2023. [paper]
Can large language models aid in annotating speech emotional data? uncovering new frontiers. Latif, S. et al. arXiv 2023. [paper]
Deep Transformer based Data Augmentation with Subword Units for Morphologically Rich Online ASR. Tarján, B. et al. arXiv 2020. [paper]
Adversarial Fine-tuning using Generated Respiratory Sound to Address Class Imbalance. Kim, J. et al. arXiv 2023. [paper][code]

Data Post Processing

Top-K Selection

Inpars: Data augmentation for information retrieval using large language models. Bonifacio, L. et al. arXiv 2022. [paper][code]
Generating training data with language models: Towards zero-shot language understanding. Meng, Y. et al. Advances in Neural Information Processing Systems. [paper][code]
Strata: Self-training with task augmentation for better few-shot learning. Vu, T. et al. arXiv 2021. [paper][code]

Model-based Approaches

CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration. Sachdeva, R. et al. arXiv 2023. [paper][code]
Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges. Samuel, V. et al. arXiv 2023. [paper]
Augmenting text for spoken language understanding with Large Language Models. Sharma, R. et al. arXiv 2023. [paper]
Data augmentation for intent classification with off-the-shelf large language models. Sahu, G. et al. arXiv 2022. [paper][code]
Flipda: Effective and robust data augmentation for few-shot learning. Zhou, J. et al. arXiv 2021. [paper][code]
Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation. Wu, S. et al. arXiv 2023. [paper][code]

Score-based Approaches

DiffuseExpand: Expanding dataset for 2D medical image segmentation using diffusion models. Shao, S. et al. arXiv 2023. [paper][code]
Image Augmentation with Controlled Diffusion for Weakly-Supervised Semantic Segmentation. Wu, W. et al. arXiv 2023. [paper]
Generative Data Augmentation using LLMs improves Distributional Robustness in Question Answering. Chowdhury, A. et al. arXiv 2023. [paper]
Augesc: Dialogue augmentation with large language models for emotional support conversation. Zheng, C. et al. Findings of the Association for Computational Linguistics: ACL 2023. [paper][code]
Wanli: Worker and ai collaboration for natural language inference dataset creation. Liu, A. et al. arXiv 2022. [paper][code]

Cluster-based Approaches

Diffusion-based data augmentation for nuclei image segmentation. Yu, X. et al. International Conference on Medical Image Computing and Computer-Assisted Intervention. [paper][code]

Contributing

We welcome contributions to LLM-data-aug-survey! If you'd like to contribute, please follow these steps:

Fork the repository.
Create a new branch incorporating your modifications.
Submit a pull request accompanied by a clear description of the changes you made.

Feel free to open an issue if you have any additions or comments.

Citation

If you find this project useful in your research or work, please consider citing it:

@article{zhou2024survey,
  title={A Survey on Data Augmentation in Large Model Era},
  author={Zhou, Yue and Guo, Chenlu and Wang, Xu and Chang, Yi and Wu, Yuan},
  journal={arXiv preprint arXiv:2401.15422},
  year={2024}
}
