alon-albalak/data-selection-survey

A Survey on Data Selection for Language Models

This repo is a convenient listing of papers relevant to data selection for language models across all stages of training. It is meant to be a resource for the community, so please contribute if you see anything missing!

For more detail on these and other works, see our survey paper, A Survey on Data Selection for Language Models, by this incredible team: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

A conceptual demonstration of the data pipeline for language model training

Table of Contents

Data Selection for Pretraining

Conceptualization of objectives and constraints on data selection for pretraining

Language Filtering

Heuristic Approaches

Data Quality

Domain-Specific Selection

Data Deduplication

Filtering Toxic and Explicit Content

Specialized Selection for Multilingual Models

Data Mixing

Data Selection for Instruction-Tuning and Multitask Training

Conceptualization of objectives and constraints on data selection for instruction-tuning

Data Selection for Preference Fine-tuning: Alignment

Conceptualization of objectives and constraints on data selection for alignment

Data Selection for In-Context Learning

Conceptualization of objectives and constraints on data selection for in-context learning

Data Selection for Task-specific Fine-tuning

Conceptualization of objectives and constraints on data selection for task-specific fine-tuning

Contribution

There are likely some amazing works in the field that we missed, so please contribute to the repo.

Feel free to open a pull request with new papers, or create an issue and we will add them for you. Thank you in advance for your efforts!

Citation

We hope this work serves as inspiration for many impactful future works. If you found our work useful, please cite this paper as:

@misc{albalak2024survey,
      title={A Survey on Data Selection for Language Models}, 
      author={Alon Albalak and Yanai Elazar and Sang Michael Xie and Shayne Longpre and Nathan Lambert and Xinyi Wang and Niklas Muennighoff and Bairu Hou and Liangming Pan and Haewon Jeong and Colin Raffel and Shiyu Chang and Tatsunori Hashimoto and William Yang Wang},
      year={2024},
      eprint={2402.16827},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}