allenai/c4-documentation

C4 Documentation

This is a companion website for our paper Documenting the English Colossal Clean Crawled Corpus. We present some of the first documentation for the contents of the Colossal Clean Crawled Corpus (C4), a massive web-crawled dataset used for pretraining large language models like T5.

Accessing C4

Raw data download: If you want to run your own analysis on C4, the raw C4 data is available to download here.
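As a starting point for your own analysis, here is a minimal sketch of reading one downloaded shard. It assumes the shards are gzip-compressed JSON Lines files in which each record carries a "url" field alongside its text; that layout is an assumption rather than something this page states, so check it against the files you actually download. The function names `iter_records` and `count_domains` are hypothetical helpers written for this example.

```python
import gzip
import json
from collections import Counter
from urllib.parse import urlparse


def iter_records(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_domains(path):
    """Count documents per host name.

    Assumes each record has a "url" field (an assumption about the shard format);
    records without one are counted under the empty string.
    """
    counts = Counter()
    for record in iter_records(path):
        host = urlparse(record.get("url", "")).netloc
        counts[host] += 1
    return counts
```

Calling `count_domains` on the local path of a shard returns a `Counter`, so `.most_common(10)` would list the ten most frequent hosts in that shard. Because the records are streamed line by line, the whole shard never needs to fit in memory.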

Search index: To search through the data, use our search engine at https://c4-search.apps.allenai.org/. Check it out!

Documenting issues

Our paper documents many interesting and/or problematic issues with the C4 dataset, but it is in no way an exhaustive exploration of the data. We hope the discussions page of this GitHub repository will be a welcoming place to document and discuss further issues with the data. If you find something interesting that you want to share, please start a new discussion and let everyone know what you found.
