google-research-datasets/clang8

cLang-8 Dataset

cLang-8 (“cleaned Lang-8”) is a dataset for grammatical error correction (GEC). The source sentences originate from the popular NAIST Lang-8 Learner Corpora, while the target sentences are generated by our state-of-the-art GEC method called gT5. The method is described in our ACL-IJCNLP 2021 paper.

The paper shows that fine-tuning a T5-11B model on cLang-8 yields SOTA performance on GEC for English. cLang-8 thus simplifies a typical GEC training pipeline consisting of multiple fine-tuning stages.

Dataset Preparation

cLang-8 is generated by combining the target sentences found under the targets/ directory of this repository with the source sentences from the original Lang-8 corpus, which has to be downloaded separately. Specifically, you need to complete the following steps:

  1. Install Git Large File Storage (if not already installed) and clone this repository.
  2. Fill in this form, after which you will receive an email with a link to “the raw format containing all the data up to 2010”.
  3. Follow the link to download a zip file and extract it.
  4. Update the LANG8_DIR variable in run.sh to point to the resulting extracted directory.
  5. Run the command sh run.sh, which installs the required Python 3 dependencies in a virtualenv and aligns the source and target sentences.

NB: Running the above script takes about 1 hour when spaCy tokenization is enabled. Enabling it is recommended, because it makes the tokenization consistent with the CoNLL-14 and BEA evaluation sets (see also the next section).

Tokenization Post-Processing for CoNLL-14

After training a model and computing predictions on the CoNLL-14 test set for the paper, we ran some post-processing steps found in retokenize.py to fix tokenization discrepancies. This improves the F0.5 scores by about 2.5 points (for T5 xxl).
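For readers unfamiliar with the metric, the sketch below shows how an F0.5 score is computed from precision and recall using the standard F-beta formula. The function name `f_beta` is hypothetical and is not part of this repository; the actual scores in the paper come from the official evaluation tooling, not from this snippet.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Return the F-beta score for the given precision and recall.

    With beta = 0.5, precision is weighted more heavily than recall,
    which is the usual convention for GEC evaluation.
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    if beta <= 0:
        raise ValueError("beta must be positive")

    beta_squared = beta * beta
    denominator = beta_squared * precision + recall
    if denominator == 0.0:
        # Both precision and recall are zero, so the score is zero too.
        return 0.0
    return (1.0 + beta_squared) * precision * recall / denominator
```

As an illustration with made-up numbers (not results from the paper), a precision of 0.8 and a recall of 0.4 give an F0.5 of about 0.667, compared with an F1 of about 0.533, which shows how the metric rewards precise corrections.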

You may instead want to try applying the post-processing steps to cLang-8 targets before training a model.

Data Format

The resulting cLang-8 data files are saved under the ./output_data/ directory as TSV files, with a single tab-separated (source, target) pair per line. Three separate TSV files are generated, one for each of the following languages:

| Language | Number of examples |
|----------|--------------------|
| English  | 2,372,119          |
| German   | 114,405            |
| Russian  | 44,830             |
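As a minimal sketch of how the generated files could be consumed, the snippet below reads (source, target) pairs from one TSV file and counts how many sources were actually changed by the correction. The helper names `read_pairs` and `count_changed` are hypothetical, and the exact output file names are not stated here, so pass whichever path run.sh produced under ./output_data/.

```python
from pathlib import Path
from typing import Iterable, Iterator, Tuple


def read_pairs(tsv_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from a two-column, tab-separated file."""
    with tsv_path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            # Split on tabs only, so quotes inside sentences stay literal.
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 2:
                raise ValueError(
                    f"{tsv_path}:{line_number}: expected 2 columns, "
                    f"got {len(fields)}"
                )
            yield fields[0], fields[1]


def count_changed(pairs: Iterable[Tuple[str, str]]) -> int:
    """Count the pairs whose target differs from the source."""
    return sum(1 for source, target in pairs if source != target)
```

Because `read_pairs` is a generator, it streams the file line by line, which keeps memory use low even for the English file with more than two million examples. It raises an error on malformed lines rather than skipping them, on the assumption that every line holds exactly one pair.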

How to Cite cLang-8

Please cite the following works if you use cLang-8:

@inproceedings{rothe2021a,
  title = {{A Simple Recipe for Multilingual Grammatical Error Correction}},
  author = {Rothe, Sascha and Mallinson, Jonathan and Malmi, Eric and Krause, Sebastian and Severyn, Aliaksei},
  booktitle = {Proc. of ACL-IJCNLP},
  year = {2021}
}

@inproceedings{mizumoto2011mining,
  title={{Mining revision log of language learning SNS for automated Japanese error correction of second language learners}},
  author={Mizumoto, Tomoya and Komachi, Mamoru and Nagata, Masaaki and Matsumoto, Yuji},
  booktitle={Proc. of 5th International Joint Conference on Natural Language Processing},
  pages={147--155},
  year={2011}
}

License

Similar to the original Lang-8 corpus, cLang-8 is distributed for research and educational purposes only. Specifically, cLang-8 is released under the CC BY-NC-SA 4.0 license.

The code is distributed under the Apache 2.0 license.

Contact Us

If you have a technical question regarding the dataset, code, or publication, please create an issue in this repository.
