Official implementation of the paper "GECToR – Grammatical Error Correction: Tag, Not Rewrite", accepted at the 15th Workshop on Innovative Use of NLP for Building Educational Applications (BEA, co-located with ACL 2020). https://arxiv.org/abs/2005.12592


GECToR – Grammatical Error Correction: Tag, Not Rewrite

This repository provides code for training and testing state-of-the-art models for grammatical error correction with the official PyTorch implementation of the following paper:

GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)

It is mainly based on AllenNLP and transformers.

Installation

The following command installs all necessary packages:

pip install -r requirements.txt

The project was tested using Python 3.7.

Datasets

All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data must first be preprocessed and converted to a special format with the following command:

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
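The core idea behind this format is "tag, not rewrite": each source token is aligned with the target sentence and labelled with an edit operation instead of the target being generated from scratch. The following is a toy illustration of that alignment step only; the actual label scheme emitted by preprocess_data.py differs in detail, and the tag names and separators below are assumptions.

```python
import difflib

def edit_tags(source_tokens, target_tokens):
    """Toy sketch of tag derivation: label every source token with an
    edit operation (KEEP/REPLACE/DELETE/APPEND) so that applying the
    tags to the source reproduces the target."""
    tags = ["$KEEP"] * len(source_tokens)
    matcher = difflib.SequenceMatcher(a=source_tokens, b=target_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            for i in range(i1, i2):
                # toy 1:1 mapping of source tokens to replacement tokens
                j = min(j1 + (i - i1), j2 - 1)
                tags[i] = f"$REPLACE_{target_tokens[j]}"
        elif op == "delete":
            for i in range(i1, i2):
                tags[i] = "$DELETE"
        elif op == "insert":
            # attach appended tokens to the preceding source token
            anchor = max(i1 - 1, 0)
            tags[anchor] += "".join(f"|$APPEND_{t}" for t in target_tokens[j1:j2])
    return tags

src = "She go to school".split()
tgt = "She goes to school".split()
print(list(zip(src, edit_tags(src, tgt))))
```

Here "go" would receive a $REPLACE_goes tag while the remaining tokens are tagged $KEEP, which is why a simple per-token classifier is enough at inference time.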

Pretrained models

Pretrained encoder        Confidence bias   Min error prob   CoNLL-2014 (test)   BEA-2019 (test)
BERT [link]               0.10              0.41             63.0                67.6
RoBERTa [link]            0.20              0.50             64.0                71.5
XLNet [link]              0.35              0.66             65.3                72.4
RoBERTa + XLNet           0.24              0.45             66.0                73.7
BERT + RoBERTa + XLNet    0.16              0.40             66.5                73.6
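The last two rows are ensembles of single models. One plausible sketch of how such an ensemble can be combined, assuming a simple average of each model's per-token tag probability distributions (the exact combination scheme used in the code may differ):

```python
def ensemble_tag_probs(per_model_probs):
    """Sketch of ensembling: average the tag probability distributions
    produced by several models for one token, then the best tag is
    picked from the averaged distribution. The averaging scheme here
    is an assumption, not taken from the repository's code."""
    n = len(per_model_probs)
    length = len(per_model_probs[0])
    return [sum(dist[i] for dist in per_model_probs) / n for i in range(length)]

# two models, two tags (e.g. $KEEP vs. one edit tag)
print(ensemble_tag_probs([[0.5, 0.5], [0.25, 0.75]]))
```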

Train model

To train the model, simply run:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR

There are many parameters you can specify; among them:

  • cold_steps_count - the number of epochs during which only the last linear layer is trained
  • transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert} - the encoder model
  • tn_prob - probability of sampling sentences with no errors; helps to balance precision/recall
  • pieces_per_token - maximum number of subwords per token; helps avoid CUDA out-of-memory errors
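The tn_prob option, for instance, can be understood as a sampling filter over the training pairs: error-free sentences are kept only with the given probability. This is a minimal sketch of that behaviour, not the project's actual data loader; the function and variable names are made up.

```python
import random

def filter_training_pairs(pairs, tn_prob, seed=0):
    """Keep every (source, target) pair that contains an edit, but keep
    error-free pairs (source == target) only with probability tn_prob.
    Lower tn_prob pushes the model toward higher recall; higher tn_prob
    makes it more conservative (higher precision)."""
    rng = random.Random(seed)
    kept = []
    for src, tgt in pairs:
        if src == tgt and rng.random() >= tn_prob:
            continue  # drop this error-free sentence
        kept.append((src, tgt))
    return kept
```

With tn_prob=0 all error-free sentences are discarded; with tn_prob=1 they are all kept.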

In our experiments we used a 98/2 train/dev split.

Model inference

To run your model on the input file use the following command:

python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
                  --vocab_path VOCAB_PATH --input_file INPUT_FILE \
                  --output_file OUTPUT_FILE

Among the parameters:

  • min_error_probability - minimum error probability (as in the paper)
  • additional_confidence - confidence bias (as in the paper)
  • special_tokens_fix - needed to reproduce some of the reported results of pretrained models
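Both inference knobs can be illustrated on a single token's tag distribution. This is a sketch of the behaviour described in the paper, not the actual predict.py code:

```python
def pick_tag(tag_probs, keep_index, additional_confidence, min_error_probability):
    """Sketch of the two inference-time thresholds:
    - additional_confidence is added to the $KEEP tag's probability,
      biasing the model against making edits;
    - if the best non-KEEP tag still wins but its probability is below
      min_error_probability, fall back to $KEEP anyway.
    `tag_probs` is one token's probability distribution over edit tags;
    the function returns the index of the chosen tag."""
    probs = list(tag_probs)
    probs[keep_index] += additional_confidence
    best = max(range(len(probs)), key=probs.__getitem__)
    if best != keep_index and probs[best] < min_error_probability:
        return keep_index
    return best
```

Raising either value trades recall for precision, which is why the tuned values differ per encoder in the pretrained-models table above.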

For evaluation use M^2Scorer and ERRANT.

Citation

If you find this work useful for your research, please cite our paper:

@misc{omelianchuk2020gector,
    title={GECToR -- Grammatical Error Correction: Tag, Not Rewrite},
    author={Kostiantyn Omelianchuk and Vitaliy Atrasevych and Artem Chernodub and Oleksandr Skurzhanskyi},
    year={2020},
    eprint={2005.12592},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
