Official implementation of the paper "GECToR – Grammatical Error Correction: Tag, Not Rewrite", accepted at the 15th Workshop on Innovative Use of NLP for Building Educational Applications (BEA, co-located with ACL 2020). https://arxiv.org/abs/2005.12592


GECToR – Grammatical Error Correction: Tag, Not Rewrite

This repository provides code for training and testing state-of-the-art models for grammatical error correction with the official PyTorch implementation of the following paper:

GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)

It is mainly based on AllenNLP and transformers.

Installation

The following command installs all necessary packages:

pip install -r requirements.txt

The project was tested using Python 3.7.

Datasets

All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data must first be preprocessed and converted to a special format with the following command:

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
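The core idea behind this format is "tag, not rewrite": each source token is aligned with the target sentence and labelled with an edit operation instead of the target being generated from scratch. The following is a toy illustration of that alignment step only; the actual label scheme emitted by preprocess_data.py differs in detail, and the tag names and separators below are assumptions.

```python
import difflib

def edit_tags(source_tokens, target_tokens):
    """Toy sketch of tag derivation: label every source token with an
    edit operation (KEEP/REPLACE/DELETE/APPEND) so that applying the
    tags to the source reproduces the target."""
    tags = ["$KEEP"] * len(source_tokens)
    matcher = difflib.SequenceMatcher(a=source_tokens, b=target_tokens)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":
            for i in range(i1, i2):
                # toy 1:1 mapping of source tokens to replacement tokens
                j = min(j1 + (i - i1), j2 - 1)
                tags[i] = f"$REPLACE_{target_tokens[j]}"
        elif op == "delete":
            for i in range(i1, i2):
                tags[i] = "$DELETE"
        elif op == "insert":
            # attach appended tokens to the preceding source token
            anchor = max(i1 - 1, 0)
            tags[anchor] += "".join(f"|$APPEND_{t}" for t in target_tokens[j1:j2])
    return tags

src = "She go to school".split()
tgt = "She goes to school".split()
print(list(zip(src, edit_tags(src, tgt))))
```

Here "go" would receive a $REPLACE_goes tag while the remaining tokens are tagged $KEEP, which is why a simple per-token classifier is enough at inference time.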

Pretrained models

Pretrained encoder        Confidence bias   Min error prob   CoNLL-2014 (test)   BEA-2019 (test)
BERT [link]               0.10              0.41             63.0                67.6
RoBERTa [link]            0.20              0.50             64.0                71.5
XLNet [link]              0.35              0.66             65.3                72.4
RoBERTa + XLNet           0.24              0.45             66.0                73.7
BERT + RoBERTa + XLNet    0.16              0.40             66.5                73.6
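The last two rows are ensembles of single models. One plausible sketch of how such an ensemble can be combined, assuming a simple average of each model's per-token tag probability distributions (the exact combination scheme used in the code may differ):

```python
def ensemble_tag_probs(per_model_probs):
    """Sketch of ensembling: average the tag probability distributions
    produced by several models for one token, then the best tag is
    picked from the averaged distribution. The averaging scheme here
    is an assumption, not taken from the repository's code."""
    n = len(per_model_probs)
    length = len(per_model_probs[0])
    return [sum(dist[i] for dist in per_model_probs) / n for i in range(length)]

# two models, two tags (e.g. $KEEP vs. one edit tag)
print(ensemble_tag_probs([[0.5, 0.5], [0.25, 0.75]]))
```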

Train model

To train the model, simply run:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR

There are many parameters you can specify; among them:

  • cold_steps_count - the number of epochs during which only the last linear layer is trained
  • transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert} - the encoder model
  • tn_prob - probability of sampling sentences with no errors; helps to balance precision/recall
  • pieces_per_token - maximum number of subwords per token; helps avoid CUDA out-of-memory errors
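The tn_prob option, for instance, can be understood as a sampling filter over the training pairs: error-free sentences are kept only with the given probability. This is a minimal sketch of that behaviour, not the project's actual data loader; the function and variable names are made up.

```python
import random

def filter_training_pairs(pairs, tn_prob, seed=0):
    """Keep every (source, target) pair that contains an edit, but keep
    error-free pairs (source == target) only with probability tn_prob.
    Lower tn_prob pushes the model toward higher recall; higher tn_prob
    makes it more conservative (higher precision)."""
    rng = random.Random(seed)
    kept = []
    for src, tgt in pairs:
        if src == tgt and rng.random() >= tn_prob:
            continue  # drop this error-free sentence
        kept.append((src, tgt))
    return kept
```

With tn_prob=0 all error-free sentences are discarded; with tn_prob=1 they are all kept.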

In our experiments we used a 98/2 train/dev split.

Model inference

To run your model on the input file use the following command:

python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
                  --vocab_path VOCAB_PATH --input_file INPUT_FILE \
                  --output_file OUTPUT_FILE

Among the parameters:

  • min_error_probability - minimum error probability (as in the paper)
  • additional_confidence - confidence bias (as in the paper)
  • special_tokens_fix - needed to reproduce some of the reported results of pretrained models
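Both inference knobs can be illustrated on a single token's tag distribution. This is a sketch of the behaviour described in the paper, not the actual predict.py code:

```python
def pick_tag(tag_probs, keep_index, additional_confidence, min_error_probability):
    """Sketch of the two inference-time thresholds:
    - additional_confidence is added to the $KEEP tag's probability,
      biasing the model against making edits;
    - if the best non-KEEP tag still wins but its probability is below
      min_error_probability, fall back to $KEEP anyway.
    `tag_probs` is one token's probability distribution over edit tags;
    the function returns the index of the chosen tag."""
    probs = list(tag_probs)
    probs[keep_index] += additional_confidence
    best = max(range(len(probs)), key=probs.__getitem__)
    if best != keep_index and probs[best] < min_error_probability:
        return keep_index
    return best
```

Raising either value trades recall for precision, which is why the tuned values differ per encoder in the pretrained-models table above.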

For evaluation use M^2Scorer and ERRANT.

Citation

If you find this work useful for your research, please cite our paper:

@misc{omelianchuk2020gector,
    title={GECToR -- Grammatical Error Correction: Tag, Not Rewrite},
    author={Kostiantyn Omelianchuk and Vitaliy Atrasevych and Artem Chernodub and Oleksandr Skurzhanskyi},
    year={2020},
    eprint={2005.12592},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
