
A Low-Resource Approach to the Grammatical Error Correction of Ukrainian

This repository contains the code for the paper "A Low-Resource Approach to the Grammatical Error Correction of Ukrainian", presented at the UNLP Workshop at EACL 2023.

Authors: Frank Palma Gomez, Alla Rozovskaya, Dan Roth

Getting Started

Clone this repo:

$ git clone https://github.com/knarfamlap/low-resource-gec-uk.git

Create a Python environment:

$ python3 -m venv env
$ source env/bin/activate

Install dependencies:

$ python3 -m pip install -r requirements.txt

Training

Use the following script to train mT5-large:

$ python3 train.py --train_tsv /path/to/train/tsv \
    --valid_tsv /path/to/valid/tsv \
    --output_dir /path/to/output/dir \
    --model_name mt5-large \
    --batch_size 32 \
    --max_epochs 3
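
The training data format is not documented in this README; a reasonable assumption, given the seq2seq setup, is one tab-separated (source, target) pair per line, with the erroneous sentence first and its correction second. A hypothetical sketch of writing such a file (the column order and layout are assumptions, not the repo's documented schema):

import csv

# Each pair is (noisy sentence, corrected sentence); examples are illustrative.
pairs = [
    ("Вона пішла до школа.", "Вона пішла до школи."),
    ("Я хочу купити новий книга.", "Я хочу купити нову книгу."),
]

with open("train.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for source, target in pairs:
        writer.writerow([source, target])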

Inference

To generate hypotheses:

$ python3 generate.py --path_to_model /path/to/trained/model \
                      --orig /path/to/inputs

generate.sh is a wrapper script that generates hypotheses in parallel.
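
For reference, here is a minimal sketch of what inference looks like conceptually, assuming the checkpoint saved by train.py is a standard Hugging Face transformers seq2seq model directory (the actual generate.py may load the model differently, and the generation settings below are illustrative):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_dir = "/path/to/trained/model"  # directory produced by train.py
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

sentence = "Вона пішла до школа."  # ungrammatical input
inputs = tokenizer(sentence, return_tensors="pt")
# Beam search and the token limit are illustrative, not the repo's defaults.
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))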

Citation

@inproceedings{gomez-etal-2023-low,
    title = "A Low-Resource Approach to the Grammatical Error Correction of {U}krainian",
    author = "Gomez, Frank  and
      Rozovskaya, Alla  and
      Roth, Dan",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.14",
    pages = "114--120",
    abstract = "We present our system that participated in the shared task on the grammatical error correction of Ukrainian. We have implemented two approaches that make use of large pre-trained language models and synthetic data, that have been used for error correction of English as well as low-resource languages. The first approach is based on fine-tuning a large multilingual language model (mT5) in two stages: first, on synthetic data, and then on gold data. The second approach trains a (smaller) seq2seq Transformer model pre-trained on synthetic data and fine-tuned on gold data. Our mT5-based model scored first in {``}GEC only{''} track, and a very close second in the {``}GEC+Fluency{''} track. Our two key innovations are (1) finetuning in stages, first on synthetic, and then on gold data; and (2) a high-quality corruption method based on roundtrip machine translation to complement existing noisification approaches.",
}
