ayyoobimani/GLP-POS


Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring labels from high-resource languages. This repository contains the code, data, and trained models for the paper Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging.

The paper proposes a novel method for transferring labels from multiple high-resource source languages to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we align words for all language pairs and create a graph with words as nodes and alignment links as edges. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state of the art for unsupervised POS tagging of low-resource languages.
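To make the graph formulation concrete, here is a minimal sketch of the idea. It is not the method from the paper: the Graph Neural Network with transformer layers is replaced by a plain majority vote over aligned neighbours, and the language code `tgt`, the function names, and the toy alignments are all hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_alignment_graph(alignments):
    """Build an undirected graph from word-alignment links.

    Each node is a (language, word_index) pair; each link joins two
    aligned words from different languages.
    """
    graph = defaultdict(set)
    for u, v in alignments:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def propagate_labels(graph, source_labels, rounds=2):
    """Majority-vote propagation (a simple stand-in for the paper's GNN)."""
    labels = dict(source_labels)
    for _ in range(rounds):
        updates = {}
        for node, neighbours in graph.items():
            if node in labels:
                continue  # already labeled (source word or earlier round)
            votes = Counter(labels[n] for n in neighbours if n in labels)
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


# Toy sentence: two labeled source languages, one unlabeled target ("tgt").
alignments = [
    (("eng", 0), ("tgt", 0)),
    (("deu", 0), ("tgt", 0)),
    (("eng", 1), ("tgt", 1)),
    (("deu", 1), ("tgt", 1)),
]
source_labels = {
    ("eng", 0): "DET", ("deu", 0): "DET",
    ("eng", 1): "NOUN", ("deu", 1): "NOUN",
}

graph = build_alignment_graph(alignments)
labels = propagate_labels(graph, source_labels)
target_tags = {node: tag for node, tag in labels.items() if node[0] == "tgt"}
```

In this toy case each target word receives the tag that its aligned source words agree on. A learned model can additionally weigh conflicting or noisy alignment links, which a plain vote cannot.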

Reproduce the results in the paper

If you only intend to reproduce the final results, download the trained models from our website and run them with the predict.py file.

python3 predict.py --gpu 0 --model model_path --test test_path

The ground truth files used in our paper are available in the TEST directory. These files come from the Universal Dependencies project, modified as described in our paper to accommodate the multiword tokens problem.
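For readers unfamiliar with multiword tokens in Universal Dependencies, the sketch below reads CoNLL-U lines and collects (form, UPOS) pairs. It follows the standard CoNLL-U convention, in which a multiword token appears as a range line such as `1-2` followed by its syntactic words. Skipping the range line is an assumption made here for illustration; it is not necessarily the modification described in the paper, and `read_conllu_tags` is a hypothetical helper, not part of this repository.

```python
def read_conllu_tags(lines):
    """Return a list of sentences, each a list of (form, upos) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        token_id = cols[0]
        # Range IDs ("1-2") mark multiword tokens, decimal IDs ("3.1")
        # mark empty nodes; neither carries a UPOS tag of its own here.
        if "-" in token_id or "." in token_id:
            continue
        current.append((cols[1], cols[3]))
    if current:
        sentences.append(current)
    return sentences


# Portuguese "do" is a multiword token made of "de" + "o".
sample = [
    "# text = do livro",
    "1-2\tdo\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\to\to\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tlivro\tlivro\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
]
tagged = read_conllu_tags(sample)
```

Here the surface form "do" yields two tagged words, which is why a tagger that operates on surface tokens needs some convention for such cases.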

Train Part-of-Speech taggers using our generated data

Use the train_pos_tagger.py file to train POS taggers for different languages. The training data generated by our GLP models are available in the GLP1 and GLP2 directories.

The code was tested with Python 3.6, Transformers 4.15.0, and Torch 1.10.1.

Install Flair with its dependencies (see their repo):

pip install flair

Download our modified version of Flair and replace the original Flair folder with ours.

Train POS model example:

python3 train_pos_tagger.py --lang por --gpu 5 --train train_file.conllu --test test_file.conllu --epochs 30

Prediction and evaluation of POS model example:

python3 predict.py --gpu 0 --model model_path --test test_path
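As a point of reference for the evaluation step, the following is a minimal sketch of token-level tagging accuracy, assuming that is the metric of interest. The exact output of predict.py is not documented here, so `tagging_accuracy` and the toy tag sequences are hypothetical and serve only to illustrate the computation.

```python
def tagging_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        for g, p in zip(gold, pred):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0


# Toy data: one wrong tag out of five tokens.
gold = [["DET", "NOUN", "VERB"], ["PRON", "VERB"]]
pred = [["DET", "NOUN", "NOUN"], ["PRON", "VERB"]]
accuracy = tagging_accuracy(gold, pred)
```

The length check matters in practice: if gold and predicted files tokenize multiword tokens differently, the sequences no longer line up and per-token comparison is meaningless.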

Train the GLP models

To train the GLP models, use the code available in the GLP_code directory.

Publication

If you use the code, please cite:

@misc{https://doi.org/10.48550/arxiv.2210.09840,
  doi = {10.48550/ARXIV.2210.09840},  
  url = {https://arxiv.org/abs/2210.09840},
  author = {Imani, Ayyoob and Severini, Silvia and Sabet, Masoud Jalili and Yvon, François and Schütze, Hinrich},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

License

A full copy of the license can be found in LICENSE.
