hangyav/lowresCCWR

Improving Low-Resource Languages in Pre-Trained Multilingual Language Models

Method

Install

pip install -r requirements.txt

Run

See scripts/ for example scripts:

  • ./scripts/example_vocab_extend.sh: builds vocabulary-extended LMs (eBERT)
  • ./scripts/example_unsup_alignment.sh: runs unsupervised mining-based alignment and evaluation
  • ./scripts/example_ner.sh: runs NER evaluation
  • ./scripts/mine_word_pairs.sh: mines word pairs from a given source and target dataset pair

For further details, see the relevant shell scripts and the parameter descriptions in the Python scripts they call.

External Resources and Tools

Data

We used the following datasets for the experiments:

Tokenizers

The datasets are tokenized with the following tools:

  • English: Moses Tokenizer: https://github.com/moses-smt/mosesdecoder
  • Nepali: Tokenizer and Normalizer from indic-nlp: https://github.com/anoopkunchukuttan/indic_nlp_library
  • Swahili: Moses Tokenizer
  • Malayalam: Tokenizer and Normalizer from indic-nlp
  • Sinhala: Tokenizer and Normalizer from indic-nlp
  • Maori: Moses Tokenizer
  • Sindhi: Moses Tokenizer
  • Amharic: https://github.com/uhh-lt/amharicprocessor
  • Gujarati: Tokenizer and Normalizer from indic-nlp
  • Kannada: Tokenizer and Normalizer from indic-nlp
  • Bengali: Tokenizer and Normalizer from indic-nlp
  • Afrikaans: Moses Tokenizer
  • Macedonian: Moses Tokenizer
  • Basque: Moses Tokenizer
  • Bulgarian: Moses Tokenizer
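
When preprocessing several corpora in one pass, the language-to-tool assignment listed above can be kept in a small lookup table. The following is a minimal sketch only: the names TOKENIZER_FOR and tokenizer_for and the tool labels are hypothetical and do not come from this repository. It records which tool family applies to each language and does not call the tokenizers themselves.

```python
# Hypothetical lookup table transcribed from the tokenizer list above.
# The labels below are illustrative, not identifiers used in this repository.
MOSES = "moses"                          # Moses Tokenizer (mosesdecoder)
INDIC_NLP = "indic-nlp"                  # Tokenizer and Normalizer from indic-nlp
AMHARIC_PROCESSOR = "amharicprocessor"   # uhh-lt/amharicprocessor

TOKENIZER_FOR = {
    "English": MOSES,
    "Nepali": INDIC_NLP,
    "Swahili": MOSES,
    "Malayalam": INDIC_NLP,
    "Sinhala": INDIC_NLP,
    "Maori": MOSES,
    "Sindhi": MOSES,
    "Amharic": AMHARIC_PROCESSOR,
    "Gujarati": INDIC_NLP,
    "Kannada": INDIC_NLP,
    "Bengali": INDIC_NLP,
    "Afrikaans": MOSES,
    "Macedonian": MOSES,
    "Basque": MOSES,
    "Bulgarian": MOSES,
}


def tokenizer_for(language: str) -> str:
    """Return the tool label listed for `language` (case-insensitive)."""
    key = language.strip().title()
    try:
        return TOKENIZER_FOR[key]
    except KeyError:
        raise ValueError(f"no tokenizer listed for {language!r}") from None
```

A preprocessing driver could branch on tokenizer_for("Nepali") to decide which external tool to invoke for a given corpus.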

Cite

Related publications:

[1] Viktor Hangya, Hossain Shaikh Saadi, and Alexander Fraser. 2022. Improving Low-Resource Languages in Pre-Trained Multilingual Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11993–12006, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

@inproceedings{hangya-etal-2022-improving,
    title = "Improving Low-Resource Languages in Pre-Trained Multilingual Language Models",
    author = "Hangya, Viktor  and
      Saadi, Hossain Shaikh  and
      Fraser, Alexander",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.822",
    pages = "11993--12006",
}
