MLRS/malti

Maltese Text Processing

A set of processing pipelines for Maltese that map each token to its Arabic transliteration, to a translation, or to the original token.

This repository used to contain code for Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect. For a snapshot of the code for replication purposes refer to the 2023.cawl-1.4 tag.

The current code contains improvements as detailed in Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching. A summary of changes from the previous work:

Usage

Installation

In a virtual environment install the dependencies:

pip install -r requirements.txt

Note that this might not work on Windows, in which case you might have to use WSL.

The word/character ranking models can be obtained from: https://github.com/CAMeL-Lab/HierarchicalArabicDialectID. The sub-tokens count model is a reference to a tokenizer compatible with the transformers library.

Command line

process.py is a script intended to process entire datasets in one pass. Run python process.py -h to see its documentation.

Transliteration (Xara)

To perform transliteration, specify the transliterate parameter along with any additional parameters:

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --transliterate \
  --rankers word_model_score character_model_score \
  --ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
  --token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map

This performs the Xara pipeline, which is transliteration using the full token mappings & the non-deterministic character mappings with Tunisian word model score ranking. Refer to transliterate.sh, which transliterates a given dataset in all configurations from Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect.
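When the same transliteration configuration has to be applied to several datasets, the command can be assembled from Python. The following is a minimal sketch, assuming the three positional arguments and the flags appear exactly as in the command above; the function name, dataset names and input/output paths are hypothetical and only illustrate the idea.

```python
import shlex
import sys

# Model and mapping paths copied from the command above; adjust as needed.
WORD_MODEL = "../models/aggregated_country/lm/word/tn-maghreb.arpa"
CHAR_MODEL = "../models/aggregated_country/lm/char/tn-maghreb.arpa"
TOKEN_MAPPINGS = [
    "mappings/small_closed_class.map",
    "mappings/additional_closed_class.map",
]


def build_transliterate_command(dataset, input_path, output_path):
    """Return the argument list for the transliteration command shown above."""
    return [
        sys.executable, "process.py", dataset, input_path, output_path,
        "--transliterate",
        "--rankers", "word_model_score", "character_model_score",
        "--ranker_models", WORD_MODEL, CHAR_MODEL,
        "--token_mappings", *TOKEN_MAPPINGS,
    ]


if __name__ == "__main__":
    # Hypothetical dataset names and paths, for illustration only.
    for name in ["dataset_a", "dataset_b"]:
        command = build_transliterate_command(name, f"data/{name}", f"out/{name}")
        # Print the command; pass it to subprocess.run(command, check=True) to execute it.
        print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting issues with paths that contain spaces.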

(Word) Translation (T*)

To perform word-level translation, specify the translate parameter. For instance, to apply the Teng pipeline (English translation):

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --translate \
  --translation_system "mt-en"

where translation_system corresponds to one of the translation files.

Partial Transliteration (Xara/P)

When etymology tags are specified with the transliterate/translate parameter, partial transliteration/translation is performed. This uses an etymology_model to predict the etymology of each word before applying the specified action. A "pass" (leaving the token as is) is performed for any token whose etymology tag is not specified in these parameters.

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --etymology_model="etymology_data/model.pickle" \
  --transliterate "Arabic" \
  --rankers word_model_score character_model_score \
  --ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
  --token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map
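
The per-token behaviour described in this section (predict an etymology tag, apply the action registered for that tag, otherwise pass) can be sketched as follows. This is an illustration only: `process_token`, the toy classifier and the toy transliterator are hypothetical stand-ins, not the repository's etymology model or transliteration code.

```python
def process_token(token, predict_etymology, actions):
    """Apply the action registered for the token's predicted tag, else pass."""
    tag = predict_etymology(token)
    action = actions.get(tag)
    # Tags with no registered action are a "pass": the token is left as is.
    return token if action is None else action(token)


# Hypothetical stand-in for the etymology model: a hard-coded lookup.
def toy_predict_etymology(token):
    return "Arabic" if token in {"kelma", "dar"} else "Non-Arabic"


# Hypothetical stand-in for the transliterator: it only marks the token.
def toy_transliterate(token):
    return f"<ara:{token}>"


# Mirrors --transliterate "Arabic": only Arabic-tagged tokens get an action.
actions = {"Arabic": toy_transliterate}
tokens = ["dar", "kompjuter"]
print([process_token(t, toy_predict_etymology, actions) for t in tokens])
# prints ['<ara:dar>', 'kompjuter']
```

Keeping the tag-to-action table separate from the classifier means adding translation for another tag only requires another entry in `actions`.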

Transliteration & Translation Mixing (Xara/T*)

Specifying etymology tags (corresponding to those predicted by the etymology_model) with both the transliterate & translate parameters mixes transliteration with translation at the token level. For instance, to apply the Xara/Teng pipeline:

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --etymology_model="etymology_data/model.pickle" \
  --transliterate "Arabic" \
  --translate "Non-Arabic" "Name" \
  --rankers word_model_score character_model_score \
  --ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
  --token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map \
  --translation_system "mt-en"

Multiple translation systems can also be specified, one for each etymology tag given in the translate argument. For instance, to apply the Xara/Tara pipeline:

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --etymology_model="etymology_data/model.pickle" \
  --transliterate "Arabic" "Symbol" \
  --translate "Non-Arabic" "Code-Switching" "Name" \
  --rankers word_model_score character_model_score \
  --ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
  --token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map \
  --translation_system "mt-ar" "en-ar" "mt-ar"
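
One way to read the two examples above is that a single translation system is shared by every tag, while several systems are paired with the translate tags by position. The sketch below makes that reading explicit; the positional pairing and the function name are assumptions for illustration and are not taken from process.py.

```python
def pair_systems_with_tags(translate_tags, translation_systems):
    """Map each etymology tag to a translation system (assumed positional)."""
    if len(translation_systems) == 1:
        # One system given: assume it is shared by every tag.
        return {tag: translation_systems[0] for tag in translate_tags}
    if len(translation_systems) != len(translate_tags):
        raise ValueError("expected one translation system, or one per tag")
    return dict(zip(translate_tags, translation_systems))


# Values taken from the command above.
mapping = pair_systems_with_tags(
    ["Non-Arabic", "Code-Switching", "Name"],
    ["mt-ar", "en-ar", "mt-ar"],
)
print(mapping)
# prints {'Non-Arabic': 'mt-ar', 'Code-Switching': 'en-ar', 'Name': 'mt-ar'}
```

Under this reading, Code-Switching tokens would be translated with the en-ar system while the other two tags use mt-ar.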

Python Code

Refer to the demo notebook for examples.

Citations

The latest version of this work is published under:

@inproceedings{micallef-etal-2024-maltese-etymology,
    title = "Cross-Lingual Transfer from Related Languages: Treating Low-Resource {M}altese as Multilingual Code-Switching",
    author = "Micallef, Kurt  and
              Habash, Nizar  and
              Borg, Claudia  and
              Eryani, Fadhl  and
              Bouamor, Houda",
    editor = "Graham, Yvette  and
              Purver, Matthew",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.eacl-long.61",
    pages = "1014--1025",
}

The original transliteration system was published under:

@inproceedings{micallef-etal-2023-maltese-transliteration,
    title = "Exploring the Impact of Transliteration on {NLP} Performance: Treating {M}altese as an {A}rabic Dialect",
    author = "Micallef, Kurt  and
              Eryani, Fadhl  and
              Habash, Nizar  and
              Bouamor, Houda  and
              Borg, Claudia",
    editor = "Gorman, Kyle  and
              Sproat, Richard  and
              Roark, Brian",
    booktitle = "Proceedings of the Workshop on Computation and Written Language (CAWL 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.cawl-1.4",
    doi = "10.18653/v1/2023.cawl-1.4",
    pages = "22--32",
}

For fine-tuning instructions & dataset references see: https://github.com/MLRS/BERTu/tree/2022.deeplo-1.10/finetune