Maltese Text Processing

A set of processing pipelines for Maltese that map each token to an Arabic transliteration, a translation, or the original token unchanged.

This repository used to contain code for Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect. For a snapshot of the code for replication purposes refer to the 2023.cawl-1.4 tag.

The current code contains improvements as detailed in Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching. A summary of changes from the previous work:

Transliteration character mapping updates: added t→ث, digits, & other miscellaneous symbols. Also fixed a bug that prevented the ظ/ث/أ characters from being generated.
Word Etymology data.
Etymology Classification code & classifier.
Pre-computed word-level translations using Google Translate.
Updated the transliteration pipeline to allow for translations, passing as is, & mixing decisions using the etymology classifier.

Usage

Installation

In a virtual environment install the dependencies:

pip install -r requirements.txt

Note that this might not work on Windows, in which case you might have to use WSL.

The word/character ranking models can be obtained from: https://github.com/CAMeL-Lab/HierarchicalArabicDialectID. The sub-tokens count model is specified as a reference to a tokenizer compatible with transformers.

Command line

The process.py script is intended to process entire datasets in one pass. Run python process.py -h to see its documentation.

Transliteration (X_ara)

To perform transliteration, specify the transliterate parameter as well as any other additional parameters:

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --transliterate \
  --rankers word_model_score character_model_score \
  --ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
  --token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map

This performs the X_ara pipeline, which is transliteration using the full token mappings & the non-deterministic character mappings with Tunisian word model score ranking. Refer to transliterate.sh, which transliterates a given dataset in all configurations from Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect.
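
To illustrate what non-deterministic character mappings with score-based ranking could look like, here is a minimal Python sketch. It is not the repository's implementation: `CHAR_MAP`, `candidates`, `best_candidate`, and `toy_score` are hypothetical names, the mapping table is a tiny made-up subset, and the toy scorer only stands in for a real word or character language model.

```python
from itertools import product

# Hypothetical, heavily reduced character map (assumption): each Maltese
# character lists one or more Arabic options. The real mappings are larger.
CHAR_MAP = {
    "t": ["ت", "ث"],
    "a": ["ا"],
    "b": ["ب"],
}


def candidates(token):
    """Return every Arabic-script candidate; unmapped characters pass through."""
    options = [CHAR_MAP.get(ch, [ch]) for ch in token.lower()]
    return ["".join(combo) for combo in product(*options)]


def best_candidate(token, score):
    """Pick the candidate that the caller-supplied scoring function rates highest."""
    return max(candidates(token), key=score)


# Toy stand-in for a language-model score (assumption): made-up word counts.
TOY_COUNTS = {"تاب": 5, "ثاب": 3}


def toy_score(word):
    return TOY_COUNTS.get(word, 0)


print(candidates("tab"))
print(best_candidate("tab", toy_score))
```

Because one character can map to several options, the number of candidates grows multiplicatively with token length, which is why some ranking step is needed to pick a single output.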

(Word) Translation (T_*)

To perform word-level translation, specify the translate parameter. For instance, to apply the T_en pipeline (English translation):

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --translate \
  --translation_system "mt-en"

where translation_system corresponds to one of the translation files.
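
Word-level translation from pre-computed files amounts to a dictionary lookup per token. The sketch below assumes a tab-separated "source, target" layout and assumes that unknown tokens are kept unchanged; neither is confirmed by this README, and `load_translations` and `translate_tokens` are hypothetical helper names.

```python
import csv
import io


def load_translations(handle):
    """Read 'source<TAB>target' rows into a dictionary (assumed file layout)."""
    reader = csv.reader(handle, delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def translate_tokens(tokens, table):
    """Replace each token with its stored translation; keep unknown tokens as is."""
    return [table.get(token, token) for token in tokens]


# An in-memory sample stands in for one of the translation files.
sample = io.StringIO("kelb\tdog\nqattus\tcat\n")
table = load_translations(sample)
print(translate_tokens(["kelb", "u", "qattus"], table))  # ['dog', 'u', 'cat']
```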

Partial Transliteration (X_ara/P)

When etymology tags are specified with the transliterate/translate parameter, partial transliteration/translation is performed. This uses an etymology_model to predict the etymology of each word before applying the specified action. A "pass" (leaving the token as is) is performed for any token whose predicted etymology tag is not listed in these parameters.

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --etymology_model="etymology_data/model.pickle" \
  --transliterate "Arabic" \
  --rankers word_model_score character_model_score \
  --ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
  --token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map
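
The per-token decision described above can be sketched as a small dispatch loop. This is an illustration only: `process_tokens`, `toy_tagger`, and `toy_transliterate` are hypothetical names, the tagger is a lookup table rather than the trained etymology classifier, and the transliterator just wraps the token in a marker.

```python
def process_tokens(tokens, predict_tag, actions):
    """Apply the action registered for each token's predicted tag, else pass."""
    output = []
    for token in tokens:
        action = actions.get(predict_tag(token))
        # No action registered for this tag: leave the token as is ("pass").
        output.append(action(token) if action else token)
    return output


# Toy stand-ins (assumptions), not the repository's classifier or transliterator.
TOY_TAGS = {"kelb": "Arabic", "televixin": "Non-Arabic"}


def toy_tagger(token):
    return TOY_TAGS.get(token, "Symbol")


def toy_transliterate(token):
    return f"<ara:{token}>"


result = process_tokens(
    ["kelb", "televixin", "!"],
    toy_tagger,
    {"Arabic": toy_transliterate},
)
print(result)  # ['<ara:kelb>', 'televixin', '!']
```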

Transliteration & Translation Mixing (X_ara/T_*)

Specifying etymology tags (corresponding to those predicted by the etymology_model) with both the transliterate & translate parameters mixes transliteration with translation at the token level. For instance, to apply the X_ara/T_eng pipeline:

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --etymology_model="etymology_data/model.pickle" \
  --transliterate "Arabic" \
  --translate "Non-Arabic" "Name" \
  --rankers word_model_score character_model_score \
  --ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
  --token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map \
  --translation_system "mt-en"

Multiple translation systems can also be specified, one for each etymology tag passed to the translate argument. For instance, to apply the X_ara/T_ara pipeline:

python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
  --etymology_model="etymology_data/model.pickle" \
  --transliterate "Arabic" "Symbol" \
  --translate "Non-Arabic" "Code-Switching" "Name" \
  --rankers word_model_score character_model_score \
  --ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
  --token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map \
  --translation_system "mt-ar" "en-ar" "mt-ar"
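
One way to read the two commands above is that a single translation system is shared by all translate tags, while several systems are matched to the tags by position. The sketch below encodes that reading; the positional pairing is an assumption drawn from the examples, not confirmed behaviour of process.py, and `pair_systems` is a hypothetical helper.

```python
def pair_systems(translate_tags, translation_systems):
    """Map each translate tag to a translation system name.

    Assumption: a single system is shared by every tag; otherwise systems
    are matched to tags by position.
    """
    if len(translation_systems) == 1:
        return {tag: translation_systems[0] for tag in translate_tags}
    if len(translation_systems) != len(translate_tags):
        raise ValueError("expected one translation system, or one per translate tag")
    return dict(zip(translate_tags, translation_systems))


shared = pair_systems(["Non-Arabic", "Name"], ["mt-en"])
per_tag = pair_systems(
    ["Non-Arabic", "Code-Switching", "Name"],
    ["mt-ar", "en-ar", "mt-ar"],
)
print(shared)   # {'Non-Arabic': 'mt-en', 'Name': 'mt-en'}
print(per_tag)  # {'Non-Arabic': 'mt-ar', 'Code-Switching': 'en-ar', 'Name': 'mt-ar'}
```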

Python Code

Refer to the demo notebook for examples.

Citations

The latest version of this work is published under:

@inproceedings{micallef-etal-2024-maltese-etymology,
    title = "Cross-Lingual Transfer from Related Languages: Treating Low-Resource {M}altese as Multilingual Code-Switching",
    author = "Micallef, Kurt  and
              Habash, Nizar  and
              Borg, Claudia  and
              Eryani, Fadhl  and
              Bouamor, Houda",
    editor = "Graham, Yvette  and
              Purver, Matthew",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.eacl-long.61",
    pages = "1014--1025",
}

The original transliteration system was published under:

@inproceedings{micallef-etal-2023-maltese-transliteration,
    title = "Exploring the Impact of Transliteration on {NLP} Performance: Treating {M}altese as an {A}rabic Dialect",
    author = "Micallef, Kurt  and
              Eryani, Fadhl  and
              Habash, Nizar  and
              Bouamor, Houda  and
              Borg, Claudia",
    editor = "Gorman, Kyle  and
              Sproat, Richard  and
              Roark, Brian",
    booktitle = "Proceedings of the Workshop on Computation and Written Language (CAWL 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.cawl-1.4",
    doi = "10.18653/v1/2023.cawl-1.4",
    pages = "22--32",
}

For fine-tuning instructions & dataset references see: https://github.com/MLRS/BERTu/tree/2022.deeplo-1.10/finetune
