A set of processing pipelines for Maltese, mapping tokens to Arabic transliterations, to translations, or keeping the original token.
This repository used to contain code for Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect. For a snapshot of the code for replication purposes, refer to the 2023.cawl-1.4 tag.
The current code contains improvements as detailed in Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching. A summary of changes from the previous work:
- Transliteration character mapping updates: added t→ث, digits, & other miscellaneous symbols. Also fixed a bug which prevented the ظ/ث/أ characters from being generated.
- Word Etymology data.
- Etymology Classification code & classifier.
- Pre-computed word-level translations using Google Translate.
- Updated the transliteration pipeline to allow for translations, passing as is, & mixing decisions using the etymology classifier.
In a virtual environment, install the dependencies:
pip install -r requirements.txt
Note that this might not work on Windows; you might have to use WSL.
The word/character ranking models can be obtained from: https://github.com/CAMeL-Lab/HierarchicalArabicDialectID.
The sub-tokens count model is a reference to tokenizers compatible with transformers.
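As an illustration of what a sub-token count captures, here is a toy sketch of WordPiece-style greedy longest-match tokenization. The real pipeline would use a tokenizer from the transformers library; the vocabulary below is purely illustrative.

```python
# Toy sketch of sub-token counting, mimicking WordPiece-style greedy
# longest-match tokenization. The vocabulary is illustrative; a real setup
# would use a tokenizer from the `transformers` library instead.
TOY_VOCAB = {"il", "kel", "##ma", "huwa"}

def subtoken_count(word, vocab=TOY_VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            # Non-initial pieces carry the WordPiece "##" continuation prefix.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # Unknown character: count it as a single piece.
            pieces.append(word[start])
            start += 1
        else:
            pieces.append(match)
            start = end
    return len(pieces)
```

Fewer sub-tokens generally indicates that a word is better covered by the tokenizer's vocabulary, which is what the sub-tokens count ranking exploits.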
A script intended to process entire datasets in one pass.
Execute python process.py -h to access the documentation.
Transliteration (Xara)
To perform transliteration, specify the transliterate parameter as well as any additional parameters:
python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
--transliterate \
--rankers word_model_score character_model_score \
--ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
--token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map
This performs the Tara pipeline: transliteration using the full token mappings & the non-deterministic character mappings, with Tunisian word model score ranking. Refer to transliterate.sh, which transliterates a given dataset in all configurations from Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect.
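The ranking step can be sketched as follows: the non-deterministic character mappings yield several candidate transliterations per token, and a language model score selects the best one. The character map and scores below are toy values; the actual pipeline ranks candidates with ARPA language models trained on Tunisian Arabic (the .arpa files passed via --ranker_models).

```python
import itertools

# Toy non-deterministic character mapping: "t" has two possible outputs.
# These mappings and the scores below are illustrative, not the repository's.
CHAR_MAP = {"t": ["ت", "ث"], "a": ["ا"], "x": ["ش"]}

def candidates(token):
    # Cartesian product over the per-character options yields all
    # candidate transliterations of the token.
    options = [CHAR_MAP.get(ch, [ch]) for ch in token]
    return ["".join(combo) for combo in itertools.product(*options)]

# Stand-in for an ARPA language model: toy log-probabilities per candidate.
TOY_LM = {"تا": -1.0, "ثا": -3.5}

def best_candidate(token):
    # Pick the candidate the "language model" scores highest.
    return max(candidates(token), key=lambda c: TOY_LM.get(c, -10.0))
```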
(Word) Translation (T*)
To perform word-level translation, specify the translate parameter.
For instance, to apply the Ten pipeline (English translation):
python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
--translate \
--translation_system "mt-en"
where translation_system corresponds to one of the translation files.
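As a sketch of how such a pre-computed word-level translation file might be applied, assuming a tab-separated source/target format (the format and function names are assumptions, not the repository's documented ones):

```python
# Sketch: word-level translation lookup from a pre-computed table.
# Assumes one "source<TAB>target" pair per line; this format is an
# assumption for illustration only.
def load_translations(path):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, tgt = line.rstrip("\n").split("\t")
            table[src] = tgt
    return table

def translate_word(word, table):
    # Fall back to the original token when no translation is available.
    return table.get(word, word)
```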
Partial Transliteration (Xara/P)
When specifying etymology tags with the transliterate/translate parameter, partial transliteration/translation is performed. This uses an etymology_model to predict the etymology of each word before applying the specified action. A "pass" (leaving the token as is) is performed for any token whose etymology tag is not specified in these parameters.
python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
--etymology_model="etymology_data/model.pickle" \
--transliterate "Arabic" \
--rankers word_model_score character_model_score \
--ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
--token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map
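The per-token decision logic described above can be sketched as follows; the function names are illustrative, not the repository's actual API:

```python
# Sketch of the per-token decision: transliterate, translate, or pass,
# depending on the etymology tag predicted for the token. Names are
# illustrative assumptions, not the repository's actual API.
def process_token(token, tag, transliterate_tags, translate_tags,
                  transliterate_fn, translate_fn):
    if tag in transliterate_tags:
        return transliterate_fn(token)
    if tag in translate_tags:
        return translate_fn(token)
    # "Pass": any tag not listed in either parameter leaves the token as is.
    return token
```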
Transliteration & Translation Mixing (Xara/T*)
Specifying the etymology tags (corresponding to those predicted by the etymology_model) with the transliterate & translate parameters mixes transliteration with translation at the token level.
For instance, to apply the Xara/Teng pipeline:
python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
--etymology_model="etymology_data/model.pickle" \
--transliterate "Arabic" \
--translate "Non-Arabic" "Name" \
--rankers word_model_score character_model_score \
--ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
--token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map \
--translation_system "mt-en"
Multiple translation systems can also be specified, one for each etymology tag given in the translate argument.
For instance, to apply the Xara/Tara pipeline:
python process.py ${dataset} ${INPUT_PATH} ${OUTPUT_PATH} \
--etymology_model="etymology_data/model.pickle" \
--transliterate "Arabic" "Symbol" \
--translate "Non-Arabic" "Code-Switching" "Name" \
--rankers word_model_score character_model_score \
--ranker_models "../models/aggregated_country/lm/word/tn-maghreb.arpa" "../models/aggregated_country/lm/char/tn-maghreb.arpa" \
--token_mappings mappings/small_closed_class.map mappings/additional_closed_class.map \
--translation_system "mt-ar" "en-ar" "mt-ar"
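In this example, each tag in --translate is paired positionally with the system at the same position in --translation_system, which can be sketched as:

```python
# Sketch: positional pairing of etymology tags with translation systems,
# mirroring the Xara/Tara invocation above.
translate_tags = ["Non-Arabic", "Code-Switching", "Name"]
translation_systems = ["mt-ar", "en-ar", "mt-ar"]

# Map each tag to its translation system by position.
system_for_tag = dict(zip(translate_tags, translation_systems))
```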
Refer to the demo notebook for examples.
The latest version of this work is published as:
@misc{micallef-etal-2024-maltese-etymology,
title = "Cross-Lingual Transfer from Related Languages: Treating Low-Resource {M}altese as Multilingual Code-Switching",
author = "Micallef, Kurt and
Habash, Nizar and
Borg, Claudia and
Eryani, Fadhl and
Bouamor, Houda",
editor = "Graham, Yvette and
Purver, Matthew",
booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = mar,
year = "2024",
address = "St. Julian{'}s, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.eacl-long.61",
pages = "1014--1025",
}
The original transliteration system was published as:
@inproceedings{micallef-etal-2023-maltese-transliteration,
title = "Exploring the Impact of Transliteration on {NLP} Performance: Treating {M}altese as an {A}rabic Dialect",
author = "Micallef, Kurt and
Eryani, Fadhl and
Habash, Nizar and
Bouamor, Houda and
Borg, Claudia",
editor = "Gorman, Kyle and
Sproat, Richard and
Roark, Brian",
booktitle = "Proceedings of the Workshop on Computation and Written Language (CAWL 2023)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.cawl-1.4",
doi = "10.18653/v1/2023.cawl-1.4",
pages = "22--32",
}
For fine-tuning instructions & dataset references, see: https://github.com/MLRS/BERTu/tree/2022.deeplo-1.10/finetune