twitter-research/multilingual-alignment-tpp


Links: Open all notebooks in Colab | Binder | arXiv | Poster | Slides | YouTube video


Code for reproducing the paper Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction, presented at the 7th Workshop on Noisy User-generated Text (W-NUT 2021) at EMNLP 2021.

Abstract

We evaluate a simple approach to improving zero-shot multilingual transfer of mBERT on social media corpus by adding a pretraining task called translation pair prediction (TPP), which predicts whether a pair of cross-lingual texts are a valid translation. Our approach assumes access to translations (exact or approximate) between source-target language pairs, where we fine-tune a model on source language task data and evaluate the model in the target language. In particular, we focus on language pairs where transfer learning is difficult for mBERT: those where source and target languages are different in script, vocabulary, and linguistic typology. We show improvements from TPP pretraining over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and Japanese on two social media tasks: NER (a 37% average relative improvement in F1 across target languages) and sentiment classification (12% relative improvement in F1) on social media text, while also benchmarking on a non-social media task of Universal Dependency POS tagging (6.7% relative improvement in accuracy). Our results are promising given the lack of social media bitext corpus.

Citation

Please cite as:

Mishra, S., & Haghighi, A. (2021). Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 381-388. Association for Computational Linguistics. arXiv:2110.10318

@inproceedings{mishra-haghighi-2021-improved,
   title = "Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction",
   author = "Mishra, Shubhanshu  and
     Haghighi, Aria",
   booktitle = "Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)",
   month = nov,
   year = "2021",
   address = "Online",
   publisher = "Association for Computational Linguistics",
   url = "https://aclanthology.org/2021.wnut-1.42",
   pages = "381--388",
   eprint = "2110.10318",
   archiveprefix = "arXiv",
   abstract = "We evaluate a simple approach to improving zero-shot multilingual transfer of mBERT on social media corpus by adding a pretraining task called translation pair prediction (TPP), which predicts whether a pair of cross-lingual texts are a valid translation. Our approach assumes access to translations (exact or approximate) between source-target language pairs, where we fine-tune a model on source language task data and evaluate the model in the target language. In particular, we focus on language pairs where transfer learning is difficult for mBERT: those where source and target languages are different in script, vocabulary, and linguistic typology. We show improvements from TPP pretraining over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and Japanese on two social media tasks: NER (a 37{\%} average relative improvement in F1 across target languages) and sentiment classification (12{\%} relative improvement in F1) on social media text, while also benchmarking on a non-social media task of Universal Dependency POS tagging (6.7{\%} relative improvement in accuracy). Our results are promising given the lack of social media bitext corpus. Our code can be found at: https://github.com/twitter-research/multilingual-alignment-tpp.",
}

Reproducibility

The following steps reproduce the experiments in the paper:

  1. Run mBERT fine-tuning with the translation pair prediction (TPP) objective.
  2. Fine-tune the resulting model on a specific task (NER, POS tagging, or sentiment classification).

Both steps can be run via the notebooks in ./notebooks/; a minimal sketch of the TPP objective is included below.

More details can be found in the paper.
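
For orientation, the snippet below is a minimal sketch of the TPP objective described in the paper: binary classification of whether a (source, target) text pair is a valid translation, trained on top of mBERT. It is written against the Hugging Face transformers API rather than this repository's notebooks, and the example pairs, model name, and single gradient step are illustrative placeholders only.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# mBERT backbone; TPP is a binary head over the [CLS] token of a (source, target) pair.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # 1 = valid translation pair, 0 = mismatched
)

# Toy (English, target-language, label) triples -- placeholders, not the paper's data.
pairs = [
    ("good morning everyone", "सबको सुप्रभात", 1),          # matching en-hi pair
    ("good morning everyone", "今日は雨が降っています", 0),   # mismatched en-ja pair
]
texts_a = [src for src, _, _ in pairs]
texts_b = [tgt for _, tgt, _ in pairs]
labels = torch.tensor([label for _, _, label in pairs])

# Encode each pair as a single "[CLS] source [SEP] target [SEP]" sequence.
batch = tokenizer(texts_a, texts_b, padding=True, truncation=True, return_tensors="pt")

# One illustrative optimization step on the pair-prediction loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()

# After TPP training, the encoder weights would be reused for task fine-tuning
# (NER, POS tagging, sentiment), as in step 2 above.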

Datasets

We provide example formats of the datasets in the ./data folder. The NER data for English, Arabic, and Japanese is internal and not publicly released. Details on processing the data can be found in the ./src folder.

Security Issues?

Please report sensitive security issues via Twitter's bug-bounty program (https://hackerone.com/twitter) rather than GitHub.
