Mukayese: Turkish NLP Strikes Back

Turkish Natural Language Processing has fallen behind in developing state-of-the-art systems because it lacks organized benchmarks and baselines. We fill this gap with Mukayese (Turkish for "comparison/benchmarking"), an extensive set of datasets and benchmarks for several Turkish NLP tasks. All datasets and code are publicly available in this repository.

Ali Safaya, Emirhan Kurtuluş, Arda Goktogan, and Deniz Yuret. 2022. Mukayese: Turkish NLP Strikes Back. In Findings of the Association for Computational Linguistics: ACL 2022, pages 846–863, Dublin, Ireland. Association for Computational Linguistics.

Updates

(25/05/2022) The paper has been accepted to Findings of ACL'22.
(22/03/2022) Summarization models are online on Hugging Face! Download here
(01/03/2022) The paper is on arXiv. View here.
(25/02/2022) Datasets have been made available through pre-release v0.0.1.

What to do with Mukayese?

With Mukayese, researchers of Turkish NLP will be able to:

Compare the performance of existing methods in leaderboards.
Access existing implementations of NLP baselines.
Evaluate their own methods on the relevant test datasets.
Submit their own work to be listed on our leaderboards.

Mukayese's Mission

The main goal of Mukayese is to standardize the comparison and evaluation of Turkish NLP methods. Without a shared benchmarking platform, Turkish NLP researchers struggle to compare their models against existing ones.

Maintainers

Ali Safaya - @alisafaya
Emirhan Kurtuluş - @ekurtulus
Arda Göktoğan - @ardofski

Mukayese Tasks

This repository collects the documentation needed to reproduce the Mukayese baselines. Baselines are listed by task below:

Language Modeling

Datasets

Baselines

Machine Translation (EN/TR)

Datasets

WMT16
MuST-C

Baselines

Named Entity Recognition

Datasets

Baselines

Sentence Segmentation

Datasets

trseg-41

Baselines

Spell-checking and Correction

Datasets

trspell-10

Baselines

Summarization

Datasets

trsum

Baselines

Download trained models here

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("mukayese/mt5-base-turkish-sum")
model = AutoModelForSeq2SeqLM.from_pretrained("mukayese/mt5-base-turkish-sum")

article = """Fransız devi PSG'nin üzerindeki kara bulutlar dağılmıyor. 
Devler Ligi'nde Real Madrid'e olaylı şekilde boyun eğen başkent temsilcisinde oyuncuların gruplaşmaya başladığı öne sürüldü. 
Güney Amerikalılar ve Fransızca konuşanlar olarak ikiye ayrılan oyuncuların saha içerisinde de birbirlerine uzak olduğu iddia edildi. 
İşte PSG'de soyunma odasında yaşananlar ve 20 milyon avroluk tazminat ihtimali... 
UEFA Şampiyonlar Ligi'nde Real Madrid'e sansasyonel bir şekilde elenen Paris Saint Germain'de Kylian Mbappe haricindeki tüm oyunculara yönelik taraftar tepkisinin devam etmesi başkent temsilcisindeki krizi derinleştirdi.
RMC Sport'ta yer alan haberde;
Paris Saint Germain'in soyunma odasında işlerin yolunda gitmediği ve futbolcuların iki gruba ayrıldığı öne sürüldü. İddiaya göre oyuncular gruplaşmaya başladı ve aralarındaki iletişim her geçen gün zayıflıyor."""

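# truncation=True is needed for max_length to actually cut off long articles
inputs = tokenizer([article], max_length=1024, truncation=True, return_tensors="pt")

# Beam search with 6 beams; the summary is capped at 100 tokens
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    num_beams=6,
    max_length=100,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])

# Output:
# "UEFA Şampiyonlar Ligi'nde Real Madrid'e olaylı şekilde boyun eğen Paris Saint Germain'de oyuncuların gruplaşmaya başladığı öne sürüldü."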
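
For a quick sanity check of a generated summary against a reference, a simple unigram-overlap F1 score may be enough. The sketch below is only an illustration under stated assumptions: the function name `unigram_f1`, the whitespace tokenization, and both example sentences are hypothetical, and this is not the evaluation script behind the Mukayese leaderboards.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Return the F1 score of the unigram overlap between two texts."""
    # Whitespace tokenization without lowercasing: Turkish casing
    # (I/ı and İ/i) needs locale-aware handling, so case is left untouched.
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    if not cand_counts or not ref_counts:
        return 0.0

    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical strings for illustration only; they are not taken from any dataset.
generated = "PSG'de oyuncuların gruplaşmaya başladığı öne sürüldü."
reference = "PSG'de oyuncuların iki gruba ayrıldığı öne sürüldü."
score = unigram_f1(generated, reference)
print(f"unigram F1: {score:.3f}")
```

A score like this ignores word order and Turkish morphology, so it is best treated as a rough signal; for numbers comparable to published results, the evaluation setup documented for the task's baselines should be used.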

Text Classification

Datasets

Baselines

Citation

@inproceedings{safaya-etal-2022-mukayese,
    title = "Mukayese: {T}urkish {NLP} Strikes Back",
    author = "Safaya, Ali  and
      Kurtulu{\c{s}}, Emirhan  and
      Goktogan, Arda  and
      Yuret, Deniz",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-acl.69",
    doi = "10.18653/v1/2022.findings-acl.69",
    pages = "846--863"
}
