Data for the DIALECT-COPA shared task of dialectal causal commonsense reasoning

This repository contains the training and validation data for the DIALECT-COPA shared task.

Each folder contains training data (400 instances, train.jsonl) and validation data (100 instances, val.jsonl) from the original COPA dataset, human-translated into the respective language or dialect. Languages that (also) use the Cyrillic alphabet have additional files transliterated into the Latin alphabet (train.trans.jsonl and val.trans.jsonl).
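As a minimal sketch of reading these files, the snippet below loads a JSON Lines file (one JSON object per line) into a list of dictionaries. The helper name `load_jsonl` and the local checkout path are hypothetical, and the sketch makes no assumption about which fields each instance contains.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Hypothetical location: assumes the repository was cloned into the
    # current working directory.
    train_path = Path("copa-sl") / "train.jsonl"
    if train_path.exists():
        train = load_jsonl(train_path)
        print(len(train), "training instances")
        print(train[0])
```

Printing the first instance is a quick way to inspect the actual field names before writing any task-specific code.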

Bear in mind that each dialect is related primarily to its corresponding standard language, but that the standard languages are closely related to one another as well.

copa-sl - Slovenian language
copa-sl-cer - The Cerkno dialect of the Slovenian language
copa-hr - Croatian language
copa-sr - Serbian language
copa-sr-tor - The Torlak dialect of the Serbian, Macedonian, and Bulgarian languages
copa-mk - Macedonian language
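The folder layout above can be checked against a local copy with a short sketch like the following. The set `CYRILLIC_FOLDERS` is an assumption inferred from the transliterated examples in this README (Serbian, Torlak, Macedonian), and the helper names `expected_files` and `missing_files` are hypothetical.

```python
from pathlib import Path

# Folder names as listed in this README.
FOLDERS = ["copa-sl", "copa-sl-cer", "copa-hr", "copa-sr", "copa-sr-tor", "copa-mk"]

# Assumption: these are the folders that ship transliterated files,
# inferred from the transliterated examples shown in this README.
CYRILLIC_FOLDERS = {"copa-sr", "copa-sr-tor", "copa-mk"}


def expected_files(folder):
    """Return the relative paths of the files expected in one folder."""
    names = ["train.jsonl", "val.jsonl"]
    if folder in CYRILLIC_FOLDERS:
        names += ["train.trans.jsonl", "val.trans.jsonl"]
    return [Path(folder) / name for name in names]


def missing_files(root="."):
    """List expected files that are absent under the given root directory."""
    root = Path(root)
    return [
        str(path)
        for folder in FOLDERS
        for path in expected_files(folder)
        if not (root / path).exists()
    ]


if __name__ == "__main__":
    for path in missing_files("."):
        print("missing:", path)
```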

In the testing phase of the shared task, test data from the copa-sl-cer and copa-sr-tor datasets will be shared with the participants, along with the test data of the copa-hr-ckm dataset, which covers the surprise dialect of the shared task: the Chakavian dialect of the Croatian language.

Examples from the datasets

To give participants a feel for the level of diversity in the data, a few examples are given below. Note that an example of the "surprise" Chakavian Croatian dialect is included as well.

Premise and correct alternative from the first instance in the validation dataset

English: The man turned on the faucet. Water flowed from the spout.
Slovenian: Moški je odprl pipo. Iz ustja pipe je pritekla voda.
Cerkno dialect: Dic je adparu pipa. Iz pipe je partjekla uoda.
Croatian: Muškarac je otvorio slavinu. Voda je potekla iz mlaznice.
Chakavian dialect: Muški je otpra špino. Oda je počela teć z mlaznici.
Serbian: Човек је отворио славину. Вода је текла из славине.
Serbian (transliterated): Čovek je otvorio slavinu. Voda je tekla iz slavine.
Torlak dialect: Човек одврнуја славину. Вода истичала од славину.
Torlak dialect (transliterated): Čovek odvrnuja slavinu. Voda ističala od slavinu.
Macedonian: Човекот ја отвори славината. Истече вода од славината.
Macedonian (transliterated): Čovekot ja otvori slavinata. Isteče voda od slavinata.

Premise and correct alternative from the second instance in the validation dataset

English: The girl found a bug in her cereal. She lost her appetite.
Slovenian: Dekle je v kosmičih našlo žuželko. Izgubila je apetit.
Cerkno dialect: Zjala je najdla hruošče u kosmičih. Zgubila je apetit.
Croatian: Djevojka je pronašla kukca u žitaricama. Izgubila je apetit.
Chakavian dialect: Mlada je našla neko blago va žitaricah. Je zgubila tiek.
Serbian: Девојчица је пронашла бубу у житарицама. Изгубила је апетит.
Serbian (transliterated): Devojčica je pronašla bubu u žitaricama. Izgubila je apetit.
Torlak dialect: Девојчица нашла бубаљку међу њојне житарице. Изгубила си апетит.
Torlak dialect (transliterated): Devojčica našla bubaljku među njojne žitarice. Izgubila si apetit.
Macedonian: Девојката пронајде бубачка во нејзините житарки. Изгуби апетит.
Macedonian (transliterated): Devojkata pronajde bubačka vo nejzinite žitarki. Izgubi apetit.
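To illustrate how the Cyrillic and transliterated lines above relate, here is a minimal sketch of a character-level Serbian Cyrillic-to-Latin mapping. It is not necessarily the procedure used to produce the train.trans.jsonl and val.trans.jsonl files, and the names `SR_CYR_TO_LAT` and `transliterate_sr` are hypothetical. The table covers only the 30 letters of the Serbian Cyrillic alphabet, so Macedonian-specific letters pass through unchanged.

```python
# Standard correspondence between Serbian Cyrillic and Serbian Latin letters.
SR_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def transliterate_sr(text):
    """Map Serbian Cyrillic letters to Latin; leave other characters as-is."""
    out = []
    for char in text:
        latin = SR_CYR_TO_LAT.get(char.lower())
        if latin is None:
            out.append(char)  # punctuation, spaces, non-Serbian letters
        elif char.isupper():
            out.append(latin.capitalize())  # e.g. "Љ" becomes "Lj"
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(transliterate_sr("Човек је отворио славину. Вода је текла из славине."))
    # prints: Čovek je otvorio slavinu. Voda je tekla iz slavine.
```

Uppercase digraphs are rendered in title case ("Lj", "Nj", "Dž"), which suits sentence-initial capitals but would be wrong for words written entirely in capital letters.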
