XGLM paper camera-ready: add XStoryCloze data opensource (#4820)
* add XStoryCloze data

* upload XStoryCloze dataset files to s3 instead of git

* minor fixes

* minor fixes

* minor fixes

* minor fixes

* fix broken dataset doc link
todpole3 committed May 8, 2023
1 parent 3f6ba43 commit b35e8ef
Showing 3 changed files with 75 additions and 4 deletions.
20 changes: 17 additions & 3 deletions examples/xglm/README.md
@@ -138,10 +138,24 @@ for lang in ['en', 'zh', 'hi']:
# hi-1 0 0
```

## Preprint
[Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668).
## XStoryCloze

We release XStoryCloze, a new multilingual dataset intended for few-shot evaluation, alongside this paper. XStoryCloze consists of professional translations of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) into 10 other languages. It is open-sourced under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode), the same license as the English StoryCloze.

You can download the dataset via [this link](https://dl.fbaipublicfiles.com/xstorycloze.zip).

Language | ar | es | eu | hi | id | my | ru | sw | te | zh
---|---|---|---|---|---|---|---|---|---|---
Train size | 360 | 360 | 360 | 360 | 360 | 360 | 360 | 360 | 360 | 360
Eval size | 1511 | 1511 | 1511 | 1511 | 1511 | 1511 | 1511 | 1511 | 1511 | 1511

Please refer to [the dataset doc](XStoryCloze.md) for more information.
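
For readers who prefer to script the download, here is a minimal Python sketch using only the standard library. The URL is the one given above; the helper names `download` and `extract` and the local paths are hypothetical, and nothing is assumed about the layout of the files inside the archive.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List

# URL taken from the download link in this README.
XSTORYCLOZE_URL = "https://dl.fbaipublicfiles.com/xstorycloze.zip"


def download(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless the file already exists (hypothetical helper)."""
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract(archive: Path, out_dir: Path) -> List[str]:
    """Extract a zip archive into `out_dir` and return the sorted member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
        return sorted(zf.namelist())


# Example usage (not executed here; the local paths are illustrative):
#   archive = download(XSTORYCLOZE_URL, Path("data/xstorycloze.zip"))
#   print(extract(archive, Path("data/xstorycloze")))
```

Listing the member names after extraction is a simple way to discover the actual file layout instead of guessing it.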


## Publication
[Few-shot Learning with Multilingual Generative Language Models](https://arxiv.org/abs/2112.10668).
Xi Victoria Lin*, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li* (* Equal Contribution).
ArXiv 2021.
EMNLP 2022.

## Citation
```
57 changes: 57 additions & 0 deletions examples/xglm/XStoryCloze.md
@@ -0,0 +1,57 @@
XStoryCloze consists of professional translations of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) into 10 other languages. This dataset is released by Meta AI alongside the paper [Few-shot Learning with Multilingual Generative Language Models. EMNLP 2022](https://arxiv.org/abs/2112.10668).

# Languages
ru, zh (Simplified), es (Latin America), ar, hi, id, te, sw, eu, my.

# Data Splits
This dataset is intended for evaluating the zero- and few-shot learning capabilities of multilingual language models. We split the data for each language into train and test sets (360 and 1511 examples, respectively). The released data files for different languages maintain a line-by-line alignment.
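
Because the files are line-aligned, the same line index refers to the same story in every language. The sketch below illustrates one way to exploit that, assuming the rows of each language file have already been read into lists. The function name `align_by_line` and the toy rows are hypothetical and not part of the release.

```python
from typing import Dict, List


def align_by_line(rows_by_lang: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Zip per-language rows into one record per line index."""
    lengths = {lang: len(rows) for lang, rows in rows_by_lang.items()}
    if len(set(lengths.values())) > 1:
        # Alignment only makes sense if every file has the same number of rows.
        raise ValueError(f"files are not line-aligned: {lengths}")
    langs = list(rows_by_lang)
    return [
        dict(zip(langs, parallel_rows))
        for parallel_rows in zip(*(rows_by_lang[lang] for lang in langs))
    ]


# Toy rows standing in for the real files (contents are made up).
toy = {
    "es": ["historia 1", "historia 2"],
    "ru": ["история 1", "история 2"],
}
records = align_by_line(toy)
print(records[1])  # the second story in each language
```

Checking the row counts first turns a silent misalignment into an explicit error.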

# Access English StoryCloze
Please request the original English StoryCloze dataset through the [official website](https://cs.rochester.edu/nlp/rocstories/). You can create a split of the English data that follows our data split scheme using the following commands:
```
# Train split: TSV header plus the first 360 examples
head -n 361 spring2016.val.tsv > spring2016.val.en.tsv.split_20_80_train.tsv
# Eval split: TSV header plus the last 1511 examples
head -n 1 spring2016.val.tsv > spring2016.val.en.tsv.split_20_80_eval.tsv
tail -n 1511 spring2016.val.tsv >> spring2016.val.en.tsv.split_20_80_eval.tsv
```
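
The same split can be sketched in Python, which makes the arithmetic explicit: the commands above imply one header line followed by 360 + 1511 = 1871 examples. The function name `split_storycloze` is hypothetical, and the row-count check is an addition that the shell commands do not perform.

```python
from typing import List, Tuple


def split_storycloze(
    lines: List[str], n_train: int = 360, n_eval: int = 1511
) -> Tuple[List[str], List[str]]:
    """Split TSV lines (header first) into train and eval lists, each with the header."""
    header, rows = lines[0], lines[1:]
    if len(rows) != n_train + n_eval:
        # Extra safety check (not in the shell version): guarantees the
        # first n_train rows and the last n_eval rows do not overlap.
        raise ValueError(f"expected {n_train + n_eval} rows, got {len(rows)}")
    train = [header] + rows[:n_train]   # mirrors `head` with 361 lines
    eval_ = [header] + rows[-n_eval:]   # mirrors `head` with 1 line plus `tail` with 1511
    return train, eval_


# Synthetic stand-in for spring2016.val.tsv (row contents are made up).
toy_lines = ["header"] + [f"row{i}" for i in range(1871)]
train_lines, eval_lines = split_storycloze(toy_lines)
print(len(train_lines), len(eval_lines))
```

To apply it to the real file, read `spring2016.val.tsv` with `open(...).read().splitlines()` and write each returned list back out with newline separators.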

# License
XStoryCloze is open-sourced under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode), the same license as the original English StoryCloze.

# Citation
We hope this dataset is helpful to the research and wider NLP communities. If you use XStoryCloze in your work, please cite:
```
@article{DBLP:journals/corr/abs-2112-10668,
  author     = {Xi Victoria Lin and
                Todor Mihaylov and
                Mikel Artetxe and
                Tianlu Wang and
                Shuohui Chen and
                Daniel Simig and
                Myle Ott and
                Naman Goyal and
                Shruti Bhosale and
                Jingfei Du and
                Ramakanth Pasunuru and
                Sam Shleifer and
                Punit Singh Koura and
                Vishrav Chaudhary and
                Brian O'Horo and
                Jeff Wang and
                Luke Zettlemoyer and
                Zornitsa Kozareva and
                Mona T. Diab and
                Veselin Stoyanov and
                Xian Li},
  title      = {Few-shot Learning with Multilingual Language Models},
  journal    = {CoRR},
  volume     = {abs/2112.10668},
  year       = {2021},
  url        = {https://arxiv.org/abs/2112.10668},
  eprinttype = {arXiv},
  eprint     = {2112.10668},
  timestamp  = {Tue, 04 Jan 2022 15:59:27 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
2 changes: 1 addition & 1 deletion examples/xglm/model_card.md
@@ -50,7 +50,7 @@ The Cross-lingual Natural Language Inference (XNLI) corpus is the extension of t

### XStoryCloze
#### Description
A new dataset created by Meta AI by translating the validation split of the English StoryCloze dataset (Mostafazadeh et al., 2016) (Spring 2016 version) to 10 other typologically diverse languages (ru, zh Simplified, es Latin America, ar, hi, id, te, sw, eu, my).
A new dataset created by Meta AI alongside this work by translating the validation split of the English StoryCloze dataset (Mostafazadeh et al., 2016) (Spring 2016 version) to 10 other typologically diverse languages (ru, zh Simplified, es Latin America, ar, hi, id, te, sw, eu, my).

### XCOPA (Ponti et al., 2020)
#### Description