XGLM paper camera-ready: add XStoryCloze data opensource (#4820)

* add XStoryCloze data
* upload XStoryCloze dataset files to s3 instead of git
* minor fixes
* minor fixes
* minor fixes
* minor fixes
* fix broken dataset doc link
facebookresearch · May 8, 2023 · b35e8ef
1 parent 3f6ba43

Showing 3 changed files with 75 additions and 4 deletions.
diff --git a/examples/xglm/README.md b/examples/xglm/README.md
@@ -138,10 +138,24 @@ for lang in ['en', 'zh', 'hi']:
 # hi-1 0 0
 ```
 
-## Preprint
-[Few-shot Learning with Multilingual Language Models](https://arxiv.org/abs/2112.10668).
+## XStoryCloze
+
+We release XStoryCloze, a new multilingual dataset intended for few-shot evaluation, alongside this paper. XStoryCloze consists of professional translations of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) into 10 other languages. It is open-sourced under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode), the same license as the English StoryCloze.
+
+You can download the dataset via [this link](https://dl.fbaipublicfiles.com/xstorycloze.zip). 
+
+Language | ar | es | eu | hi | id | my | ru | sw | te | zh
+---|---|---|---|---|---|---|---|---|---|---
+Train size | 360 | 360 | 360 | 360 | 360 | 360 | 360 | 360 | 360 | 360  
+Eval size | 1511 | 1511 | 1511 | 1511 | 1511 | 1511 | 1511 | 1511 | 1511 | 1511
+
+Please refer to [the dataset doc](XStoryCloze.md) for more information.
+
+
+## Publication
+[Few-shot Learning with Multilingual Generative Language Models](https://arxiv.org/abs/2112.10668).
 Xi Victoria Lin*, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, Xian Li* (* Equal Contribution).
-ArXiv 2021.
+EMNLP 2022.
 
 ## Citation
 ```

diff --git a/examples/xglm/XStoryCloze.md b/examples/xglm/XStoryCloze.md
@@ -0,0 +1,57 @@
+XStoryCloze consists of professional translations of the [English StoryCloze dataset](https://cs.rochester.edu/nlp/rocstories/) (Spring 2016 version) into 10 other languages. This dataset is released by Meta AI alongside the paper [Few-shot Learning with Multilingual Generative Language Models. EMNLP 2022](https://arxiv.org/abs/2112.10668).
+
+# Languages
+ru, zh (Simplified), es (Latin America), ar, hi, id, te, sw, eu, my.
+
+# Data Splits
+This dataset is intended for evaluating the zero- and few-shot learning capabilities of multilingual language models. We split the data for each language into train and test sets (360 and 1511 examples, respectively). The released data files for the different languages are aligned line by line.
+
+# Access English StoryCloze
+Please request the original English StoryCloze dataset through the [official website](https://cs.rochester.edu/nlp/rocstories/). You can create a split of the en data following our data split scheme using the following commands:
+```
+head -361 spring2016.val.tsv > spring2016.val.en.tsv.split_20_80_train.tsv   # TSV header + first 360 examples

+head -1 spring2016.val.tsv > spring2016.val.en.tsv.split_20_80_eval.tsv   # TSV header
+tail -1511 spring2016.val.tsv >> spring2016.val.en.tsv.split_20_80_eval.tsv   # last 1511 examples
+```
+
+# License
+XStoryCloze is open-sourced under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode), the same license as the original English StoryCloze.
+
+# Citation
+We hope this dataset is helpful for the research and wider NLP community. If you use XStoryCloze in your work, please cite
+```
+@article{DBLP:journals/corr/abs-2112-10668,
+  author    = {Xi Victoria Lin and
+               Todor Mihaylov and
+               Mikel Artetxe and
+               Tianlu Wang and
+               Shuohui Chen and
+               Daniel Simig and
+               Myle Ott and
+               Naman Goyal and
+               Shruti Bhosale and
+               Jingfei Du and
+               Ramakanth Pasunuru and
+               Sam Shleifer and
+               Punit Singh Koura and
+               Vishrav Chaudhary and
+               Brian O'Horo and
+               Jeff Wang and
+               Luke Zettlemoyer and
+               Zornitsa Kozareva and
+               Mona T. Diab and
+               Veselin Stoyanov and
+               Xian Li},
+  title     = {Few-shot Learning with Multilingual Language Models},
+  journal   = {CoRR},
+  volume    = {abs/2112.10668},
+  year      = {2021},
+  url       = {https://arxiv.org/abs/2112.10668},
+  eprinttype = {arXiv},
+  eprint    = {2112.10668},
+  timestamp = {Tue, 04 Jan 2022 15:59:27 +0100},
+  biburl    = {https://dblp.org/rec/journals/corr/abs-2112-10668.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+```
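
The `head`/`tail` commands in `XStoryCloze.md` above can also be sketched in Python, for readers without a POSIX shell. This is a minimal sketch, not part of the commit: the function name `split_storycloze` and its parameters are hypothetical, and it assumes the input is a TSV file whose first line is a header, as the shell commands imply.

```python
from pathlib import Path


def split_storycloze(src, train_out, eval_out, n_train=360, n_eval=1511):
    """Hypothetical helper mirroring the head/tail split commands.

    Writes the header plus the first n_train rows to train_out, and the
    header plus the last n_eval rows to eval_out.
    """
    lines = Path(src).read_text(encoding="utf-8").splitlines(keepends=True)
    header, rows = lines[0], lines[1:]

    # Guard against a short file; otherwise the "last n_eval rows" slice
    # would silently return fewer rows than requested.
    if len(rows) < max(n_train, n_eval):
        raise ValueError(f"expected at least {max(n_train, n_eval)} rows, got {len(rows)}")

    # Equivalent of: head -361 src > train_out
    Path(train_out).write_text(header + "".join(rows[:n_train]), encoding="utf-8")

    # Equivalent of: head -1 src > eval_out; tail -1511 src >> eval_out
    Path(eval_out).write_text(header + "".join(rows[-n_eval:]), encoding="utf-8")
```

A call such as `split_storycloze("spring2016.val.tsv", "spring2016.val.en.tsv.split_20_80_train.tsv", "spring2016.val.en.tsv.split_20_80_eval.tsv")` would reproduce the file names used in the shell commands, assuming the requested English file sits in the working directory.
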
diff --git a/examples/xglm/model_card.md b/examples/xglm/model_card.md
@@ -50,7 +50,7 @@ The Cross-lingual Natural Language Inference (XNLI) corpus is the extension of t
 
 ### XStoryCloze
 #### Description
-A new dataset created by Meta AI by translating the validation split of the English StoryCloze dataset (Mostafazadeh et al., 2016) (Spring 2016 version) to 10 other typologically diverse languages (ru, zh Simplified, es Latin America, ar, hi, id, te, sw, eu, my).
+A new dataset created by Meta AI alongside this work by translating the validation split of the English StoryCloze dataset (Mostafazadeh et al., 2016) (Spring 2016 version) to 10 other typologically diverse languages (ru, zh Simplified, es Latin America, ar, hi, id, te, sw, eu, my).
 
 ### XCOPA (Ponti et al., 2020)
 #### Description