WikiBio (wikipedia biography dataset)

This dataset gathers 728,321 biographies from Wikipedia. It is intended for evaluating text generation algorithms. For each article, we provide the first paragraph and the infobox (both tokenized). It was used in our work:

Neural Text Generation from Structured Data with Application to the Biography Domain
Rémi Lebret, David Grangier and Michael Auli, EMNLP 2016,
http://arxiv.org/abs/1603.07771

This publication provides further information about the data, and we kindly ask you to cite this paper when using the data. The data was extracted from the English Wikipedia dump (enwiki-20150901), relying on the articles referred to by WikiProject Biography [1].

For each article, we extracted the first paragraph (text) and the infobox (structured data). Each infobox is encoded as a list of (field name, field value) pairs. We used Stanford CoreNLP [2] to preprocess the data, i.e. we broke the text into sentences and tokenized both the text and the field values. The dataset was randomly split into three subsets: train (80%), valid (10%) and test (10%). We strongly recommend using test only for the final evaluation.

The data is organised in three subdirectories for train, valid and test. Each directory contains 7 files (SET below stands for the name of the subset):

SET.id contains the list of Wikipedia ids, one article per line.
SET.url contains the url of the Wikipedia articles, one article per line.
SET.box contains the infobox data, one article per line.
SET.nb contains the number of sentences per article, one article per line.
SET.sent contains the sentences, one sentence per line.
SET.title contains the title of the Wikipedia article, one per line.
SET.contributors contains the url of the Wikipedia article history, which lists the authors of the article.

Hence the files are aligned so that the information for one article can be retrieved by line number. The exception is SET.sent, which holds one sentence per line: use SET.nb to split the sentences per article. The infobox data in SET.box is encoded as follows: each line encodes one box, each box is a list of tab-separated tokens, and each token has the form fieldname_position:wordtype. A field that is empty or contains no readable tokens is marked with fieldname:. For instance, the first box of the valid set starts with

type_1:pope name_1:michael name_2:iii name_3:of name_4:alexandria title_1:56th title_2:pope title_3:of title_4:alexandria title_5:& title_6:patriarch title_7:of title_8:the title_9:see title_10:of title_11:st. title_12:mark image:

which indicates that the field "type" contains 1 token "pope", the field "name" contains 4 tokens "michael iii of alexandria", the field "title" contains 12 tokens "56th pope of alexandria & patriarch of the see of st. mark", the field "image" is empty.
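
As an illustration of the format described above, here is a minimal Python sketch for reading the data. It is not part of the original release: the function names parse_box and group_sentences are hypothetical, and the sketch assumes that the tokens of a field appear in position order within a line and that the position suffix is always a number.

```python
from itertools import islice


def parse_box(line):
    """Parse one SET.box line into a {field name: [tokens]} dict."""
    fields = {}
    # Tokens are described as tab separated; split() without arguments
    # accepts any whitespace, so copies with spaces instead of tabs also work.
    for item in line.split():
        # Split on the first colon only, so a value containing ":" stays whole.
        key, _, value = item.partition(":")
        # Remove a trailing "_<position>" suffix. Field names may themselves
        # contain "_", so only a numeric suffix is treated as a position.
        name, sep, pos = key.rpartition("_")
        if not (sep and pos.isdigit()):
            name = key  # empty-field marker such as "image:"
        tokens = fields.setdefault(name, [])
        if value:
            # Assumption: tokens are listed in position order.
            tokens.append(value)
    return fields


def group_sentences(counts, sentences):
    """Yield one list of sentences per article.

    counts: lines of SET.nb (number of sentences per article)
    sentences: lines of SET.sent (one sentence per line)
    """
    it = iter(sentences)
    for n in counts:
        yield list(islice(it, int(n)))


if __name__ == "__main__":
    box = parse_box("type_1:pope\tname_1:michael\tname_2:iii\timage:")
    print(box)
    print(list(group_sentences(["2", "1"], ["s1", "s2", "s3"])))
```

With real data, both functions can be fed open file objects directly, for example the lines of the box file of the valid set for parse_box, and the nb and sent files of the same subset for group_sentences. The exact paths depend on where the archive was extracted.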

[1] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Biography
[2] http://stanfordnlp.github.io/CoreNLP/

Version Information

v1.0 (this version) Initial Release.

License

License information is provided in LICENSE.txt

Decompressing zip files

We split the archive into multiple files. To extract it, run
cat wikipedia-biography-dataset.z?? > tmp.zip
unzip tmp.zip
rm tmp.zip
