WikiSection Dataset

This dataset contains 38k full-text documents from English and German Wikipedia annotated with sections. Each sections contains two labels: the original unfolded section heading given by the Wikipedia editor (e.g. Treatment | Surgery), and a normalized section class label (e.g. disease.treatment).

The WikiSection task is to assign each sentence in the document a corresponding section class in which the sentence appears. The document text itself does not contain structure information, such as sections, subsections, paragraphs or headlines. Newline characters are included and occur between sections as well as inside sections.

The documents originate from Wikipedia dumps avaliable as CC BY-SA 3.0 at dumps.wikimedia.org/enwiki/20180101/ (English) and dumps.wikimedia.org/dewiki/20180101/ (German). The data sets are filtered by instances of Wikidata classes Q12136 (disease) and Q515 (city). Datasets are randomized and split into 70% training, 10% validation and 20% test documents.

Data Set Overview

The following table shows the characteristics of the four subsets:

dataset	en_disease	de_disease	en_city	de_city	total
language	English	German	English	German
instanceof	Q12136	Q12136	Q515	Q515

# docs	3,590	2,323	19,539	12,537	38k
# sections	27,838	14,784	133,642	65,907	242k
# sentences	209,885	106,198	1,104,619	500,100	1.9M

# heaadings	~8.5k	~6.1k	~23.0k	~12.2k
# classes	27	25	30	27
coverage	94.6%	89.5%	96.6%	96.1%

The numbers for docs, sections and sentences denote the total number of instances in the subset. headlines denotes the number of distinct section and subsection headings among documents. classes denotes the number of class labels after normalization and pruning the long tail. coverage denotes the proportion of sections covered by the class labels. The remaining sections are given the class other.

Data Format

The WikiSection dataset is provided in JSON format.

Field	Description
`type`	`city` or `disease`
`title`	Title of the article
`abstract`	Full text of the article abstract
`text`	Full text of the document body without abstract, headings and structure elements. Format: utf-8, not tokenized, includes newlines
`annotations`	List of sections
`class`	`SectionAnnotation`
`begin`	Begin offset in the document. Format: number of characters starting at 0. Newlines and escaped symbols count as one character
`length`	Length of the section. Format: number of characters
`sectionHeading`	Original heading of the section. Nested headings are unfolded and segmented by `
`sectionLabel`	Normalized class label

Alternatively, we provide the same data in a REF format that consists of one file per document and one sentence per line.

Example JSON

{
  "id" : "https://en.wikipedia.org/wiki/Autoimmune_polyendocrine_syndrome",
  "type" : "disease",
  "title" : "Autoimmune polyendocrine syndrome",
  "abstract" : "Autoimmune polyendocrine syndromes (APSs), also called [...]",
  "text" : "Each \"type\" of this condition has a different cause, in terms of [...]",
  "annotations" : [ {
    "class" : "SectionAnnotation",
    "begin" : 0,
    "length" : 238,
    "sectionHeading" : "Cause",
    "sectionLabel" : "disease.cause"
  }, ... ]
}

Table of Class Labels

en_disease	de_disease	en_city	de_city
cause	definition	architecture	architektur
classification	diagnose	climate	bildung
complication	epidemiologie	crime	demografie
culture	fauna	culture	erholung
diagnosis	forschung	demography	etymologie
epidemiology	genetik	district	gemeinde
etymology	geographie	economics	gemeindepartnerschaft
fauna	geschichte	education	geographie
genetics	infektion	environment	geschichte
geography	kategorisierung	etymology	infrastruktur
history	klinik	facility	kirche
infection	komplikation	faith	klima
management	mensch	geography	kriminalität
mechanism	organe	health	kultur
medication	pathologie	history	menschen
pathology	prävalenz	infrastructure	politik
pathophysiology	prognose	international_affairs	presse
prevention	risiko	law	regierung
prognosis	symptom	media	religion
research	terminologie	overview	sport
risk	therapie	people	stadtlandschaft
screening	ursache	politics	stadtviertel
surgery	verlauf	recreation	tourismus
symptom	vorbeugung	science	überblick
tomography	sonstiges	sights	verkehr
treatment		society	wirtschaft
other		sport	sonstiges
		tourism
		transport
		other

Credit

If you use this dataset for research, please cite:

Sebastian Arnold, Rudolf Schneider, Philippe Cudré-Mauroux, Felix A. Gers and Alexander Löser. SECTOR: A Neural Model for Coherent Topic Segmentation and Classification

@article{arnold2019sector,
  author = {Arnold, Sebastian and Schneider, Rudolf and Cudré-Mauroux, Philippe and Gers, Felix A. and Löser, Alexander},
  title = {SECTOR: A Neural Model for Coherent Topic Segmentation and Classification},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {7},
  pages = {169-184},
  year = {2019},
  doi = {10.1162/tacl\_a\_00261}
}

License

This dataset uses material from Wikipedia articles listed in the SOURCES file, which are released under the Creative Commons Attribution-ShareAlike 3.0 Unported License. You should have received a copy of the license along with this work. If not, see [http://creativecommons.org/licenses/by-sa/3.0/].

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
CREDIT		CREDIT
LICENSE		LICENSE
README.md		README.md
SOURCES		SOURCES
wikisection_dataset_json.tar.gz		wikisection_dataset_json.tar.gz
wikisection_dataset_ref.tar.gz		wikisection_dataset_ref.tar.gz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CREDIT

CREDIT

LICENSE

LICENSE

README.md

README.md

SOURCES

SOURCES