
DiMSUM 2016 shared task data

December 28, 2015

Anders Johannsen
Nathan Schneider
Dirk Hovy
Marine Carpuat

This release contains data and scripts for the DiMSUM shared task at SemEval 2016.

Data files

dimsum16.train, task training data

The training data combines and harmonizes three datasets: the STREUSLE 2.1 corpus of web reviews and the Ritter and Lowlands Twitter datasets. The Ritter and Lowlands datasets have been reannotated for MWEs and supersenses to improve their quality and to follow the conventions of the STREUSLE annotations more closely. Our harmonization also consisted of:

  • updating the POS tags to the 17 Universal POS categories;
  • naming supersenses in the form n.person;
  • removing STREUSLE class labels that are not proper supersenses (such as `a = auxiliary, `p = preposition, ?? = unintelligible);
  • removing weak MWE links in the STREUSLE data;
  • separating the MWE position and supersense into different fields;
  • and listing the supersense only for the first token of any expression.

In this final release of the training data, a couple of differences between the component datasets remain:

  • The Lowlands Twitter dataset replaces usernames, URLs, and numbers with special symbols, whereas the original text is always preserved in the other datasets.
  • The Universal POS tags in the Twitter datasets do not use the new subordinating conjunction category SCONJ. Subordinating conjunctions are instead labeled as adpositions (ADP) or conjunctions (CONJ).

dimsum16.test.blind, task test input

This is in the same format as the training data, except without MWE and supersense annotations, which are to be predicted by the system:

  • there is no supersense label (column 8 is blank)
  • MWE tags (column 5) are all O, and MWE parent offsets (column 6) are all 0, indicating that no MWEs are marked
  • sentence IDs (column 9) are opaque, so they reveal neither the sentence's source dataset nor its position relative to other sentences; the sentences in this file are listed in random order
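
As a rough sanity check on these conventions, a sketch along the following lines could confirm that the gold columns of a blind file are neutral. This is not one of the release scripts; the function name and the commented file path are just examples.

    def check_blind(path):
        """Assert that a blind test file carries no gold MWE or supersense annotations."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:               # blank lines separate sentences
                    continue
                cols = line.split("\t")
                assert cols[4] == "O"      # column 5: MWE tag is O everywhere
                assert cols[5] == "0"      # column 6: MWE parent offset is 0 everywhere
                assert cols[7] == ""       # column 8: supersense label is blank

    # check_blind("dimsum16.test.blind")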

Composition

The test set consists of 16,500 words in 1,000 English sentences drawn from several sources.

More precise information on the composition and preparation of the test corpus will be announced after the end of the task evaluation period.

File format

The DiMSUM files have tab-separated columns in the spirit of CoNLL, with blank lines to separate sentences.

Nine tab-separated columns:

  1. token offset
  2. word
  3. lowercase lemma
  4. POS
  5. MWE tag
  6. offset of parent token (i.e. previous token in the same MWE), if applicable
  7. strength level encoded in the tag, if applicable (currently not used)
  8. supersense label, if applicable
  9. sentence ID

Fields 5, 6, and 8 need to be predicted at test time; the rest will be present in the input. Field 6 can be deterministically filled in given the tagging in field 5. Field 7 should be left blank. The file TAGSET.md describes the MWE and supersense tagsets.
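
As an informal illustration of this format (not one of the release scripts), the Python sketch below reads such a file and yields one sentence at a time. The field names are just convenient labels for the nine columns listed above, and the commented file name is only an example.

    from collections import namedtuple

    # Illustrative names for the nine tab-separated columns described above.
    Token = namedtuple("Token", [
        "offset", "word", "lemma", "pos",
        "mwe_tag", "parent_offset", "strength", "supersense", "sent_id",
    ])

    def read_sentences(path):
        """Yield each sentence as a list of Token rows (sentences are blank-line separated)."""
        with open(path, encoding="utf-8") as f:
            sentence = []
            for line in f:
                line = line.rstrip("\n")
                if not line:               # a blank line ends the current sentence
                    if sentence:
                        yield sentence
                    sentence = []
                    continue
                sentence.append(Token(*line.split("\t")))
            if sentence:                   # in case the file lacks a trailing blank line
                yield sentence

    # Example: count tokens that belong to a multiword expression in the training file.
    # n_mwe = sum(t.mwe_tag != "O" for sent in read_sentences("dimsum16.train") for t in sent)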

All sentences in the training data are marked with an identifier whose prefix indicates the source dataset (field 9). In the test data, this field will contain a unique ID for each sentence, but the ID will be uninformative: it will not reveal the domain or document position of the sentence.
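
For instance, one could tally training sentences by source with a standalone snippet like the one below. It assumes the sentence-ID prefix is the part before the first "." (a hypothetical convention used here only for illustration), and that the file name is as in this release.

    from collections import Counter

    def sentences_per_source(path):
        """Count unique sentence IDs (column 9) grouped by their prefix."""
        counts, seen = Counter(), set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                sent_id = line.split("\t")[8]
                if sent_id not in seen:
                    seen.add(sent_id)
                    counts[sent_id.split(".")[0]] += 1   # assumed prefix scheme
        return counts

    # print(sentences_per_source("dimsum16.train"))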