GitHub - senisioi/enntt-release: This is a monolingual English corpus of native, non-native and (human) translated texts extracted from the European Parliament.

Europarl corpus of native, non-native and translated texts - ENNTT

Please check the release section for the latest version, also available at the Center for Computational Linguistics
A complete description of this resource is available here: A Corpus of Native, Non-native and Translated Texts, LREC, 2016, PDF
For the raw corpus, please check the dataset available here
For the experiments presented in the ACL 2016 paper, please check the dataset available here
For the experiments presented in the LREC 2016 paper, please check the dataset available here

Short description:

This is a monolingual English corpus of native, non-native and (human) translated texts extracted from the European Parliament. The translated texts from different source languages represent a subset of the Haifa Corpus of Translationese. We preserved the same annotation style and included an ID and the EU state that each member of the European Parliament represents.
We hope this dataset will facilitate a unified comparative study of translations and language produced by highly fluent non-native speakers, two closely-related phenomena that have only been studied in isolation so far.
For updates, please check the official repository

If you use this work in your research, please cite:

@InProceedings{enntt-corpus,
  author = {Sergiu Nisioi and Ella Rabinovich and Liviu P. Dinu and Shuly Wintner},
  title = {A Corpus of Native, Non-native and Translated Texts},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portoro\v{z}, Slovenia},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-9-1},
  language = {english}
 }

File description:

*.tok files contain the actual text, either uttered in English by native and non-native speakers or translated into English from other languages.
*.dat files contain the annotations corresponding to each line in the *.tok files.
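
Because each line of a *.dat file annotates the line with the same index in the matching *.tok file, the two files can be read in lockstep. The following is a minimal sketch, assuming UTF-8 encoding; the helper name `read_pairs` and the file names in the usage note are hypothetical and not part of the release.

```python
from itertools import zip_longest


def read_pairs(tok_path, dat_path, encoding="utf-8"):
    """Yield (sentence, annotation) pairs from aligned *.tok and *.dat files.

    Assumes line i of the *.dat file annotates line i of the *.tok file.
    The encoding is an assumption; adjust it if the files use another one.
    """
    with open(tok_path, encoding=encoding) as tok, \
         open(dat_path, encoding=encoding) as dat:
        pairs = zip_longest(tok, dat)
        for line_no, (sentence, annotation) in enumerate(pairs, start=1):
            # zip_longest pads the shorter file with None, which exposes
            # a misaligned pair of files instead of silently truncating.
            if sentence is None or annotation is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield sentence.rstrip("\n"), annotation.rstrip("\n")
```

Usage might look like `for sentence, annotation in read_pairs("natives.tok", "natives.dat"): ...`, where both file names are placeholders. Raising on a length mismatch is a design choice that guards against pairing a sentence with the wrong annotation.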

Description of annotations:

NAME - speaker's name as it appears in the written session
LANGUAGE - original language in which the sentence was uttered
SESSION_ID - the name of the corresponding protocol source file
SEQ_SPEAKER_ID - sequential number of the speaker within a session

Sentences uttered in English are annotated with additional information:

STATE - the EU state represented by the MEP
MEPID - the ID used by the Europarl website to display the MEPs' online images
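
This README does not spell out how an annotation line is serialized. The sketch below assumes each *.dat line is an XML-like element whose fields are written as KEY="value" attributes. That format, the made-up sample line, and the helper names `parse_annotation` and `has_english_fields` are assumptions for illustration only and should be checked against the released files.

```python
import re

# Matches KEY="value" pairs; assumes values contain no double quotes.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_annotation(line):
    """Return the annotation fields of one *.dat line as a dict."""
    return dict(ATTR_RE.findall(line))


def has_english_fields(annotation):
    """True if the extra fields for sentences uttered in English are present."""
    return "STATE" in annotation and "MEPID" in annotation


# Hypothetical sample line with invented values, not taken from the corpus.
sample = ('<LINE NAME="Doe, Jane" LANGUAGE="EN" SESSION_ID="session-0" '
          'SEQ_SPEAKER_ID="1" STATE="Ireland" MEPID="0"/>')
annotation = parse_annotation(sample)
print(annotation["LANGUAGE"], has_english_fields(annotation))
```

If the real files use a different layout (for example tab-separated columns), only `parse_annotation` would need to change; the field names come from the lists above.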

For more details about this particular dataset, contact sergiu.nisioi at gmail dot com or ellarabi at csweb dot haifa dot ac dot il
