Tatoeba Challenge Data - Monolingual data sets

This is part of the Tatoeba Translation Challenge data set. The following monolingual data sets are extracted from CirrusSearch Wikimedia dumps, including:

  • Wikipedia
  • Wikibooks
  • Wikinews
  • Wikiquote
  • Wikisource

All data sets are UTF-8 plain text with one sentence per line. Empty lines mark document boundaries.
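As an illustration of this layout, the following minimal sketch groups sentence lines into documents by splitting on empty lines. The function name `iter_documents` and the sample text are hypothetical and not part of the released tools; the sketch only assumes the format described above.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents, splitting on empty lines."""
    doc: List[str] = []
    for line in lines:
        sentence = line.rstrip("\n")
        if sentence.strip():
            # A non-empty line is one sentence of the current document.
            doc.append(sentence)
        elif doc:
            # An empty line closes the current document.
            yield doc
            doc = []
    if doc:
        # The last document may not be followed by an empty line.
        yield doc


# Hypothetical sample in the described format: two documents.
sample = "First sentence.\nSecond sentence.\n\nAnother document.\n"
docs = list(iter_documents(sample.splitlines(keepends=True)))
```

For a real file, the same function could be fed an open file handle, for example `open(path, encoding="utf-8")`, so that documents are streamed rather than loaded at once.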

The packages below use the same division into languages and macro-languages as defined in the Tatoeba translation challenge. Language ID files with script information are also added to each data source, in the same way as for the bilingual data sets.

There are also packages with the original Wikipedia languages (converted to ISO-639-3), which you can download either in a deduplicated and shuffled version or with document boundaries from this page.

Simple pre-processing, such as Unicode character normalisation and filtering based on language identification, has been applied to reduce noise. The extraction scripts are part of OPUS-MT.
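The general shape of such a pre-processing step can be sketched as follows. This is not the OPUS-MT implementation: the normalisation form (NFC), the function name `clean_sentences`, and the toy `looks_latin` predicate that stands in for a real language identifier are all assumptions made for illustration.

```python
import unicodedata
from typing import Callable, Iterable, Iterator


def clean_sentences(
    sentences: Iterable[str],
    accept: Callable[[str], bool],
    form: str = "NFC",  # assumed form; the one actually used is not stated here
) -> Iterator[str]:
    """Normalise each sentence and keep only those the predicate accepts."""
    for sentence in sentences:
        normalized = unicodedata.normalize(form, sentence)
        if accept(normalized):
            yield normalized


def looks_latin(sentence: str) -> bool:
    """Toy stand-in for language identification: any Latin-script character."""
    return any("LATIN" in unicodedata.name(ch, "") for ch in sentence)


# "Cafe" followed by a combining acute accent, and a Cyrillic sentence.
raw = ["Cafe\u0301 ouvert.", "Привет мир."]
cleaned = list(clean_sentences(raw, looks_latin))
```

Here the first sentence is composed into its precomposed form and kept, while the second is dropped by the toy predicate. A real pipeline would replace `looks_latin` with a proper language identifier.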