Tatoeba Challenge Data - Monolingual data sets

This is part of the Tatoeba Translation Challenge data set. The following monolingual data sets are extracted from CirrusSearch Wikimedia dumps, including:

Wikipedia
Wikibooks
Wikinews
Wikiquote
Wikisource

All data sets are UTF-8 plain text with one sentence per line. Document boundaries are marked by empty lines.
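The format described above (one sentence per line, empty lines between documents) can be read with a small generator. The following is a minimal sketch, not part of the released tooling; the function name `iter_documents` and the sample text are made up for illustration.

```python
from typing import Iterable, Iterator, List


def iter_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group sentence lines into documents, splitting on empty lines."""
    doc: List[str] = []
    for line in lines:
        sentence = line.rstrip("\n")
        if sentence.strip():
            # Non-empty line: one sentence of the current document.
            doc.append(sentence)
        elif doc:
            # Empty line: document boundary, emit what we have collected.
            yield doc
            doc = []
    if doc:
        # Last document may not be followed by an empty line.
        yield doc


# Illustrative sample only (not taken from the data sets).
sample = "First sentence.\nSecond sentence.\n\nAnother document.\n"
docs = list(iter_documents(sample.splitlines(keepends=True)))
print(docs)
```

For a real file, the same function can be fed an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to an uncompressed text file from one of the packages.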

The packages below use the same division into languages and macro-languages as defined in the Tatoeba translation challenge. Language ID files with script information are also added to each data source, in the same way as for the bilingual data sets.

There are also packages with the original Wikipedia languages (converted to ISO-639-3), which you can download either in a deduplicated and shuffled version or with document boundaries from this page.

Simple pre-processing, such as Unicode character normalisation and language-identification-based filtering, has been applied to reduce noise. The extraction scripts are part of OPUS-MT.
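As a rough illustration of what Unicode character normalisation does, the sketch below uses Python's standard `unicodedata` module. The normalisation form actually used by the extraction scripts is not stated here; NFC is an assumption, and `normalise_line` is a hypothetical helper name.

```python
import unicodedata


def normalise_line(line: str, form: str = "NFC") -> str:
    """Apply Unicode normalisation to one line (NFC is an assumed default)."""
    return unicodedata.normalize(form, line)


# 'e' followed by a combining acute accent: two code points.
decomposed = "e\u0301"
# After NFC normalisation the pair is composed into a single code point.
composed = normalise_line(decomposed)
print(len(decomposed), len(composed))
```

Normalising in this way makes visually identical strings compare equal, which helps with steps such as deduplication.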

aar
abk
ace
ady
afr
aka
amh
ang
ara
arc
arg
asm
ast
atj
ava
awa
aym
aze
bak
bam
ban
bar
bel
ben
bih
bik
bis
bod
bpy
bre
bua
bug
bul
cat
ceb
ces
cha
che
chm
cho
chr
chu
chv
chy
cor
cos
cre
crh
csb
cym
dan
deu
din
div
dsb
dzo
ell
eml
eng
epo
est
eus
ewe
ext
fao
fas
fij
fin
fra
frp
frr
fry
ful
fur
gag
gcr
gla
gle
glg
glk
glv
gor
got
grn
guj
hat
hau
haw
hbs
heb
her
hif
hin
hmo
hsb
hun
hye
hyw
ibo
ido
iii
iku
ile
ilo
ina
inh
ipk
isl
ita
jam
jav
jbo
jpn
kaa
kab
kal
kan
kas
kat
kau
kaz
kbd
kbp
khm
kik
kin
kir
kok
kom
kon
kor
krc
ksh
kua
kur
lad
lah
lao
lat
lav
lbe
lez
lfn
lij
lim
lin
lit
lmo
lrc
ltz
lug
mah
mai
mal
mar
mdf
mkd
mlg
mlt
mnw
mon
mri
msa
mus
mwl
mya
myv
mzn
nah
nap
nau
nav
ndo
nds
nep
new
nld
nor
nov
nqo
nrm
nso
nya
oci
olo
ori
orm
oss
pag
pam
pan
pap
pcd
pdc
pfl
pih
pli
pms
pnt
pol
por
pus
que
roh
rom
ron
rue
run
rus
sag
sah
san
sat
scn
sco
shn
sin
slk
slv
sme
smo
sna
snd
som
sot
spa
sqi
srd
srn
ssw
stq
sun
swa
swe
szl
szy
tah
tam
tat
tcy
tel
ten
tet
tgk
tgl
tha
tir
ton
tpi
tsn
tso
tuk
tum
tur
tyv
udm
uig
ukr
urd
uzb
vec
ven
vep
vie
vls
vol
war
wln
wol
xal
xho
xmf
yid
yor
zea
zha
zho
zul
zza
