Multiway Corpus

This script builds an n-way multilingual parallel corpus from the awesome Tatoeba dataset. An n-way corpus lets you do pivot-free zero-shot machine translation, and gives you access to unusual language combinations.

Usage is:

python3 intersect_tatoeba.py Spanish jpn English

The arguments are the languages that you want to intersect, given either as ISO 639-3 names (e.g. English) or codes (e.g. eng). The output in this example will be corpus.jpn, corpus.spa, and corpus.eng.
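To make the name-or-code convention concrete, here is a minimal sketch of how arguments could be mapped to codes and output file names. It is not taken from intersect_tatoeba.py: the NAME_TO_CODE table, the to_code and output_names helpers, and the case-insensitive matching are all assumptions made for illustration, and the table holds only a few languages.

```python
# Hypothetical lookup table (an assumption, not the script's own data).
# A real table would need every language in the Tatoeba export.
NAME_TO_CODE = {
    "english": "eng",
    "spanish": "spa",
    "japanese": "jpn",
    "german": "deu",
    "french": "fra",
}


def to_code(arg):
    """Return the ISO 639-3 code for a language name or code."""
    key = arg.strip().lower()  # case-insensitive matching is an assumption
    if key in NAME_TO_CODE:
        return NAME_TO_CODE[key]
    if key in NAME_TO_CODE.values():
        return key
    raise ValueError(f"unknown language: {arg!r}")


def output_names(args, prefix="corpus"):
    """Return one output file name per requested language, in argument order."""
    return [f"{prefix}.{to_code(a)}" for a in args]


print(output_names(["Spanish", "jpn", "English"]))
# ['corpus.spa', 'corpus.jpn', 'corpus.eng']
```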

Before running the script, download two files into this directory. They are not bundled here because they are constantly being updated upstream:

wget -c http://downloads.tatoeba.org/exports/sentences.tar.bz2  &&  tar jxvf sentences.tar.bz2
wget -c http://downloads.tatoeba.org/exports/links.tar.bz2      &&  tar jxvf links.tar.bz2

Then run the script. Enjoy!
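For readers curious about what an n-way intersection involves, here is a rough sketch under stated assumptions. It is not the code in intersect_tatoeba.py. It assumes the extracted files are tab-separated, with sentences as id, language code, text and links as pairs of sentence ids. The function names parse_sentences, parse_links, and intersect are hypothetical. It also takes a simple approach that only follows direct links from sentences in the first language, so the result can depend on argument order.

```python
from collections import defaultdict


def parse_sentences(lines, langs):
    """Map sentence id -> (lang, text), keeping only the wanted languages."""
    sentences = {}
    for line in lines:
        parts = line.rstrip("\r\n").split("\t", 2)
        if len(parts) == 3 and parts[1] in langs:
            sentences[int(parts[0])] = (parts[1], parts[2])
    return sentences


def parse_links(lines):
    """Yield (sentence id, translation id) pairs."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            yield int(parts[0]), int(parts[1])


def intersect(sentences, links, langs):
    """Return rows with one translation per language, in the order of langs."""
    # neighbours[id][lang] = first linked sentence id seen in that language
    neighbours = defaultdict(dict)
    for a, b in links:
        if a in sentences and b in sentences:
            neighbours[a].setdefault(sentences[b][0], b)
            neighbours[b].setdefault(sentences[a][0], a)

    first, rest = langs[0], langs[1:]
    rows = []
    for sid in sorted(sentences):
        lang, text = sentences[sid]
        if lang != first:
            continue
        linked = neighbours.get(sid, {})
        # Keep the sentence only if every other language has a direct link.
        if all(other in linked for other in rest):
            rows.append([text] + [sentences[linked[o]][1] for o in rest])
    return rows


# Tiny in-memory demo; real input would be the downloaded files.
demo_sentences = ["1\teng\tHello.\n", "2\tspa\tHola.\n", "3\tjpn\tこんにちは。\n"]
demo_links = ["1\t2\n", "2\t1\n", "1\t3\n", "3\t1\n"]
wanted = ["eng", "spa", "jpn"]
table = parse_sentences(demo_sentences, set(wanted))
print(intersect(table, parse_links(demo_links), wanted))
```

Both parsers accept any iterable of lines, so an open file handle works as well as a list. Writing column i of each row to the file for the i-th language would give line-aligned output files.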

Here are some languages in the upstream dataset:

Language	ISO 639-3 Code	Sentences
English	eng	641421
Esperanto	epo	511221
Turkish	tur	503109
Russian	rus	479397
Italian	ita	474880
German	deu	366934
French	fra	315677
Spanish	spa	265058
Portuguese	por	231807
Hungarian	hun	191328
Japanese	jpn	184296
Hebrew	heb	153655
Berber	ber	104842
(Hundreds more languages)
