jonsafari/multiway-corpus

Multiway Corpus

This builds an n-way multilingual corpus from the data in the awesome Tatoeba dataset. It lets you do pivot-free zero-shot machine translation and work with unusual language combinations.

Usage is:

python3 intersect_tatoeba.py Spanish jpn English

The arguments are the languages that you want to intersect, given either as ISO 639-3 names (e.g. English) or codes (e.g. eng). Names and codes can be mixed, as in the example above. The output in this example will be corpus.jpn, corpus.spa, and corpus.eng.
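To make "intersect" concrete, here is a minimal in-memory sketch of one way an n-way intersection could work. It is not the code in intersect_tatoeba.py: the function name `intersect`, the data layout, and the choice to anchor on the first language and keep only the first translation found per language are all assumptions made for illustration.

```python
from collections import defaultdict


def intersect(sentences, links, langs):
    """Return tuples of texts, one text per language, in the order of `langs`.

    sentences: dict mapping sentence id -> (lang_code, text)   (assumed layout)
    links:     iterable of (sentence_id, translation_id) pairs (assumed layout)
    langs:     list of language codes, e.g. ["eng", "spa", "jpn"]
    """
    wanted = set(langs)
    anchor, others = langs[0], langs[1:]

    # neighbours[sid][lang] -> ids of sentences in `lang` linked to `sid`.
    neighbours = defaultdict(lambda: defaultdict(list))
    for a, b in links:
        if a not in sentences or b not in sentences:
            continue
        # Treat every link as undirected, in case only one direction is listed.
        for src, dst in ((a, b), (b, a)):
            dst_lang = sentences[dst][0]
            if dst_lang in wanted:
                neighbours[src][dst_lang].append(dst)

    rows = []
    for sid, (lang, text) in sorted(sentences.items()):
        if lang != anchor:
            continue
        by_lang = neighbours.get(sid, {})
        # Keep the sentence only if it has a translation in every other language.
        if all(other in by_lang for other in others):
            rows.append(
                (text,) + tuple(sentences[by_lang[o][0]][1] for o in others)
            )
    return rows


# Toy data, invented for this example.
toy_sentences = {
    1: ("eng", "Hello."),
    2: ("spa", "Hola."),
    3: ("jpn", "こんにちは。"),
    4: ("eng", "Goodbye."),
    5: ("spa", "Adiós."),
}
toy_links = [(1, 2), (2, 1), (1, 3), (3, 1), (4, 5), (5, 4)]
toy_rows = intersect(toy_sentences, toy_links, ["eng", "spa", "jpn"])
```

In the toy data, "Goodbye." has no Japanese translation, so only the "Hello." row survives. Writing column i of each row to a file named corpus.<code> would give one line-aligned file per language.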

Before running the script, download two files into this directory. They are constantly being updated upstream, so fetch fresh copies:

wget -c http://downloads.tatoeba.org/exports/sentences.tar.bz2  &&  tar jxvf sentences.tar.bz2
wget -c http://downloads.tatoeba.org/exports/links.tar.bz2      &&  tar jxvf links.tar.bz2

Then run the script. Enjoy!
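If you want to inspect the downloaded data yourself, the sketch below parses the two exports. It assumes the archives unpack to sentences.csv and links.csv, that each line of sentences.csv is tab-separated as id, language code, text, and that each line of links.csv holds two sentence ids. Check these assumptions against your download; the helper names `parse_sentences` and `parse_links` are hypothetical and not part of this repository.

```python
def parse_sentences(lines):
    """Map sentence id -> (lang_code, text) from tab-separated lines.

    Assumed line layout: "<id>\t<lang>\t<text>". Malformed lines are skipped.
    """
    sentences = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        sid, lang, text = parts
        sentences[int(sid)] = (lang, text)
    return sentences


def parse_links(lines):
    """Return (sentence_id, translation_id) pairs from whitespace-separated lines.

    Assumed line layout: "<id>\t<id>". Malformed lines are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            pairs.append((int(parts[0]), int(parts[1])))
    return pairs


# Small inline sample, invented for this example.
sample_sentences = parse_sentences([
    "1\teng\tHello.\n",
    "2\tspa\tHola.\n",
    "not a valid line\n",
])
sample_links = parse_links(["1\t2\n", "2\t1\n", "\n"])

# With the real files (file names are an assumption):
# with open("sentences.csv", encoding="utf-8") as f:
#     sentences = parse_sentences(f)
# with open("links.csv", encoding="utf-8") as f:
#     links = parse_links(f)
```

Both helpers accept any iterable of lines, so the same code works on an open file or on a small list for testing.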

Here are some languages in the upstream dataset:

| Language | ISO 639-3 Code | Sentences |
|----------|----------------|-----------|
| English | eng | 641421 |
| Esperanto | epo | 511221 |
| Turkish | tur | 503109 |
| Russian | rus | 479397 |
| Italian | ita | 474880 |
| German | deu | 366934 |
| French | fra | 315677 |
| Spanish | spa | 265058 |
| Portuguese | por | 231807 |
| Hungarian | hun | 191328 |
| Japanese | jpn | 184296 |
| Hebrew | heb | 153655 |
| Berber | ber | 104842 |

(Hundreds more languages)