Multilingual Parallel Corpus Generator

A Python application that generates a parallel corpus for any language pair, which can be used for training NMT (Neural Machine Translation) systems.

Manual

Using the method described in this article, we implemented a small working version of the framework to examine its results. The code is open source and free to use. It is written in Python and uses MongoDB as its database, so working with it requires a fair knowledge of Python.

To run the project, follow these steps:

  • Install Python on your system if you don't have it. More information about installing Python can be found here.

  • Install and run MongoDB. More information about installing MongoDB can be found here.

  • Clone the project from GitHub:

git clone https://github.com/farshadjafari/parallel_corpus_generator.git
  • Create a virtual environment for Python somewhere near the code:
virtualenv venv
source venv/bin/activate
  • Install the required Python packages with pip:
pip install -r requirements.txt

You can run all steps with default settings using the following command, but it may take hours to days to finish. I recommend running each step separately.

python main.py {source language code} {destination language code}

Replace the source and destination language codes with your desired two-letter (alpha-2) language codes. For example:

python main.py en fa

initializing videos

To store the movie dataset in the database, run the following command:

python movies/reader.py {source language code} {destination language code}
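
To give a feel for what this step does, here is a minimal sketch of storing movie metadata in MongoDB with pymongo, assuming a local mongod on the default port. The "corpus" database, "movies" collection and the record fields are hypothetical illustrations; the actual reader.py defines its own schema.

# Illustrative sketch only: storing one hypothetical movie record in MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
movies = client["corpus"]["movies"]  # hypothetical database/collection names

# One hypothetical record per movie; the real reader.py defines its own schema.
movies.insert_one({
    "title": "Example Movie",
    "year": 2015,
    "genres": ["Drama"],
    "languages": ["en", "fa"],  # subtitle languages we hope to find
})
print("movies stored:", movies.count_documents({}))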

filters

Edit the movies/filter.py file to set the desired movie filters. The number of videos remaining after filtering is shown by running:

python movies/filter.py
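
As an illustration, a movie filter could be expressed as a MongoDB query over the stored metadata. The sketch below reuses the hypothetical "corpus.movies" collection from the previous sketch; the concrete filter fields and thresholds are assumptions, not the project's defaults.

# Illustrative sketch only: counting videos that pass a hypothetical filter.
from pymongo import MongoClient

movies = MongoClient("mongodb://localhost:27017")["corpus"]["movies"]

# Hypothetical filter: reasonably recent dramas and comedies only.
movie_filter = {"year": {"$gte": 2000}, "genres": {"$in": ["Drama", "Comedy"]}}
print("videos after filtering:", movies.count_documents(movie_filter))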

crawl

To crawl and download synchronized subtitles for the desired filters and languages, run:

python downloader/main.py {source language code} {destination language code}
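
The heart of this step is fetching subtitle files. The sketch below shows only the download part, assuming each movie document carries a hypothetical "subtitle_url" field per language; the real downloader handles search, pagination and rate limiting itself.

# Illustrative sketch only: downloading one subtitle file from a stored URL.
import requests

def download_subtitle(url: str, out_path: str) -> None:
    # Fetch the file and save it to disk; raise on HTTP errors.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)

# Hypothetical usage: download_subtitle(movie["subtitle_url"]["fa"], "subs/example.fa.srt")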

store dialogues

To store synchronous dialogues in the database, run:

python subtitle/main.py {source language code} {destination language code}
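
Conceptually, this step turns each subtitle file into timed dialogues. Below is a minimal, self-contained SRT parser that produces (start, end, text) tuples with timestamps in seconds; the return format is an assumption for illustration and subtitle/main.py may store dialogues differently.

# Illustrative sketch only: parsing an SRT file into timed dialogues.
import re

TIME = r"(\d{2}):(\d{2}):(\d{2}),(\d{3})"
BLOCK = re.compile(TIME + r" --> " + TIME + r"\n(.*?)(?:\n\n|\Z)", re.S)

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        content = f.read()
    dialogues = []
    for match in BLOCK.finditer(content):
        start = to_seconds(*match.groups()[0:4])
        end = to_seconds(*match.groups()[4:8])
        text = " ".join(match.group(9).splitlines()).strip()
        dialogues.append((start, end, text))
    return dialogues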

match sentence pairs

To extract sentence pairs from the stored dialogues, run:

python matcher/main.py {source language code} {destination language code}
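
The basic idea of matching is to pair dialogues from the two subtitle tracks by time overlap. The sketch below shows one simple way to do that; the overlap threshold, data shapes and greedy best-match strategy are assumptions, and matcher/main.py may use a different alignment method.

# Illustrative sketch only: pairing dialogues from two tracks by time overlap.
def overlap(a, b):
    # Length of the time interval shared by dialogues a and b = (start, end, text).
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def match_pairs(src_dialogues, dst_dialogues, min_ratio=0.7):
    pairs = []
    for src in src_dialogues:
        # Pick the target dialogue with the largest temporal overlap.
        best = max(dst_dialogues, key=lambda dst: overlap(src, dst), default=None)
        if best is None:
            continue
        duration = max(src[1] - src[0], 1e-6)
        if overlap(src, best) / duration >= min_ratio:
            pairs.append((src[2], best[2]))
    return pairs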

generate corpus

Finally, after running all the commands above, generate the parallel corpus by running:

python generator/main.py {source language code} {destination language code}
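
For context, an NMT training corpus is usually a pair of plain-text files with one sentence per line, aligned by line number. The sketch below writes matched pairs in that format; the output file names are assumptions, not necessarily what the generator produces.

# Illustrative sketch only: writing aligned sentence pairs to two text files.
def write_corpus(pairs, src_path="corpus.en", dst_path="corpus.fa"):
    with open(src_path, "w", encoding="utf-8") as src_file, \
         open(dst_path, "w", encoding="utf-8") as dst_file:
        for src_sentence, dst_sentence in pairs:
            src_file.write(src_sentence.strip() + "\n")
            dst_file.write(dst_sentence.strip() + "\n")

# Hypothetical usage: write_corpus([("Hello.", "سلام.")])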
