
de-nds-translation

Neural Machine Translation from German to the low-resource language Low German.

This repo contains the workflow for creating the first specialized German - Low German translator, översetter.de.
[Image: Översetter]
If you have suggestions for improvement or want to participate in another way, please don't hesitate to get in touch with me.


Quick Workflow Overview

  1. Gathering and preprocessing data: data_preprocessing.ipynb
  2. Data selection: self_learning_model.py
  3. Final model for translating: translation_model.py

[Image: workflow]

Minor parts of the project can't be published right now because the underlying data was shared with personal permission only. Still, you should be able to run everything with the uploaded content.


Low German

Low German is a language spoken in Northern Germany. The dominant language in the region until the middle of the 20th century, it has nowadays almost disappeared from daily usage. Although statistics count 1-2 million Low German speakers, which might be relatively high compared to other low-resource languages, 99.2% of the people under 20 in Northern Germany can't speak Low German. The language is dying rapidly and is therefore listed as an endangered language by UNESCO.

With the help of Neural Machine Translation, this project wants to support the community that is working every day to keep the language alive.

[Photo: Low German]

Photo by [Marian on Unsplash](https://unsplash.com/@minjax)

Low German as a low-resource language

As the language mainly survives among the oldest generations, online resources for Low German are limited as well. Moreover, Low German has its own vocabulary and grammar, which rules out a word-by-word translation from German to Low German. Another characteristic of Low German is its wide variety of spellings: each region has a slightly different Low German with its own dialect and its own orthography, so you might find several spellings for the same Low German word. The online dictionary by Peter Hansen gives a good overview of the different spellings.

Available data

If you have any Low German data which could be used for improving the translations, please let me know!

Besides that, I have found two datasets, Tatoeba and WikiMatrix, which fulfil two requirements that were crucial for getting started: the sentences are digitally available and aligned in German and Low German, like these:

| German | Low German |
| --- | --- |
| Er hat mich lange warten lassen. | He hett mi lang töven laten. |
| Sie wollen reich werden. | Se wüllt riek warrn. |
| Niemand hat diesen Satz gelöscht. | Nüms hett dissen Satz wegdaan. |
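
As a small illustration of how such pairs enter the pipeline, here is a loading sketch; the file name and the two-column layout are assumptions for illustration, not the exact code from data_preprocessing.ipynb:

```python
import pandas as pd

# Load aligned sentence pairs from a tab-separated file, one pair per line.
# "tatoeba_deu_nds.tsv" and the two-column layout are assumptions; the
# actual Tatoeba export may carry additional ID columns.
pairs = pd.read_csv(
    "tatoeba_deu_nds.tsv",
    sep="\t",
    names=["german", "low_german"],
    quoting=3,  # csv.QUOTE_NONE, since sentences may contain quote characters
)
print(pairs.head())
```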

In the notebook data_preprocessing.ipynb you can see how the datasets were preprocessed.

You can download the Tatoeba tsv files from the website; the data is provided under the CC BY 2.0 FR license. For WikiMatrix, Facebook Research mined aligned sentences across all languages of Wikipedia and published them at https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix. The corresponding paper is: Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia", arXiv, July 11, 2019. The data is provided under the Creative Commons Attribution-ShareAlike license.
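
For illustration, a WikiMatrix dump can be read roughly like this; the file name is an assumption for the German-Low German pair, and the margin threshold of 1.04 follows the recommendation in the LASER repository:

```python
import csv
import gzip

# Each WikiMatrix line holds: margin score \t sentence (language 1) \t sentence (language 2).
# "WikiMatrix.de-nds.tsv.gz" is an assumed file name for the German-Low German pair.
pairs = []
with gzip.open("WikiMatrix.de-nds.tsv.gz", "rt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 3:
            continue  # skip malformed lines
        score, german, low_german = float(row[0]), row[1], row[2]
        pairs.append((score, german, low_german))

# Keep only pairs above the margin score threshold suggested by LASER.
high_confidence = [p for p in pairs if p[0] >= 1.04]
print(f"{len(high_confidence)} of {len(pairs)} pairs above the threshold")
```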

As the sentences are written by volunteers from different regions, there is no common spelling across the sentences. With the personal permission of Peter Hansen (thank you very much), I scraped his online dictionary, which lists different possible spellings, and replaced each variant with his proposed spelling. With this method I could replace around 5% of the words and obtain a more uniform spelling. You will find an abstracted version of this in data_preprocessing.ipynb.
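
A minimal sketch of this normalization step, assuming the scraped dictionary has been stored as a mapping from variant to canonical spelling (the example entries below are invented):

```python
import re

# Invented example entries; the real mapping comes from the scraped dictionary.
SPELLING_MAP = {
    "tööven": "töven",
    "rik": "riek",
}

WORD_RE = re.compile(r"\w+")

def normalize_spelling(sentence: str) -> str:
    """Replace each known variant spelling with its canonical form."""
    def replace(match):
        word = match.group(0)
        return SPELLING_MAP.get(word.lower(), word)
    return WORD_RE.sub(replace, sentence)

print(normalize_spelling("He hett mi lang tööven laten."))
# -> He hett mi lang töven laten.
```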

Selecting the right sentences

The WikiMatrix dataset was built automatically, without quality control by a professional. It includes misalignments even for sentence pairs where Facebook Research was fairly confident about the alignment. The main challenge for this dataset is therefore to extract the right sentence pairs. In self_learning_model.py you will find an algorithm based on a Transformer Seq2Seq model which selects the best sentences from the dataset.

The basic neural network construction is by Ben Trevett and was adapted for our purpose.
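
For readers who don't know those tutorials, here is a minimal, self-contained sketch of such a Transformer Seq2Seq model in PyTorch; all sizes are placeholder assumptions, positional encodings are omitted for brevity, and this is not the exact code from translation_model.py:

```python
import torch
import torch.nn as nn

class TransformerSeq2Seq(nn.Module):
    """Minimal encoder-decoder Transformer; positional encodings omitted."""

    def __init__(self, src_vocab, trg_vocab, d_model=256, nhead=8,
                 num_layers=3, dim_ff=512, dropout=0.1):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.trg_embed = nn.Embedding(trg_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, dropout=dropout,
            batch_first=True,   # inputs are (batch, seq_len, d_model)
        )
        self.out = nn.Linear(d_model, trg_vocab)

    def forward(self, src_ids, trg_ids):
        # Causal mask so the decoder cannot look at future target tokens.
        trg_mask = self.transformer.generate_square_subsequent_mask(trg_ids.size(1))
        hidden = self.transformer(
            self.src_embed(src_ids),
            self.trg_embed(trg_ids),
            tgt_mask=trg_mask,
        )
        return self.out(hidden)  # (batch, trg_len, trg_vocab) logits

# Usage sketch with random token ids:
model = TransformerSeq2Seq(src_vocab=8000, trg_vocab=8000)
src = torch.randint(0, 8000, (2, 10))   # batch of 2 German sentences
trg = torch.randint(0, 8000, (2, 12))   # batch of 2 Low German sentences
print(model(src, trg).shape)            # torch.Size([2, 12, 8000])
```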

The main idea is to train a model on sentence pairs that we know are correct. After that, we take sentence pairs of unknown quality (the WikiMatrix dataset), translate the German sentence with our model into an artificial Low German translation, and calculate the error against the provided translation. Our model can already translate roughly, so sentence pairs with a low error probably share the same content, even if the artificial translation is not correct. Put the other way round: sentence pairs with a high error are significantly different from what the model has learned before. This can have two reasons: the sentence pair has the same content but at a higher difficulty level and our model is not yet good enough to translate it, OR we have a misalignment where the German and Low German sentences don't share the same content. Therefore we select the sentences with the lowest error and include them in our training set.
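
In code, the selection step could look roughly like this. This is a hedged sketch, not the exact algorithm in self_learning_model.py; `score_fn` is a hypothetical stand-in for evaluating the trained model's translation error (e.g. token-averaged cross-entropy) on a single pair, and `keep_ratio` is an illustrative parameter:

```python
def select_pairs(score_fn, candidate_pairs, keep_ratio=0.2):
    """Keep the fraction of candidate pairs the current model explains best.

    score_fn(german, low_german) -> float is a hypothetical helper that
    returns the model's translation error for one pair (lower = more plausible).
    """
    scored = sorted(
        ((score_fn(de, nds), de, nds) for de, nds in candidate_pairs),
        key=lambda item: item[0],  # lowest error first
    )
    cutoff = int(len(scored) * keep_ratio)
    return [(de, nds) for _, de, nds in scored[:cutoff]]

# Each self-learning round: select the easiest pairs, add them to the
# training set, retrain the model, then score the remaining candidates again.
```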

[Image: Self-selecting model]

With this approach I was able to beat a random-pick baseline model which saw the same data fragments but picked random sentences from these subsets; the self-selecting model achieved better translation results. You will find a more in-depth analysis in this Google Presentation, or you can simply contact me with further questions.

Future Work

To-dos for the future

Algorithm

  • Try a model with more layers
  • Use other pre-trained models, e.g. OpenNMT and XLM
  • Better word correction / automatic input correction

Dataset

  • More data

App

  • User interface
  • Better feedback function with login

Licenses

The code is licensed under the MIT License.

The modified data (corrected spelling) will be uploaded soon under the Creative Commons Attribution-ShareAlike license.

Supporters

Special thanks for consulting and general support to: niowniow, Sven Wildermann

Changelog

  • 14.04.2020: Start of the project
  • 10.05.2020: Prediction-Model 0.1 Release: Publishing first prototype online
  • 13.05.2020: Prediction-Model 0.2 Release: Upper- & lowercase prediction together with model improvements
  • 02.06.2020: Community feedback and correction function
  • 07.06.2020: Better infrastructure for web application
  • 07.06.2020: Prediction-Model 0.3 Release: additional monolingual training
  • 10.07.2020: Autocorrection of the High German sentence before translation
  • 10.07.2020: Prediction-Model 0.5 Release: Sentencepiece tokenizer & Transfer learning with German-English pretrained model
  • 28.07.2020: Prediction-Model 0.6 Release: Sentencepiece tokenizer, transfer learning, monolingual training and better translations for single words
