thomas-chauvet/names_transliteration

Names transliteration

This repository contains:

  • a dataset (and the code to build it) pairing names written in Arabic characters with their Latin-character (English) equivalents,
  • a Google Colab notebook to train a Neural Machine Translation (NMT) model based on seq2seq. The model transliterates names from the Arabic alphabet to the Latin alphabet, a task also called romanization.

The model is trained on Google Colab, which provides a free GPU.

The model is based on the TensorFlow tutorial "NMT with attention".
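
A seq2seq model consumes integer sequences, so each name has to be turned into token ids first. The sketch below assumes a character-level setup, which is a natural fit for transliteration but is not confirmed by this README. The helper names (`build_vocab`, `encode`) and the special tokens are hypothetical and do not come from the repository.

```python
# Illustrative sketch only: these names and tokens are assumptions,
# not the repository's actual preprocessing code.
START, END, PAD = "<s>", "</s>", "<pad>"

def build_vocab(names):
    """Map every character seen in `names` to an integer id (0 is padding)."""
    chars = sorted({ch for name in names for ch in name})
    return {tok: i for i, tok in enumerate([PAD, START, END] + chars)}

def encode(name, vocab, max_len):
    """Wrap a name in start/end markers, convert to ids, and right-pad."""
    ids = [vocab[START]] + [vocab[ch] for ch in name] + [vocab[END]]
    if len(ids) > max_len:
        raise ValueError("name longer than max_len")
    return ids + [vocab[PAD]] * (max_len - len(ids))

# Toy (Arabic, Latin) pairs, made up for the example.
pairs = [("محمد", "mohammed"), ("علي", "ali")]
src_vocab = build_vocab(src for src, _ in pairs)
tgt_vocab = build_vocab(tgt for _, tgt in pairs)

encoded_target = encode("ali", tgt_vocab, max_len=10)
```

Source and target get separate vocabularies because the two scripts share no characters. The start and end markers tell the decoder where a name begins and when to stop generating.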
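
To give an intuition for what "attention" means here, the following is a simplified dot-product attention step in plain Python. It is a minimal sketch, not the repository's model: the TensorFlow tutorial uses learned layers, and the `attend` function and toy vectors below are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Score each encoder state against the decoder query (dot product),
    normalise the scores, and return the weights and the weighted sum."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context

# Toy example: two encoder states (one per input character), one decoder query.
weights, context = attend([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
```

At each decoding step the weights indicate which input characters the decoder focuses on when it emits the next Latin character.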

Data

We use 3 source datasets. Combined, they give a clean dataset of names in Arabic characters paired with the corresponding names in the Latin alphabet (English).
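
Merging several sources into one clean dataset typically involves normalising whitespace and case, filtering out malformed rows, and removing duplicates. The sketch below shows one way this could look. The `clean_pairs` function, the filtering rules, and the sample rows are assumptions made for illustration, not the repository's actual pipeline.

```python
import re

# Assumption: a source name uses only the basic Arabic Unicode block plus spaces,
# and a target name uses only lowercase ASCII letters, spaces, apostrophes, hyphens.
ARABIC_RE = re.compile(r"[\u0600-\u06FF ]+")
LATIN_RE = re.compile(r"[a-z '\-]+")

def clean_pairs(raw_pairs):
    """Normalise, filter, and deduplicate (arabic, latin) name pairs."""
    seen = set()
    cleaned = []
    for arabic, latin in raw_pairs:
        arabic = " ".join(arabic.split())         # collapse stray whitespace
        latin = " ".join(latin.lower().split())   # lowercase and collapse whitespace
        if not ARABIC_RE.fullmatch(arabic) or not LATIN_RE.fullmatch(latin):
            continue                              # drop rows in the wrong script
        if (arabic, latin) in seen:
            continue                              # drop exact duplicates
        seen.add((arabic, latin))
        cleaned.append((arabic, latin))
    return cleaned

# Hypothetical rows, as if concatenated from several sources.
raw = [
    ("محمد", "Mohammed"),
    (" محمد ", "mohammed"),   # duplicate after normalisation
    ("Ali", "ali"),           # source is not in Arabic script
    ("علي", "Ali"),
    ("سارة", "S4rah"),        # target contains a digit
]
cleaned = clean_pairs(raw)
```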

Pre-trained models

A pre-trained model (Arabic to Latin characters) is stored on Dropbox.

Colab notebook

A Jupyter notebook is provided to train the transliteration model.

Web application - Streamlit

A Streamlit app is provided. You can find a deployed version here.

Library

Install the library:

python setup.py install

CLI

  • get-data: Fetch data from the 3 sources and build the training dataset.
  • get-pretrained-model: Download the pre-trained model for the task.
  • train-nmt-model: Train an NMT model.
  • transliterate-name: Transliterate a name from Arabic to Latin characters.

Python environment

The conda environment is defined in the environment.yml file.

To create the environment with conda:

conda env create -f environment.yml