Shakkelha

This repository contains the models, dataset, helpers, and system comparisons for our paper on Arabic text diacritization:

"Neural Arabic Text Diacritization: State of the Art Results and a Novel Approach for Machine Translation", Ali Fadel, Ibraheem Tuffaha, Bara' Al-Jawarneh and Mahmoud Al-Ayyoub, EMNLP-IJCNLP 2019.

Files

  • predict.py - General script that can be used to predict using any model in this repository
  • sample_input - Sample input file
  • extra_train.zip - Contains the extra training dataset used to train the models

The repository also includes comparison folders; each contains the generated dataset, the system output, and the DER/WER statistics used to compare our system with each of the three other systems. The remaining files and folders are:

  • constants
    • ARABIC_LETTERS_LIST.pickle - Contains a list of Arabic letters
    • DIACRITICS_LIST.pickle - Contains a list of all diacritics
    • FFNN_CLASSES_MAPPING.pickle - Contains a dictionary mapping each class to its unique integer (FFNN)
    • FFNN_REV_CLASSES_MAPPING.pickle - Contains a dictionary mapping each integer to its unique class (FFNN)
    • FFNN_SMALL_CHARACTERS_MAPPING.pickle - Contains a dictionary mapping each character to its unique integer (without using the extra training dataset, for FFNN)
    • RNN_CLASSES_MAPPING.pickle - Contains a dictionary mapping each class to its unique integer (RNN)
    • RNN_REV_CLASSES_MAPPING.pickle - Contains a dictionary mapping each integer to its unique class (RNN)
    • RNN_SMALL_CHARACTERS_MAPPING.pickle - Contains a dictionary mapping each character to its unique integer (without using the extra training dataset, for RNN)
    • RNN_BIG_CHARACTERS_MAPPING.pickle - Contains a dictionary mapping each character to its unique integer (using the extra training dataset, for RNN)
  • avg_checkpoints.py - Creates weight-averaged models from the last few epoch checkpoints of the training phase (a conceptual sketch follows this file list)
  • build_confusion_matrix.py - Builds and plots a confusion matrix from the gold data and the predicted output
  • build_der_figure.py - Restores and plots the diacritic error rate (DER) progress during training for each model from the Keras training log files
  • plot_character_embeddings.py - Plots character embeddings extracted from any epoch checkpoint using the t-SNE technique
  • count_error_frequency.py - Counts the frequency of errors in each diacritized word
  • prepare_feed_forward_data.py - Prepares the training data for the FFNN models
  • restore_model_accuracy_and_loss.py - Restores and plots the accuracy and loss values for the FFNN models from the Keras training log files
  • optimizer.py - An implementation of the paper "Block-Normalized Gradient Method: An Empirical Study for Training Deep Neural Network", copied from an external implementation
  • ffnn_models - Contains all feed-forward neural network code, models, and statistics
    • 1_basic_model - Contains the basic FFNN model's training and prediction code, model weights, and DER/WER statistics
    • 2_100_hot_model - Contains the 100-hot FFNN model's training and prediction code, model weights, and DER/WER statistics
    • 3_embeddings_model - Contains the embeddings FFNN model's training and prediction code, model weights, and DER/WER statistics
  • rnn_models - Contains all recurrent neural network code, models, and statistics
    • 1_basic_model - Contains the basic RNN model's training code, model weights, averaged models, and DER/WER statistics. The model was trained both with and without the extra training dataset
    • 2_crf_model - Contains the CRF-RNN model's training code, model weights, averaged models, and DER/WER statistics. The model was trained both with and without the extra training dataset
    • 3_normalized_model - Contains the normalized RNN model's training code, model weights, averaged models, and DER/WER statistics. The model was trained both with and without the extra training dataset
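
To give a rough idea of what avg_checkpoints.py does, here is a minimal conceptual sketch of checkpoint weight averaging in Keras. The checkpoint file names and the number of averaged epochs are placeholders, not the script's actual interface:

# Conceptual sketch of checkpoint weight averaging (what avg_checkpoints.py
# does); the checkpoint paths below are hypothetical placeholders.
import numpy as np
from keras.models import load_model

checkpoint_paths = ['epoch_046.h5', 'epoch_047.h5', 'epoch_048.h5',
                    'epoch_049.h5', 'epoch_050.h5']  # last N epoch checkpoints
models = [load_model(path) for path in checkpoint_paths]

# Average each weight tensor element-wise across the loaded checkpoints.
averaged_weights = [np.mean(tensors, axis=0)
                    for tensors in zip(*[m.get_weights() for m in models])]

# Write the averaged weights back into one of the models and save it.
avg_model = models[0]
avg_model.set_weights(averaged_weights)
avg_model.save('averaged_model.h5')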

Usage

Prerequisites

  • Tested with Python 3.6.8
  • Install the required packages listed in the requirements.txt file
    • pip install -r requirements.txt

Predict

To diacritize text using any model provided in this repository, use the predict.py script. Example:

python predict.py --input-file-path sample_input \
                  --model-type rnn \
                  --model-number 3 \
                  --model-size small \
                  --model-average 20 \
                  --output-file-path sample_output

The previous command diacritizes the text inside the sample_input file using RNN model number 3, trained on the small dataset (i.e., without the extra training dataset), with the weights of the last 20 epoch checkpoints averaged, and writes the diacritized text to sample_output.

The allowed options are (a batch-prediction sketch follows the list):

  • --model-type: ffnn, rnn
  • --model-number:
    • ffnn: 1, 2, 3
    • rnn: 1, 2, 3
  • --model-size: small, big
  • --model-average:
    • rnn: 1, 5, 10, 20
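
For instance, to run the same input through every RNN model, the documented options can be looped over from Python. This wrapper is only a sketch; it assumes predict.py is invoked from the repository root, and the output file names are made up here:

# Sketch: run predict.py over several documented option combinations.
# Only the flags documented above are used.
import subprocess

for model_number in [1, 2, 3]:
    subprocess.run([
        'python', 'predict.py',
        '--input-file-path', 'sample_input',
        '--model-type', 'rnn',
        '--model-number', str(model_number),
        '--model-size', 'small',
        '--model-average', '20',
        '--output-file-path', 'sample_output_rnn_%d' % model_number,
    ], check=True)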

Train FFNN Model

Before training any FFNN model, you need to prepare the dataset using the prepare_feed_forward_data.py script. After that, you can train any FFNN model using the model.ipynb notebooks that exist under models/ffnn_models/*/.

Train RNN Model

There is no need to prepare any data to train the RNN models; to train any RNN model, use the model.ipynb notebooks that exist under models/rnn_models/*/.

Note that the RNN models use CuDNNLSTM layers, which must run on a GPU. To train the models or predict with them on a CPU, you can use regular LSTM layers instead. Moreover, all RNN model checkpoints under models/rnn_models/*/*/ use CuDNNLSTM layers, so those checkpoints should be loaded on a GPU; however, under models/rnn_models/*/*/lstm/ you can find the same checkpoints, with the same weights and structure, built with regular LSTM layers instead of CuDNNLSTM layers.
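
As a rough illustration of that swap, the recurrent layer can be declared in a CuDNN-compatible way so that the same weights load on either device. This is a generic Keras sketch, not the exact configuration from the notebooks; the function name, the units value, and the use of Bidirectional are assumptions:

# Sketch: choose between CuDNNLSTM (GPU) and a weight-compatible LSTM (CPU).
# With activation='tanh' and recurrent_activation='sigmoid', Keras can load
# CuDNNLSTM-trained weights into a plain LSTM layer. units=256 is a placeholder.
from keras.layers import Bidirectional, CuDNNLSTM, LSTM

def recurrent_layer(units=256, on_gpu=False):
    if on_gpu:
        rnn = CuDNNLSTM(units, return_sequences=True)
    else:
        rnn = LSTM(units, activation='tanh',
                   recurrent_activation='sigmoid', return_sequences=True)
    return Bidirectional(rnn)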

(Figure: basic RNN model structure)

Results

All results reported below are on the (Fadel et al., 2019) test set; the best results in each table come from the Embeddings Model (FFNN) and the Normalized Model (RNN).
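
As a reminder of the metrics: DER (diacritic error rate) is the percentage of characters assigned an incorrect diacritic, and WER (word error rate) is the percentage of words containing at least one such character. The following simplified sketch, which is not the repository's evaluation code, assumes the gold and predicted diacritic labels are already aligned per character:

# Simplified DER/WER sketch (not the official evaluation script).
# gold/pred: lists of words; each word is a list of per-character
# diacritic labels, already aligned between gold and prediction.
def der_wer(gold, pred):
    char_total = char_errors = word_errors = 0
    for gold_word, pred_word in zip(gold, pred):
        errors = sum(g != p for g, p in zip(gold_word, pred_word))
        char_total += len(gold_word)
        char_errors += errors
        word_errors += (errors > 0)
    return 100.0 * char_errors / char_total, 100.0 * word_errors / len(gold)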

FFNN Results

There are three feed-forward neural network models; the following table shows the results of each:

DER/WER          | Including no diacritic                 | Excluding no diacritic
                 | With case ending | Without case ending | With case ending | Without case ending
Basic Model      | 9.33%/25.93%     | 6.58%/13.89%        | 10.85%/25.39%    | 7.51%/13.53%
100-Hot Model    | 6.57%/20.21%     | 4.83%/11.14%        | 7.75%/19.83%     | 5.62%/10.93%
Embeddings Model | 5.52%/17.12%     | 4.06%/9.38%         | 6.44%/16.63%     | 4.67%/9.10%

DER/WER statistics for FFNN models

RNN Results

There are three recurrent neural network models; each was trained twice, with and without the extra training dataset. The following tables show the results of each:

DER/WER          | Including no diacritic                 | Excluding no diacritic
                 | With case ending | Without case ending | With case ending | Without case ending
Basic Model      | 2.68%/7.91%      | 2.19%/4.79%         | 3.09%/7.61%      | 2.51%/4.66%
CRF Model        | 2.67%/7.73%      | 2.19%/4.69%         | 3.08%/7.46%      | 2.52%/4.60%
Normalized Model | 2.60%/7.69%      | 2.11%/4.57%         | 3.00%/7.39%      | 2.42%/4.44%

DER/WER statistics for RNN models without training on extra training dataset

DER/WER          | Including no diacritic                 | Excluding no diacritic
                 | With case ending | Without case ending | With case ending | Without case ending
Basic Model      | 1.72%/5.16%      | 1.37%/2.98%         | 1.99%/4.96%      | 1.59%/2.92%
CRF Model        | 1.84%/5.42%      | 1.47%/3.17%         | 2.13%/5.22%      | 1.69%/3.09%
Normalized Model | 1.69%/5.09%      | 1.34%/2.91%         | 1.95%/4.89%      | 1.54%/2.83%

DER/WER statistics for RNN models with training on extra training dataset

The following figure shows the validation DER of each model during training, reported every 5 epochs.

(Figure: RNN models' validation DER during training)

Note: All code in this repository was tested on Ubuntu 18.04.

Contributors

  1. Ali Hamdi Ali Fadel.
  2. Ibraheem Tuffaha.
  3. Bara' Al-Jawarneh.
  4. Mahmoud Al-Ayyoub.

License

The project is available as open source under the terms of the MIT License.