ZarahShibli/Arabic_Punctuation_Prediction

Arabic Punctuation Prediction


  1. Project Overview
  2. Installation
  3. Data Exploration and Visualization
  4. Implementation
  5. Results and Metrics
  6. Acknowledgements

1. Project Overview

Punctuation marks are symbols used to organize written text and make it clear and easy to read. A lack of punctuation can cause confusion and misunderstanding for the reader. This problem often arises with Automatic Speech Recognition (ASR), since ASR systems usually do not predict punctuation. Punctuation prediction is a Natural Language Processing (NLP) problem. The aim of this project is to build a sequence-to-sequence model to predict punctuation in Arabic text.
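As a small illustration (a sketch, not code from this repository), the task can be framed as a sequence-to-sequence problem: the source sequence is the unpunctuated text and the target sequence keeps the punctuation as separate tokens. The helper name `make_pair` and the example sentence are hypothetical.

```python
# Hypothetical sketch of framing punctuation prediction as a
# sequence-to-sequence task: the source side drops the punctuation
# tokens, while the target side keeps them.

# The five target marks: Arabic comma, dot, Arabic semicolon,
# Arabic question mark, colon.
PUNCTUATION = {"،", ".", "؛", "؟", ":"}

def make_pair(punctuated_text):
    """Split a punctuated string into (source, target) token lists."""
    tokens = punctuated_text.split()
    source = [t for t in tokens if t not in PUNCTUATION]
    target = tokens  # the target side retains the punctuation tokens
    return source, target

source, target = make_pair("قال الأستاذ : الغرض معرفة الأدلة ،")
print(len(source), len(target))  # 5 7
```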

2. Installation

  • Python 3.x
  • Python libraries:
    • matplotlib
    • tensorflow
    • sklearn
    • pandas
    • keras
    • numpy
    • nltk
    • tqdm
  • Standard-library modules (no installation needed): string, time, re

3. Data Exploration and Visualization

We use the Tashkeela dataset in this project: it contains over 75 million Arabic words obtained from 97 books, and it includes different types of punctuation. We focus on five punctuation marks: the comma (،), the dot (.), the semicolon (؛), the question mark (؟), and the colon (:). Below is a sample text from the book Al-Bahr Al-Muhit.

وَقَالَ الْأُسْتَاذُ أَبُو مَنْصُورٍ : الْغَرَضُ مِنْ أُصُولِ الْفِقْهِ مَعْرِفَةُ أَدِلَّةِ أَحْكَامِ الْفِقْهِ ، وَمَعْرِفَةُ طُرُقِ الْأَدِلَّةِ ، لِأَنَّ مَنْ اسْتَقْرَأَ أَبْوَابَهُ وَجَدَهَا إمَّا دَلِيلًا عَلَى حُكْمٍ أَوْ طَرِيقًا يُتَوَصَّلُ بِهِ إلَى مَعْرِفَةِ الدَّلِيلِ ، وَذَلِكَ كَمَعْرِفَةِ النَّصِّ ، وَالْإِجْمَاعِ ، وَالْقِيَاسِ ، وَالْعِلَلِ ، وَالرُّجْحَانِ . وَهَذِهِ كُلُّهَا مَعْرِفَةٌ مُحِيطَةٌ بِالْأَدِلَّةِ الْمَنْصُوصَةِ عَلَى الْأَحْكَامِ . وَمَعْرِفَةُ الْأَخْبَارِ وَطُرُقِهَا مَعْرِفَةٌ بِالطُّرُقِ الْمُوَصِّلَةِ إلَى الدَّلَائِلِ الْمَنْصُوصَةِ عَلَى الْأَحْكَامِ .وَهَاهُنَا أُمُورٌ : أَحَدُهَا : أَنَّ الْأَسْمَاءَ الْمُسْتَعْمَلَةَ فِي هَذِهِ الْعُلُومِ . كَأُصُولِ الْفِقْهِ ، وَالْفِقْهِ ، وَالنَّحْوِ ، وَاللُّغَةِ ، وَالطِّبِّ . هَلْ هِيَ مَنْقُولَةٌ أَوْ لَا ؟
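A quick way to explore how the five marks are distributed in a text is to count them directly. The snippet below is an assumed exploration sketch, not the repository's code, applied to a shortened fragment of the sample above.

```python
# Small exploration sketch (assumed, not the repository's code) that
# counts the five target punctuation marks in a text sample.
from collections import Counter

# Arabic comma, dot, Arabic semicolon, Arabic question mark, colon.
MARKS = ("،", ".", "؛", "؟", ":")

def punctuation_counts(text):
    """Return a Counter of the target punctuation marks found in text."""
    return Counter(ch for ch in text if ch in MARKS)

sample = "وَقَالَ الْأُسْتَاذُ : الْغَرَضُ ، وَالْقِيَاسِ . هَلْ هِيَ مَنْقُولَةٌ ؟"
counts = punctuation_counts(sample)
print(counts["،"], counts["."], counts["؟"], counts[":"])  # 1 1 1 1
```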

4. Implementation

In this project, we used a sequence-to-sequence (seq2seq) model, the same technique used in the Neural Machine Translation (NMT) tutorial provided by TensorFlow. The input sequence passes through the encoder, which gives us the encoder output and the encoder hidden state. Bahdanau attention is then computed over the encoder outputs.
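The attention step can be sketched as follows. This is a minimal reimplementation in the style of the TensorFlow NMT tutorial, not the repository's exact code; the layer sizes and toy tensor shapes are illustrative.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau) attention over the encoder outputs."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.W2 = tf.keras.layers.Dense(units)  # projects the encoder outputs
        self.V = tf.keras.layers.Dense(1)       # scores each source position

    def call(self, query, values):
        # query: (batch, hidden) decoder state
        # values: (batch, src_len, hidden) encoder outputs
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
        attention_weights = tf.nn.softmax(score, axis=1)  # over source positions
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights

# Toy shapes only: batch of 2, source length 7, hidden size 16.
encoder_output = tf.random.normal((2, 7, 16))
decoder_hidden = tf.random.normal((2, 16))
context, weights = BahdanauAttention(8)(decoder_hidden, encoder_output)
print(context.shape, weights.shape)  # (2, 16) (2, 7, 1)
```

The context vector is then concatenated with the decoder input at each step, which is the usual design in the NMT tutorial.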

5. Results and Metrics

We trained our model for 10 epochs with a batch size of 128, using the categorical cross-entropy loss function and the Adam optimizer. The model was evaluated with the Bilingual Evaluation Understudy (BLEU) score. The figure below shows the prediction results for two examples.
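The BLEU step can be sketched with NLTK (which is listed in the installation requirements). The token sequences below are illustrative, not actual model outputs.

```python
# Hedged sketch of the BLEU evaluation step; the reference and
# prediction tokens are illustrative, not real model outputs.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["قال", "الأستاذ", "،", "الغرض", "معرفة", "الأدلة", "."]]
prediction = ["قال", "الأستاذ", "،", "الغرض", "معرفة", "الأدلة", "."]

# Smoothing avoids zero scores when some n-gram order has no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(reference, prediction, smoothing_function=smooth)
print(round(score, 3))  # identical sequences score 1.0
```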

6. Acknowledgements

I wish to thank the Tashkeela project for the dataset, and Udacity for their advice. For more details about this project, you can read this article.
