ZarahShibli/Arabic_Punctuation_Prediction

Arabic Punctuation Prediction


  1. Project Overview
  2. Installation
  3. Data Exploration and Visualization
  4. Implementation
  5. Results and Metrics
  6. Acknowledgements

1. Project Overview

Punctuation marks are symbols used to organize written text and make it clear and easy to read. A lack of punctuation can cause confusion and misunderstanding for the reader. This problem often arises with Automatic Speech Recognition (ASR), since ASR systems usually do not predict punctuation. Punctuation prediction is a Natural Language Processing (NLP) problem. The aim of this project is to build a sequence-to-sequence model to predict punctuation in Arabic text.
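As a small illustration (a sketch, not code from this repository), the task can be framed as a sequence-to-sequence problem: the source sequence is the unpunctuated text and the target sequence keeps the punctuation as separate tokens. The helper name `make_pair` and the example sentence are hypothetical.

```python
# Hypothetical sketch of framing punctuation prediction as a
# sequence-to-sequence task: the source side drops the punctuation
# tokens, while the target side keeps them.

# The five target marks: Arabic comma, dot, Arabic semicolon,
# Arabic question mark, colon.
PUNCTUATION = {"،", ".", "؛", "؟", ":"}

def make_pair(punctuated_text):
    """Split a punctuated string into (source, target) token lists."""
    tokens = punctuated_text.split()
    source = [t for t in tokens if t not in PUNCTUATION]
    target = tokens  # the target side retains the punctuation tokens
    return source, target

source, target = make_pair("قال الأستاذ : الغرض معرفة الأدلة ،")
print(len(source), len(target))  # 5 7
```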

2. Installation

  • Python 3.x
  • Python libraries:
    • matplotlib
    • tensorflow
    • sklearn
    • pandas
    • keras
    • numpy
    • nltk
    • tqdm
  • Standard-library modules (no installation needed): string, time, re

3. Data Exploration and Visualization

We use the Tashkeela dataset in this project: it contains over 75 million Arabic words obtained from 97 books, and it includes different types of punctuation. We focus on five punctuation marks: the comma (،), the dot (.), the semicolon (؛), the question mark (؟), and the colon (:). Below is a sample text from the book Al-Bahr Al-Muhit.

وَقَالَ الْأُسْتَاذُ أَبُو مَنْصُورٍ : الْغَرَضُ مِنْ أُصُولِ الْفِقْهِ مَعْرِفَةُ أَدِلَّةِ أَحْكَامِ الْفِقْهِ ، وَمَعْرِفَةُ طُرُقِ الْأَدِلَّةِ ، لِأَنَّ مَنْ اسْتَقْرَأَ أَبْوَابَهُ وَجَدَهَا إمَّا دَلِيلًا عَلَى حُكْمٍ أَوْ طَرِيقًا يُتَوَصَّلُ بِهِ إلَى مَعْرِفَةِ الدَّلِيلِ ، وَذَلِكَ كَمَعْرِفَةِ النَّصِّ ، وَالْإِجْمَاعِ ، وَالْقِيَاسِ ، وَالْعِلَلِ ، وَالرُّجْحَانِ . وَهَذِهِ كُلُّهَا مَعْرِفَةٌ مُحِيطَةٌ بِالْأَدِلَّةِ الْمَنْصُوصَةِ عَلَى الْأَحْكَامِ . وَمَعْرِفَةُ الْأَخْبَارِ وَطُرُقِهَا مَعْرِفَةٌ بِالطُّرُقِ الْمُوَصِّلَةِ إلَى الدَّلَائِلِ الْمَنْصُوصَةِ عَلَى الْأَحْكَامِ .وَهَاهُنَا أُمُورٌ : أَحَدُهَا : أَنَّ الْأَسْمَاءَ الْمُسْتَعْمَلَةَ فِي هَذِهِ الْعُلُومِ . كَأُصُولِ الْفِقْهِ ، وَالْفِقْهِ ، وَالنَّحْوِ ، وَاللُّغَةِ ، وَالطِّبِّ . هَلْ هِيَ مَنْقُولَةٌ أَوْ لَا ؟
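A quick way to explore how the five marks are distributed in a text is to count them directly. The snippet below is an assumed exploration sketch, not the repository's code, applied to a shortened fragment of the sample above.

```python
# Small exploration sketch (assumed, not the repository's code) that
# counts the five target punctuation marks in a text sample.
from collections import Counter

# Arabic comma, dot, Arabic semicolon, Arabic question mark, colon.
MARKS = ("،", ".", "؛", "؟", ":")

def punctuation_counts(text):
    """Return a Counter of the target punctuation marks found in text."""
    return Counter(ch for ch in text if ch in MARKS)

sample = "وَقَالَ الْأُسْتَاذُ : الْغَرَضُ ، وَالْقِيَاسِ . هَلْ هِيَ مَنْقُولَةٌ ؟"
counts = punctuation_counts(sample)
print(counts["،"], counts["."], counts["؟"], counts[":"])  # 1 1 1 1
```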

4. Implementation

In this project, we used a sequence-to-sequence (seq2seq) model, the same technique used in the Neural Machine Translation (NMT) tutorial provided by TensorFlow. The input sequence passes through the encoder, which gives us the encoder output and the encoder hidden state. Bahdanau attention is then computed over the encoder outputs.
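The attention step can be sketched as follows. This is a minimal reimplementation in the style of the TensorFlow NMT tutorial, not the repository's exact code; the layer sizes and toy tensor shapes are illustrative.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau) attention over the encoder outputs."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.W2 = tf.keras.layers.Dense(units)  # projects the encoder outputs
        self.V = tf.keras.layers.Dense(1)       # scores each source position

    def call(self, query, values):
        # query: (batch, hidden) decoder state
        # values: (batch, src_len, hidden) encoder outputs
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
        attention_weights = tf.nn.softmax(score, axis=1)  # over source positions
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights

# Toy shapes only: batch of 2, source length 7, hidden size 16.
encoder_output = tf.random.normal((2, 7, 16))
decoder_hidden = tf.random.normal((2, 16))
context, weights = BahdanauAttention(8)(decoder_hidden, encoder_output)
print(context.shape, weights.shape)  # (2, 16) (2, 7, 1)
```

The context vector is then concatenated with the decoder input at each step, which is the usual design in the NMT tutorial.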

5. Results and Metrics

We trained our model for 10 epochs with a batch size of 128, using the categorical cross-entropy loss function and the Adam optimizer. The model was evaluated with the Bilingual Evaluation Understudy (BLEU) score. The figure below shows the prediction results for two examples.
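The BLEU step can be sketched with NLTK (which is listed in the installation requirements). The token sequences below are illustrative, not actual model outputs.

```python
# Hedged sketch of the BLEU evaluation step; the reference and
# prediction tokens are illustrative, not real model outputs.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["قال", "الأستاذ", "،", "الغرض", "معرفة", "الأدلة", "."]]
prediction = ["قال", "الأستاذ", "،", "الغرض", "معرفة", "الأدلة", "."]

# Smoothing avoids zero scores when some n-gram order has no match.
smooth = SmoothingFunction().method1
score = sentence_bleu(reference, prediction, smoothing_function=smooth)
print(round(score, 3))  # identical sequences score 1.0
```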

6. Acknowledgements

I wish to thank the Tashkeela project for the dataset, and Udacity for their advice. For more details about this project, you can read this article.
