
Hoda233/Arabic-Text-Diacritization

 
 


Arabic-Text-Diacritization

Introduction

Diacritics are marks for short vowels of constant length; they are pronounced but usually left out of written Arabic. The same Arabic word can have different meanings and pronunciations depending on how it is diacritized.

In this project, we implement a pipeline to predict the diacritic of each character in an Arabic text using Natural Language Processing techniques.

Project Pipeline

(Figure: project pipeline diagram)

Project Phases

Data Processing

  • Split the text into sentences at punctuation marks.
  • Split long sentences into smaller ones of no more than 500 characters (not counting diacritics).
  • Remove all non-Arabic characters.
  • Remove diacritics.
  • Start each sentence with <s> and end it with </s>; both tokens are assigned the "no diacritic" class (the empty string '').
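
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual code: the diacritic set (U+064B to U+0652), the letter range (U+0621 to U+064A), the list of sentence-ending punctuation marks, and the helper names `split_long`, `to_chars_and_labels`, and `preprocess` are all hypothetical choices made for this example. The sketch also records the stripped diacritics as per-character labels, which is one plausible way to obtain training targets.

```python
import re

# Assumption: the eight common Arabic marks, fathatan..sukun (U+064B-U+0652).
DIACRITICS = "\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652"
DIACRITIC_RE = re.compile(f"[{DIACRITICS}]")
# Assumption: "Arabic characters" are the letters U+0621-U+064A plus whitespace.
NON_ARABIC_RE = re.compile(f"[^\u0621-\u064a{DIACRITICS}\\s]")
# Assumption: sentences end at these Latin and Arabic punctuation marks.
PUNCT_RE = re.compile("[.!?:\u060c\u061b\u061f]+")


def split_long(sentence, max_len=500):
    """Greedily pack words into chunks of at most max_len characters,
    not counting diacritics. A single word longer than max_len is kept whole."""
    chunks, current, length = [], [], 0
    for word in sentence.split():
        n = len(DIACRITIC_RE.sub("", word))
        extra = n + (1 if current else 0)  # +1 for the joining space
        if current and length + extra > max_len:
            chunks.append(" ".join(current))
            current, length, extra = [], 0, n
        current.append(word)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def to_chars_and_labels(sentence):
    """Strip diacritics and return two parallel lists: the base characters
    (wrapped in <s> ... </s>) and the diacritic label of each character."""
    chars, labels = ["<s>"], [""]
    for ch in sentence:
        if ch in DIACRITICS:
            if len(chars) > 1:       # ignore a mark with no preceding character
                labels[-1] += ch     # stacked marks (e.g. shadda + vowel) accumulate
        else:
            chars.append(ch)
            labels.append("")
    chars.append("</s>")
    labels.append("")                # <s> and </s> both get the empty class
    return chars, labels


def preprocess(text, max_len=500):
    """Run the full sketch: split, chunk, clean, then separate characters and labels."""
    examples = []
    for sentence in PUNCT_RE.split(text):
        for chunk in split_long(sentence, max_len):
            cleaned = " ".join(NON_ARABIC_RE.sub("", chunk).split())
            if cleaned:
                examples.append(to_chars_and_labels(cleaned))
    return examples
```

Each returned example is a pair of equal-length lists, so position i of the label list holds the diacritic class of character i.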

Feature extraction

  • One-hot encoding at the character level
  • Trainable embeddings at the character level
  • Word2vec embeddings plus one-hot encoding (word and character level)
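
As a rough illustration of the first option, character-level one-hot encoding can be written in a few lines of plain Python. The helper names `build_vocab` and `one_hot` and the reserved `<unk>` index are assumptions made for this sketch, not names taken from the project.

```python
def build_vocab(sequences):
    """Map each distinct symbol to an integer index; index 0 is reserved
    for symbols that were not seen when the vocabulary was built."""
    vocab = {"<unk>": 0}
    for seq in sequences:
        for sym in seq:
            vocab.setdefault(sym, len(vocab))
    return vocab


def one_hot(seq, vocab):
    """Return one row per symbol, with a single 1 at the symbol's index."""
    rows = []
    for sym in seq:
        row = [0] * len(vocab)
        row[vocab.get(sym, 0)] = 1
        rows.append(row)
    return rows
```

A trainable character embedding typically starts from the same integer indices, but uses them to look up rows of a learned matrix instead of expanding them into sparse 0/1 vectors.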

Model

  • BLSTM (bidirectional LSTM)
  • RNN
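
To show what "bidirectional" means here, the following sketch runs a simple (Elman) RNN over a sequence in both directions and concatenates the two hidden states at each position, so every character sees context from its left and its right. It is deliberately not an LSTM and not the project's model: the weights are passed in by hand, there are no gates, bias terms, or training, and the function names `rnn_pass` and `bidirectional` are hypothetical.

```python
import math


def rnn_pass(inputs, w_in, w_rec):
    """Simple RNN: h_t = tanh(w_in @ x_t + w_rec @ h_{t-1}), with h_0 = 0.
    w_in has one row per hidden unit; w_rec is hidden x hidden."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
        states.append(hidden)
    return states


def bidirectional(inputs, fwd, bwd):
    """Concatenate, per time step, the states of a left-to-right pass
    and a right-to-left pass over the same inputs."""
    forward = rnn_pass(inputs, *fwd)
    backward = rnn_pass(inputs[::-1], *bwd)[::-1]  # re-align to input order
    return [f + b for f, b in zip(forward, backward)]
```

In a diacritization model, each concatenated state would then feed a classifier that picks one diacritic class per character.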


Evaluation

Diacritic Error Rate (DER) = 1 - Accuracy
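
A minimal sketch of this metric, assuming predictions and gold labels are aligned lists with one diacritic class per character. Whether the project excludes some positions (for example spaces or the <s> and </s> tokens) from the count is not stated, so this version counts every position; the function name `diacritic_error_rate` is hypothetical.

```python
def diacritic_error_rate(predicted, gold):
    """Fraction of positions whose predicted diacritic differs from the gold
    one, which equals 1 - accuracy over the same positions."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute DER on an empty sequence")
    errors = sum(p != g for p, g in zip(predicted, gold))
    return errors / len(gold)
```

For example, one wrong label out of four positions gives a DER of 0.25, that is, an accuracy of 0.75.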


Results

Final model used for the test-set submission on Kaggle: BLSTM with a character embedding layer.

Team: The Powerpuff Girls


Demo video of the deployed model

nlp.mp4

Contributors

Asmaa Adel
Samaa Hazem
Norhan reda
HodaGamal
