Arabic Word Embedding

This project was implemented as part of the "Natural Language Processing and Deep Learning" course during my master's degree. In this project I created two word embedding models, Word2Vec-SkipGram and GLoVE, using the ArWiki Dump 2018 dataset. The SkipGram model is improved by tuning the vector size and window values, and each model is then evaluated in four stages, with the results visualized using t-SNE. The project report provides further information about the experiments, the results analysis, and the discussion.

Table of contents

Workflow

Step 1 preprocessing

Step 1.1: reading the corpus
Parse the compressed Arabic wiki articles (.bz2 format) using the Gensim utility WikiCorpus, making sure the encoding is set to UTF-8 so the Arabic text is decoded correctly.
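A minimal sketch of this step (the dump filename below is illustrative; substitute the actual 2018 dump file):

from gensim.corpora import WikiCorpus

# hypothetical filename for the 2018 Arabic Wikipedia dump
dump_path = "arwiki-20180120-pages-articles.xml.bz2"

# WikiCorpus parses the compressed .bz2 dump and yields each article
# as a list of UTF-8 decoded tokens; dictionary={} skips vocabulary building
wiki = WikiCorpus(dump_path, dictionary={})

with open("corpus.txt", "w", encoding="utf-8") as out:
    for tokens in wiki.get_texts():
        out.write(" ".join(tokens) + "\n")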
Step 1.2: remove unwanted characters from the scanned articles

  • Non-Arabic characters (mainly English letters, upper and lower case)
  • Digits [0-9]
  • Extra spaces
  • Tashkeel (Arabic diacritics) and tatweel (the elongation character)
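A sketch of this cleaning step using pyarabic and regular expressions (the helper name clean_line is illustrative):

import re
from pyarabic import araby

def clean_line(text):
    text = araby.strip_tashkeel(text)          # remove tashkeel (diacritics)
    text = araby.strip_tatweel(text)           # remove tatweel (elongation)
    text = re.sub(r"[A-Za-z]+", " ", text)     # drop non-Arabic (Latin) letters
    text = re.sub(r"[0-9]+", " ", text)        # drop digits
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra spaces
    return text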
Step 1.3: save the output to corpus_cleaned.txt
The preprocessing output is saved to disk so that this cleaning step only needs to be executed once. Note that corpus_cleaned.txt is omitted from the repository because it exceeds GitHub's file size limit.

Step 2 define parameters for tuning

gensim.models.Word2Vec(sentences,vector_size,window,sg,workers)


  • List of used Training parameters:
    • sentences: the training corpus (an iterable of tokenized sentences)
    • vector_size (int, optional) – Dimensionality of the word vectors
    • window (int, optional) – Maximum distance between the current and predicted word within a sentence
    • sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW
    • workers (int, optional) – Use this many worker threads to train the model (faster training on multicore machines)
  • SkipGram List of parameters tuning:
    • vector_size_list=[500,1000]
    • window_list=[10,15,20]
  • GLoVE List of parameters tuning
    • learning rate=[0.01,0.05]
    • window_list=[10,15]
    As is common practice in the NLP research community, window sizes start from 5; we therefore tried 10, 15, and 20 for SkipGram, and 10 and 15 for GLoVE. Another parameter, dependent on the training corpus, is the embedding vector size, which was tested at 500 and 1000. Unfortunately (and as expected), the value 1000 caused a memory error, as the environment could not allocate enough space to run either algorithm, so the size was fixed at 500. Lastly, some GLoVE-specific parameters were also experimented with, such as the learning rate (0.01 and 0.05), while the number of epochs was fixed at 50 to avoid excessive runtime. It is worth mentioning that the experiments cover 3 variations of the SkipGram model and 4 variations of the GLoVE model; the results discussed are chosen from the full results, which are available and can be reproduced using the provided code.
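A sketch of the SkipGram tuning loop under these settings (the file and model names are illustrative):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus_cleaned.txt")  # one preprocessed article per line

for vector_size in [500]:            # 1000 exceeded the available memory
    for window in [10, 15, 20]:
        model = Word2Vec(sentences,
                         vector_size=vector_size,  # dimensionality of the word vectors
                         window=window,            # maximum context distance
                         sg=1,                     # 1 = skip-gram
                         workers=4)                # parallel training threads
        model.save(f"skipgram_vs{vector_size}_w{window}.model")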

Step 3 build the word embedding models

Using the Gensim and GLoVE libraries in Python, the Arabic word embedding models are trained and saved. It is worth noting that the GLoVE library only worked on Colab with older versions of Python (3.7 and lower), as the library implementation was developed for those versions of Python.
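A sketch of the GLoVE side using the glove-python-binary API (parameter values follow Step 2; names are illustrative):

from glove import Corpus, Glove

# sentences: the same iterable of tokenized articles used for SkipGram
corpus = Corpus()
corpus.fit(sentences, window=10)          # build the co-occurrence matrix
glove = Glove(no_components=500,          # embedding vector size
              learning_rate=0.05)
glove.fit(corpus.matrix, epochs=50,       # epochs fixed at 50
          no_threads=4)
glove.add_dictionary(corpus.dictionary)   # attach the word-to-row mapping
glove.save("glove_w10_lr005.model")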

Step 4 evaluate the performance

Test 1: Most Similar Words
Find the top-N most similar words: positive words contribute positively towards the similarity, negative words negatively (see the Gensim most_similar documentation).

  • Pick 8 Arabic words and, for each one, ask each model for the 10 most similar words. Plot the results using t-SNE (or a scatterplot) and discuss them, as in the sketch below.
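A sketch of one such query and its t-SNE plot on the trained Gensim model (the query word and plot details are illustrative):

import numpy as np
import matplotlib.pyplot as plt
import arabic_reshaper
from bidi.algorithm import get_display
from sklearn.manifold import TSNE

word = "ملك"                                        # example query word ("king")
neighbours = model.wv.most_similar(word, topn=10)   # list of (word, cosine score)

labels = [word] + [w for w, _ in neighbours]
vectors = np.array([model.wv[w] for w in labels])

# project the 500-d vectors down to 2-D for plotting
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)
for (x, y), label in zip(coords, labels):
    plt.scatter(x, y)
    # reshape + bidi so Arabic renders correctly in matplotlib
    plt.annotate(get_display(arabic_reshaper.reshape(label)), (x, y))
plt.show()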

Test 2: Odd-One-Out

  • We ask the model for the word that does not belong to the list (see the Gensim doesnt_match documentation)
  • Pick 5-10 triplets of Arabic words and, for each one, ask each model to pick the word in the triplet that does not belong. Discuss the results.
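For example (triplet chosen for illustration):

# two related words ("king", "queen") and one odd word ("apple")
triplet = ["ملك", "ملكة", "تفاحة"]
print(model.wv.doesnt_match(triplet))   # returns the word farthest from the mean vector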

Test 3: Measuring Sentence Similarity
Find similar sentences by computing the cosine similarity of their two embedding vectors, as in Paul Minogue's blog.
Write 5 sentences in Arabic. For each sentence, pick 2-3 words and replace them with their synonyms or antonyms. Use your embeddings to compute the similarity between each sentence and its modified version. Discuss the results.
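A minimal sketch of the averaged-embedding approach (helper names are illustrative):

import numpy as np

def sentence_vector(model, sentence):
    # average the embeddings of the in-vocabulary tokens
    tokens = [t for t in sentence.split() if t in model.wv]
    return np.mean([model.wv[t] for t in tokens], axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# original / modified: an Arabic sentence and its synonym- or antonym-substituted version
score = cosine_similarity(sentence_vector(model, original),
                          sentence_vector(model, modified))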

Test 4: Analogy

  • Syntax as in the Gensim most_similar documentation (positive and negative word lists)
  • Pick 5-10 cases of analogies in Arabic, like the one we used in class:
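For instance, the classic king - man + woman ≈ queen analogy (words chosen for illustration):

# "king" - "man" + "woman" should land near "queen"
result = model.wv.most_similar(positive=["ملك", "امرأة"], negative=["رجل"], topn=1)
print(result)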

Requirements

  • glove-python-binary
  • arabic-reshaper
  • python-bidi
  • pyarabic
  • gensim
  • matplotlib
  • seaborn
  • scikit-learn

References and Resources

All references and resources used in each step are documented in the .ipynb file in markdown cells.
