Ahmad-Zaki/Arabic_Dialect_Identification

Arabic Dialect Identification

Introduction

Arabic is spoken in many countries, but each country has its own dialect. The aim of this project is to build a model that predicts the dialect of a given text.

Environment setup

# using pip
pip install -r requirements.txt

# using Conda
conda create --name <env_name> --file requirements.txt

Method

  • Start by fetching the text data from the API using the fetch_data.py script.
python fetch_data.py

Machine Learning Approach

  • Preprocessing:

    1- Text Normalization: Done using ArabicTextNormalizer, found in preprocessing.py

    2- Vectorization: Done using a TF-IDF vectorizer with ngram_range = (1, 5) and min_df = 10.

    3- Split the dataset into training, validation, and testing splits with (8:1:1) ratio.

    • Note: Other preprocessing methods were also tested; you can find them in preprocessing.py.
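As a rough illustration of what Arabic text normalization usually involves, the sketch below strips diacritics and the tatweel character and unifies a few letter variants. It is not the ArabicTextNormalizer from preprocessing.py: the function name normalize_arabic and the specific rules are assumptions chosen for illustration, and the repository's normalizer may apply different ones.

```python
import re

# Hypothetical helper, NOT the ArabicTextNormalizer from preprocessing.py.
# The rules below are common choices and are assumptions for illustration.
_DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun
_TATWEEL = "\u0640"                           # elongation character
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> yeh
})


def normalize_arabic(text: str) -> str:
    """Strip diacritics and tatweel, unify alef/yeh variants, squeeze spaces."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = text.translate(_LETTER_MAP)
    return " ".join(text.split())
```

Normalizing before vectorization means spelling variants of the same word map to the same feature, which matters for short, informally written texts.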
  • Training: For the machine learning approach, a Logistic Regression model is used. You can train the model using the trainer.py script. For more details about the training process, see training_notebook.ipynb.

    • Classification Report on test data:

      Dialect       precision  recall  f1-score  support
      AE               0.4283  0.3802    0.4028     2630
      BH               0.3661  0.2864    0.3214     2629
      DZ               0.5851  0.4376    0.5007     1618
      EG               0.6504  0.8522    0.7378     5764
      IQ               0.6775  0.4987    0.5745     1550
      JO               0.4379  0.3044    0.3592     2792
      KW               0.4177  0.6037    0.4938     4211
      LB               0.6045  0.6459    0.6245     2762
      LY               0.5952  0.6797    0.6347     3650
      MA               0.7735  0.5208    0.6225     1154
      OM               0.4402  0.2965    0.3544     1912
      PL               0.4358  0.5649    0.4920     4374
      QA               0.4537  0.4641    0.4589     3107
      SA               0.3917  0.4372    0.4132     2683
      SD               0.7652  0.5239    0.6220     1443
      SY               0.5165  0.2691    0.3538     1624
      TN               0.7488  0.3323    0.4603      924
      YE               0.5482  0.1259    0.2048      993
      accuracy              -       -    0.5168    45820
      macro avg        0.5465  0.4569    0.4795    45820
      weighted avg     0.5225  0.5168    0.5055    45820
    • Confusion matrix for test data:

    [Figure: confusion matrix for test data]
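The machine learning steps described above (8:1:1 split, TF-IDF features, Logistic Regression) can be sketched with scikit-learn as follows. This is a minimal sketch, not the code in trainer.py: the helper names split_8_1_1 and build_model, the random seed, and max_iter=1000 are assumptions, and the vectorizer uses scikit-learn's default word-level analyzer because the text does not say whether word or character n-grams were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def split_8_1_1(texts, labels, seed=0):
    """Split into train/validation/test parts with an 8:1:1 ratio."""
    # First hold out 20% of the data, then halve it into validation and test.
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.2, random_state=seed)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)


def build_model(min_df=10):
    """TF-IDF features over 1- to 5-grams followed by logistic regression."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 5), min_df=min_df)
    classifier = LogisticRegression(max_iter=1000)  # max_iter is an assumption
    return make_pipeline(vectorizer, classifier)


if __name__ == "__main__":
    # Placeholder data only; min_df is lowered to 1 because the corpus is tiny.
    texts = ["alpha beta gamma", "delta epsilon zeta"] * 10
    labels = ["d1", "d2"] * 10
    (x_tr, y_tr), _, (x_te, y_te) = split_8_1_1(texts, labels)
    model = build_model(min_df=1).fit(x_tr, y_tr)
    print(model.score(x_te, y_te))
```

With min_df = 10, n-grams that occur in fewer than ten documents are dropped, which keeps the (1, 5) n-gram vocabulary manageable on a large corpus.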
  • Predictions: You can get predictions for any text through the provided API. To start it, run run.sh or type uvicorn api:app in the terminal, then send a POST request to 127.0.0.1:8000/predict.

    • Request body sample:
{
    "text": "متهيالي دي شكولاته الهالوين فين المحل ده"
}
  • Response sample:
{
  "text": "متهيالي دي شكولاته الهالوين فين المحل ده",
  "predictions": {
    "AE": 0,
    "BH": 0,
    "DZ": 0.001,
    "EG": 0.98,
    "IQ": 0,
    "JO": 0,
    "KW": 0,
    "LB": 0,
    "LY": 0,
    "MA": 0.002,
    "OM": 0,
    "PL": 0.003,
    "QA": 0,
    "SA": 0,
    "SD": 0.007,
    "SY": 0,
    "TN": 0.005,
    "YE": 0
  }
}
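A small client for the /predict endpoint might look like the sketch below, which uses only the Python standard library. The function names are hypothetical, and the base URL assumes the default uvicorn address given in the text; the request and response shapes follow the samples above.

```python
import json
from urllib import request

API_URL = "http://127.0.0.1:8000"  # default uvicorn address from the text


def build_predict_request(text: str) -> request.Request:
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(
        f"{API_URL}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def top_dialect(response: dict) -> tuple[str, float]:
    """Return the (label, probability) pair with the highest probability."""
    scores = response["predictions"]
    label = max(scores, key=scores.get)
    return label, scores[label]


def predict(text: str) -> dict:
    """Send the request; this requires the API server to be running."""
    with request.urlopen(build_predict_request(text)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Applied to the response sample above, top_dialect returns ("EG", 0.98).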
  • To predict the dialect of a batch of texts, send a POST request to 127.0.0.1:8000/predict-batch
    • Request body sample:
{
  "texts": [
    "متهيالي دي شكولاته الهالوين فين المحل ده",
    "شلونك خوي؟"
  ]
}
  • Response sample:
{
  "predictions": [
    {
      "text": "متهيالي دي شكولاته الهالوين فين المحل ده",
      "predictions": {
        "AE": 0,
        "BH": 0,
        "DZ": 0.001,
        "EG": 0.98,
        "IQ": 0,
        "JO": 0,
        "KW": 0,
        "LB": 0,
        "LY": 0,
        "MA": 0.002,
        "OM": 0,
        "PL": 0.003,
        "QA": 0,
        "SA": 0,
        "SD": 0.007,
        "SY": 0,
        "TN": 0.005,
        "YE": 0
      }
    },
    {
      "text": "شلونك خوي؟",
      "predictions": {
        "AE": 0.012,
        "BH": 0.199,
        "DZ": 0.004,
        "EG": 0.006,
        "IQ": 0.098,
        "JO": 0.018,
        "KW": 0.112,
        "LB": 0.006,
        "LY": 0.48,
        "MA": 0.004,
        "OM": 0.015,
        "PL": 0.007,
        "QA": 0.008,
        "SA": 0.007,
        "SD": 0.005,
        "SY": 0.008,
        "TN": 0.004,
        "YE": 0.007
      }
    }
  ]
}
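For batch requests, it is often enough to keep only the most probable label per text. The sketch below posts a list of texts to /predict-batch and reduces the response to (text, label) pairs. The helper names predict_batch and summarize are hypothetical, and the base URL assumes the default uvicorn address.

```python
import json
from urllib import request


def predict_batch(texts, base_url="http://127.0.0.1:8000"):
    """POST a list of texts to /predict-batch (the server must be running)."""
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    req = request.Request(
        f"{base_url}/predict-batch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(batch_response):
    """Map each input text to its most probable dialect label."""
    return [
        (item["text"], max(item["predictions"], key=item["predictions"].get))
        for item in batch_response["predictions"]
    ]
```

For the response sample above, summarize yields "EG" for the first text and "LY" for the second.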
  • To get the status of the model contained in the API, make a GET request to 127.0.0.1:8000/status
    • If a trained model is available, the response should look like this:
{
  "status": "Model Ready",
  "timestamp": "2022-03-13T13:14:45.789941",
  "classes": [
    "AE",
    "BH",
    "DZ",
    "EG",
    "IQ",
    "JO",
    "KW",
    "LB",
    "LY",
    "MA",
    "OM",
    "PL",
    "QA",
    "SA",
    "SD",
    "SY",
    "TN",
    "YE"
  ],
  "evaluation": {
    "AE": {
      "precision": 0.4282655246252677,
      "recall": 0.38022813688212925,
      "f1-score": 0.40281973816717015,
      "support": 2630
    },
    "BH": {
      "precision": 0.3660670879922217,
      "recall": 0.28642069227843286,
      "f1-score": 0.3213828425096031,
      "support": 2629
    },
    "DZ": {
      "precision": 0.5851239669421487,
      "recall": 0.43757725587144625,
      "f1-score": 0.5007072135785007,
      "support": 1618
    },
    "EG": {
      "precision": 0.6504237288135594,
      "recall": 0.8521859819569744,
      "f1-score": 0.7377590868128568,
      "support": 5764
    },
    "IQ": {
      "precision": 0.677475898334794,
      "recall": 0.49870967741935485,
      "f1-score": 0.5745076179858789,
      "support": 1550
    },
    "JO": {
      "precision": 0.43791859866048427,
      "recall": 0.3044412607449857,
      "f1-score": 0.3591802239594338,
      "support": 2792
    },
    "KW": {
      "precision": 0.41774856203779787,
      "recall": 0.603657088577535,
      "f1-score": 0.49378399378399374,
      "support": 4211
    },
    "LB": {
      "precision": 0.6045408336157235,
      "recall": 0.6459087617668356,
      "f1-score": 0.624540521617364,
      "support": 2762
    },
    "LY": {
      "precision": 0.5952495201535508,
      "recall": 0.6797260273972603,
      "f1-score": 0.634689178818112,
      "support": 3650
    },
    "MA": {
      "precision": 0.7734877734877735,
      "recall": 0.5207972270363952,
      "f1-score": 0.6224754013464527,
      "support": 1154
    },
    "OM": {
      "precision": 0.44021739130434784,
      "recall": 0.2965481171548117,
      "f1-score": 0.354375,
      "support": 1912
    },
    "PL": {
      "precision": 0.43580246913580245,
      "recall": 0.5649291266575217,
      "f1-score": 0.49203504579848667,
      "support": 4374
    },
    "QA": {
      "precision": 0.45374449339207046,
      "recall": 0.4641132925651754,
      "f1-score": 0.45887032617342877,
      "support": 3107
    },
    "SA": {
      "precision": 0.391652754590985,
      "recall": 0.4371971673499814,
      "f1-score": 0.41317365269461076,
      "support": 2683
    },
    "SD": {
      "precision": 0.7651821862348178,
      "recall": 0.5239085239085239,
      "f1-score": 0.6219662690250926,
      "support": 1443
    },
    "SY": {
      "precision": 0.516548463356974,
      "recall": 0.26908866995073893,
      "f1-score": 0.3538461538461538,
      "support": 1624
    },
    "TN": {
      "precision": 0.748780487804878,
      "recall": 0.33225108225108224,
      "f1-score": 0.46026986506746626,
      "support": 924
    },
    "YE": {
      "precision": 0.5482456140350878,
      "recall": 0.12588116817724068,
      "f1-score": 0.20475020475020475,
      "support": 993
    },
    "accuracy": 0.5168485377564382,
    "macro avg": {
      "precision": 0.5464708530287936,
      "recall": 0.4568649587748014,
      "f1-score": 0.47950735199637834,
      "support": 45820
    },
    "weighted avg": {
      "precision": 0.522461860325834,
      "recall": 0.5168485377564382,
      "f1-score": 0.5055483538291743,
      "support": 45820
    }
  }
}
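The "macro avg" and "weighted avg" entries in the evaluation can be recomputed from the per-class values: the macro average is the plain mean over classes, while the weighted average weights each class by its support. The sketch below does this for the f1-score, using the values rounded to four decimals, so the results only match the reported 0.4795 and 0.5055 approximately.

```python
# (f1-score, support) per dialect, rounded to four decimals.
REPORT = {
    "AE": (0.4028, 2630), "BH": (0.3214, 2629), "DZ": (0.5007, 1618),
    "EG": (0.7378, 5764), "IQ": (0.5745, 1550), "JO": (0.3592, 2792),
    "KW": (0.4938, 4211), "LB": (0.6245, 2762), "LY": (0.6347, 3650),
    "MA": (0.6225, 1154), "OM": (0.3544, 1912), "PL": (0.4920, 4374),
    "QA": (0.4589, 3107), "SA": (0.4132, 2683), "SD": (0.6220, 1443),
    "SY": (0.3538, 1624), "TN": (0.4603, 924), "YE": (0.2048, 993),
}


def macro_average(report):
    """Unweighted mean of the per-class scores."""
    return sum(score for score, _ in report.values()) / len(report)


def weighted_average(report):
    """Mean of the per-class scores weighted by class support."""
    total = sum(support for _, support in report.values())
    return sum(score * support for score, support in report.values()) / total


print(macro_average(REPORT), weighted_average(REPORT))
```

The weighted average is higher than the macro average here because the largest class (EG) also has the best f1-score, while small classes such as YE and TN score lower.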
  • To train the model on a new dataset, call the API by a POST request to 127.0.0.1:8000/train
    • Request body sample:
{
  "texts": [
    "text1",
    "text2"
  ],
  "labels": [
    "label1",
    "label2"
  ]
}
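The training request pairs each text with a label by position, so the two lists should have the same length. A minimal sketch for building the request body is shown below; build_train_payload is a hypothetical helper, and the length check is an assumption about what the endpoint expects rather than documented behavior.

```python
import json


def build_train_payload(texts, labels):
    """Pair texts with labels and return the JSON body for POST /train."""
    texts, labels = list(texts), list(labels)
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    if not texts:
        raise ValueError("at least one training example is required")
    # ensure_ascii=False keeps Arabic text readable in the serialized body.
    return json.dumps({"texts": texts, "labels": labels}, ensure_ascii=False)
```

Checking the lengths on the client side catches misaligned data before a training run starts.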

Deep Learning Approach

  • Preprocessing:

    1- Text Normalization: Done using ArabicTextNormalizer, found in preprocessing.py

    2- Tokenization and padding: Done using Tokenizer with num_words=100000 and a maximum sequence length of 50.

    3- Split the dataset into training, validation, and testing splits with (8:1:1) ratio.
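To make the tokenization and padding step concrete, the sketch below reimplements the idea in plain Python: words are ranked by frequency, the most frequent ones receive integer ids, and every sequence is brought to a fixed length. It is a simplified stand-in, not the Keras Tokenizer itself. The function names are hypothetical, and left-padding with left-truncation mirrors the Keras pad_sequences defaults, which is an assumption about how the project was configured.

```python
from collections import Counter


def fit_word_index(texts, num_words):
    """Assign ids 1..num_words-1 to the most frequent words (0 is padding)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words - 1)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}


def encode(text, word_index, max_len):
    """Map words to ids, drop unknown words, then pad/truncate to max_len."""
    ids = [word_index[w] for w in text.split() if w in word_index]
    ids = ids[-max_len:]                     # keep the last max_len tokens
    return [0] * (max_len - len(ids)) + ids  # left-pad with zeros
```

In the project's setting this corresponds to num_words = 100000 and max_len = 50, so every text becomes a vector of 50 integer ids.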

  • Model Structure:

    • Embedding layer with dim = 100.
    • LSTM layer with 100 nodes.
    • Dense layer with 18 nodes and softmax activation function.
    Layer (type)      Output Shape     Param #
    Embedding         (None, 50, 100)  10,000,000
    SpatialDropout1D  (None, 50, 100)  0
    LSTM              (None, 100)      80,400
    Dense             (None, 18)       1,818
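The parameter counts follow directly from the layer sizes. The short calculation below uses the standard formulas for an embedding layer, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer with bias; the constant names are illustrative.

```python
VOCAB_SIZE = 100_000  # num_words used for tokenization
EMBED_DIM = 100       # embedding dimension
LSTM_UNITS = 100      # LSTM layer size
NUM_CLASSES = 18      # one output per dialect

# One EMBED_DIM-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBED_DIM
# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (LSTM_UNITS * (EMBED_DIM + LSTM_UNITS) + LSTM_UNITS)
# Weight matrix plus one bias per class.
dense_params = LSTM_UNITS * NUM_CLASSES + NUM_CLASSES

print(embedding_params, lstm_params, dense_params)  # prints: 10000000 80400 1818
```

The embedding layer accounts for almost all of the model's parameters, so the vocabulary size is the main lever on model size.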
  • Classification Report on test data:

    Dialect       precision  recall  f1-score  support
    AE               0.4324  0.4388    0.4356     2630
    BH               0.3812  0.3203    0.3481     2629
    DZ               0.5576  0.5173    0.5367     1618
    EG               0.6837  0.8525    0.7589     5764
    IQ               0.6653  0.5232    0.5858     1550
    JO               0.4855  0.3177    0.3841     2792
    KW               0.4910  0.5526    0.5200     4211
    LB               0.5856  0.6883    0.6328     2762
    LY               0.6156  0.6964    0.6536     3650
    MA               0.7016  0.5971    0.6451     1154
    OM               0.4052  0.3766    0.3903     1912
    PL               0.4977  0.5103    0.5039     4374
    QA               0.4826  0.4718    0.4771     3107
    SA               0.3698  0.4562    0.4085     2683
    SD               0.7183  0.5495    0.6227     1443
    SY               0.4397  0.3436    0.3858     1624
    TN               0.6237  0.4665    0.5337      924
    YE               0.3995  0.1762    0.2446      993
    accuracy              -       -    0.5348    45820
    macro avg        0.5298  0.4919    0.5037    45820
    weighted avg     0.5299  0.5348    0.5269    45820
    • Confusion matrix for test data:
    [Figure: confusion matrix for test data]
