A Keras implementation of the deep learning network that jointly performs Word Segmentation and Part-of-Speech (POS) Tagging, introduced by Buoy et al. in the paper "Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning".

Socret360/joint-khmer-word-segmentation-and-pos-tagging

Joint Word Segmentation and POS Tagging in Keras

Requirements

tensorflow==2.7.0

Config File Layout

{
    "training": {
        "batch_size": 128, // The batch size used during training.
        "learning_rate": 0.001 // The learning rate.
    },
    "model": {
        "num_stacks": 2, // The number of stacked LSTM layers.
        "hidden_layers_dim": 100, // The number of units in each hidden LSTM layer.
        "max_sentence_length": 687 // The maximum number of characters in a sentence.
    }
}
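Standard JSON has no comment syntax, so the `//` annotations in the layout are presumably documentation only. The sketch below assumes the real config file is plain JSON without them, and that every key in the layout is required; both are assumptions, and `parse_config` is a hypothetical helper, not part of this repo.

```python
import json

# Keys taken from the layout above. Treating all of them as required
# is an assumption made for this sketch.
EXPECTED_KEYS = {
    "training": {"batch_size", "learning_rate"},
    "model": {"num_stacks", "hidden_layers_dim", "max_sentence_length"},
}


def parse_config(text):
    """Parse a config JSON string and check that the documented keys exist."""
    config = json.loads(text)
    for section, keys in EXPECTED_KEYS.items():
        if section not in config:
            raise KeyError(f"missing section: {section}")
        missing = keys - set(config[section])
        if missing:
            raise KeyError(f"missing keys in '{section}': {sorted(missing)}")
    return config


# Same values as the layout above, with the comments removed.
example = """
{
    "training": {"batch_size": 128, "learning_rate": 0.001},
    "model": {"num_stacks": 2, "hidden_layers_dim": 100, "max_sentence_length": 687}
}
"""
config = parse_config(example)
print(config["model"]["max_sentence_length"])
```

Checking the keys up front gives a clear error message instead of a failure deep inside model construction.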

Training on Custom Dataset

1. Dataset Format

This repo expects each dataset to be a text file in the format below, one sample per line. The sentence and its sentence_tag are separated by a tab (\t) character.

sentence  sentence_tag

Sample:

ផលិត^កម្ម	/NN/NS/NS/NS/NS/NS/NS/NS/NS
នេះគឺ_ជាទេព្យផល្គុន	/DT/NS/NS/VB/NS/NS/NS/NS/PN/NS/NS/NS/NS/PN/NS/NS/NS/NS/NS
...
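In the sample lines, the tag string appears to carry one tag per character: the first character of each word gets the word's POS tag and the remaining characters get NS. The sketch below relies on that reading, which is inferred from the sample and not stated by the repo. `parse_sample` is a hypothetical helper, and an ASCII line stands in for Khmer text.

```python
NO_SPACE_TAG = "NS"  # assumed: marks a character that continues the current word


def parse_sample(line):
    """Split one dataset line into a list of (word, pos_tag) pairs."""
    sentence, tag_string = line.rstrip("\n").split("\t")
    tags = tag_string.strip("/").split("/")
    if len(tags) != len(sentence):
        raise ValueError(f"{len(sentence)} characters but {len(tags)} tags")
    words = []
    for char, tag in zip(sentence, tags):
        if tag == NO_SPACE_TAG and words:
            # Continuation character: extend the word currently being built.
            words[-1][0] += char
        else:
            # Any other tag starts a new word carrying that POS tag.
            words.append([char, tag])
    return [(word, tag) for word, tag in words]


# ASCII stand-in for a Khmer sentence: "ab" tagged NN, "cde" tagged VB.
pairs = parse_sample("abcde\t/NN/NS/VB/NS/NS")
print(pairs)
```

The length check catches lines where the sentence and the tag string have drifted out of step, which is easy to introduce when editing a dataset by hand.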

2. Start Training

python train.py config train_set char_map pos_map --shuffle=False --epochs=300 --output_dir=output
positional arguments:
  config                path to config file.
  train_set             path to training dataset.
  char_map              path to characters map file.
  pos_map               path to pos map file.

optional arguments:
  -h, --help                show this help message and exit.
  --shuffle [SHUFFLE]       whether to shuffle the dataset when creating the batch.
  --epochs EPOCHS           the number of epochs to train.
  --output_dir OUTPUT_DIR   path to output directory.
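The source of train.py is not shown here, so the following is only a sketch of how an interface with these arguments could be declared with argparse. The default values, the `str_to_bool` helper, and the file names in the example call are assumptions for illustration.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False' style strings to bool.

    Plain bool("False") is True, so --shuffle=False needs explicit parsing.
    """
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Start the training process.")
    parser.add_argument("config", type=str, help="path to config file.")
    parser.add_argument("train_set", type=str, help="path to training dataset.")
    parser.add_argument("char_map", type=str, help="path to characters map file.")
    parser.add_argument("pos_map", type=str, help="path to pos map file.")
    # nargs="?" matches the "[SHUFFLE]" shown in the help text; defaults are assumed.
    parser.add_argument("--shuffle", type=str_to_bool, nargs="?", const=True,
                        default=False,
                        help="whether to shuffle the dataset when creating the batch.")
    parser.add_argument("--epochs", type=int, default=100,
                        help="the number of epochs to train.")
    parser.add_argument("--output_dir", type=str, default="output",
                        help="path to output directory.")
    return parser


# Hypothetical file names, mirroring the command shown above.
args = build_parser().parse_args([
    "config.json", "train.txt", "char_map.txt", "pos_map.txt",
    "--shuffle=False", "--epochs=300", "--output_dir=output",
])
print(args.shuffle, args.epochs, args.output_dir)
```

The explicit boolean conversion is the main design point: passing `type=bool` to argparse would treat any non-empty string, including "False", as true.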

Evaluating on Custom Dataset

1. Dataset Format

This repo expects each dataset to be a text file in the format below, one sample per line. The sentence and its sentence_tag are separated by a tab (\t) character.

sentence  sentence_tag

Sample:

ផលិត^កម្ម	/NN/NS/NS/NS/NS/NS/NS/NS/NS
នេះគឺ_ជាទេព្យផល្គុន	/DT/NS/NS/VB/NS/NS/NS/NS/PN/NS/NS/NS/NS/PN/NS/NS/NS/NS/NS
...

2. Start Evaluation Process

python evaluate.py config test_set char_map pos_map weights --output_dir=output
positional arguments:
  config                path to config file.
  test_set              path to test dataset.
  char_map              path to characters map file.
  pos_map               path to pos map file.
  weights               path to weights file.

optional arguments:
  -h, --help                show this help message and exit
  --output_dir OUTPUT_DIR   path to output directory.

About Pretrained Weights

Pretrained weights are available here. The network was trained for 12 epochs on a modified version of khPOS's train.all2 dataset. The original data consists of 12,000 sentences. For the pretrained weights, however, the sentences were split into sentence chunks, and the resulting dataset consists of 2,172,051 samples. See utils/prepare_khpos_dataset.py for the data conversion process.

Converting Pretrained Weights

You can convert the pretrained weights into a consolidated Keras format or into tflite using the command below:

python convert.py config weights char_map pos_map --output_type=keras --output_dir=output
positional arguments:
  config                path to config file.
  weights               path to the weight file.
  char_map              path to characters map file.
  pos_map               path to pos map file.

optional arguments:
  -h, --help                  show this help message and exit.
  --output_dir OUTPUT_DIR     path to output directory.
  --output_type OUTPUT_TYPE   the type of the output model. One of type: "keras", "tflite"

Pretrained Weights Evaluation

Test Set           POS Tag   Tag Accuracy (%)   POS Tagging Accuracy (%)
khPOS OPEN-TEST    AB        100.00             94.09
                   AUX       96.82
                   CC        96.67
                   CD        97.55
                   DT        97.87
                   IN        93.75
                   JJ        80.39
                   VB        91.44
                   NN        95.17
                   PN        93.88
                   PA        75.68
                   PRO       98.80
                   QT        80.00
                   RB        88.99
                   SYM       97.81
khPOS CLOSE-TEST   AB        100.00             99.20
                   AUX       100.00
                   CC        99.52
                   CD        100.00
                   DT        100.00
                   IN        99.81
                   JJ        99.15
                   VB        99.39
                   NN        99.88
                   PN        97.18
                   PA        87.32
                   PRO       99.74
                   QT        100.00
                   RB        99.14
                   SYM       100.00
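How evaluate.py computes these figures is not shown here. As a rough illustration of the two kinds of numbers in the table, the sketch below computes a per-tag accuracy (the share of gold occurrences of a tag that were predicted correctly) and an overall accuracy across all tags. Whether the repo counts per word or per character, and the `tag_accuracies` helper itself, are assumptions.

```python
from collections import Counter


def tag_accuracies(gold_tags, predicted_tags):
    """Return (per-tag accuracy in %, overall accuracy in %)."""
    if not gold_tags or len(gold_tags) != len(predicted_tags):
        raise ValueError("tag lists must be non-empty and of equal length")
    # How often each gold tag occurs, and how often it was predicted correctly.
    total = Counter(gold_tags)
    correct = Counter(g for g, p in zip(gold_tags, predicted_tags) if g == p)
    per_tag = {tag: 100.0 * correct[tag] / count for tag, count in total.items()}
    overall = 100.0 * sum(correct.values()) / len(gold_tags)
    return per_tag, overall


# Toy data: one NN is mistagged as PN.
gold = ["NN", "VB", "NN", "PN"]
pred = ["NN", "VB", "PN", "PN"]
per_tag, overall = tag_accuracies(gold, pred)
print(per_tag, overall)
```

Under this definition a rare tag can show a low per-tag figure while barely moving the overall accuracy, since the overall number is weighted by how often each tag occurs.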
