Joint Word Segmentation and POS Tagging in Keras

A Keras implementation of the deep learning network introduced by Buoy et al. in the paper Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning. The network performs word segmentation and part-of-speech (POS) tagging simultaneously.

Requirements

tensorflow==2.7.0

Config File Layout

{
    "training": {
        "batch_size": 128, // The batch size during training.
        "learning_rate": 0.001 // The learning rate.
    },
    "model": {
        "num_stacks": 2, // The number of stacked LSTM layers.
        "hidden_layers_dim": 100, // The number of units in each hidden LSTM layer.
        "max_sentence_length": 687 // The maximum number of characters in a sentence.
    }
}
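The layout above is annotated with // comments, which strict JSON parsers reject. The following is a minimal sketch of reading such a file in Python, assuming the comments may be present in the file and that no string value contains "//". The helper name load_config is hypothetical and is not taken from this repo.

```python
import json
import re


def load_config(text):
    """Parse a JSON config string after dropping // line comments.

    Assumption: no string value in the config contains "//".
    """
    # Remove everything from "//" to the end of each line.
    stripped = re.sub(r"//.*", "", text)
    return json.loads(stripped)


sample = """
{
    "training": {
        "batch_size": 128, // The batch size during training.
        "learning_rate": 0.001 // The learning rate.
    },
    "model": {
        "num_stacks": 2,
        "hidden_layers_dim": 100,
        "max_sentence_length": 687
    }
}
"""

config = load_config(sample)
print(config["training"]["batch_size"], config["model"]["num_stacks"])
```

If the config file contains no comments, json.load on the open file is enough.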

Training on Custom Dataset

1. Dataset Format

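This repo expects datasets as text files in the format below, one sample per line. The sentence and sentence_tag are separated by a \t character. In the samples, sentence_tag holds one /-prefixed tag for each character of sentence.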

sentence  sentence_tag

Sample:

ផលិត^កម្ម	/NN/NS/NS/NS/NS/NS/NS/NS/NS
នេះគឺ_ជាទេព្យផល្គុន	/DT/NS/NS/VB/NS/NS/NS/NS/PN/NS/NS/NS/NS/PN/NS/NS/NS/NS/NS
...
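The format above can be read with a few lines of Python. This is a sketch only: the function names parse_line and to_words are hypothetical, and the reading of NS as "this character does not start a new word" is an assumption inferred from the samples, where every word-initial character carries a POS tag and all other characters carry NS.

```python
def parse_line(line):
    """Split one dataset line into its sentence and per-character tags."""
    sentence, tag_str = line.rstrip("\n").split("\t")
    # The tag string starts with "/", so strip it before splitting.
    tags = tag_str.strip("/").split("/")
    if len(sentence) != len(tags):
        raise ValueError(
            f"expected {len(sentence)} tags, found {len(tags)}"
        )
    return sentence, tags


def to_words(sentence, tags, non_start="NS"):
    """Group characters into (word, pos_tag) pairs.

    Assumption: any tag other than `non_start` opens a new word.
    """
    words = []
    for ch, tag in zip(sentence, tags):
        if tag != non_start or not words:
            words.append([ch, tag])
        else:
            words[-1][0] += ch
    return [(word, tag) for word, tag in words]


# First sample line from above; "^" is counted as a character and has a tag.
line = "ផលិត^កម្ម\t/NN/NS/NS/NS/NS/NS/NS/NS/NS"
sentence, tags = parse_line(line)
print(to_words(sentence, tags))
```

Characters are counted as Unicode code points here, so Khmer vowel signs and the coeng sign each take their own tag, which matches the tag counts in the samples.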

2. Start Training

python train.py config train_set char_map pos_map --shuffle=False --epochs=300 --output_dir=output

positional arguments:
  config                path to config file.
  train_set             path to training dataset.
  char_map              path to characters map file.
  pos_map               path to pos map file.

optional arguments:
  -h, --help                show this help message and exit.
  --shuffle [SHUFFLE]       whether to shuffle the dataset when creating the batch.
  --epochs EPOCHS           the number of epochs to train.
  --output_dir OUTPUT_DIR   path to output directory.

Evaluating on Custom Dataset

1. Dataset Format

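This repo expects datasets as text files in the format below, one sample per line. The sentence and sentence_tag are separated by a \t character. In the samples, sentence_tag holds one /-prefixed tag for each character of sentence.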

sentence  sentence_tag

Sample:

ផលិត^កម្ម	/NN/NS/NS/NS/NS/NS/NS/NS/NS
នេះគឺ_ជាទេព្យផល្គុន	/DT/NS/NS/VB/NS/NS/NS/NS/PN/NS/NS/NS/NS/PN/NS/NS/NS/NS/NS
...

2. Start Evaluation Process

python evaluate.py config test_set char_map pos_map weights --output_dir=output

positional arguments:
  config                path to config file.
  test_set              path to test dataset.
  char_map              path to characters map file.
  pos_map               path to pos map file.
  weights               path to weights file.

optional arguments:
  -h, --help                show this help message and exit
  --output_dir OUTPUT_DIR   path to output directory.

About Pretrained Weights

You can access the pretrained weights here. The network was trained for 12 epochs on a modified version of khPOS's train.all2 dataset. The original data consists of 12,000 sentences. For the pretrained weights, however, each sentence was split into sentence chunks, and the resulting dataset consists of 2,172,051 samples. See utils/prepare_khpos_dataset.py for the data conversion process.

Converting Pretrained Weights

You can convert the pretrained weights into a consolidated Keras format or TFLite using the command below:

python convert.py config weights char_map pos_map --output_type=keras --output_dir=output

positional arguments:
  config                path to config file.
  weights               path to the weight file.
  char_map              path to characters map file.
  pos_map               path to pos map file.

optional arguments:
  -h, --help                  show this help message and exit.
  --output_dir OUTPUT_DIR     path to output directory.
  --output_type OUTPUT_TYPE   the type of the output model. One of: "keras", "tflite".

Pretrained Weights Evaluation

Test Set	POS Tag	Tag Accuracy (%)	POS Tagging Accuracy (%)
khPOS OPEN-TEST	AB	100.00	94.09
	AUX	96.82
	CC	96.67
	CD	97.55
	DT	97.87
	IN	93.75
	JJ	80.39
	VB	91.44
	NN	95.17
	PN	93.88
	PA	75.68
	PRO	98.80
	QT	80.00
	RB	88.99
	SYM	97.81
khPOS CLOSE-TEST	AB	100.00	99.20
	AUX	100.00
	CC	99.52
	CD	100.00
	DT	100.00
	IN	99.81
	JJ	99.15
	VB	99.39
	NN	99.88
	PN	97.18
	PA	87.32
	PRO	99.74
	QT	100.00
	RB	99.14
	SYM	100.00
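
The table reports an accuracy per POS tag and one overall accuracy per test set. How these figures were computed is not stated here, so the following is only a sketch of one common definition: per-tag accuracy is the share of gold occurrences of a tag that were predicted correctly, and overall accuracy is the share of all positions predicted correctly. The function name tag_accuracies is hypothetical, and whether NS positions are counted is left to the caller.

```python
from collections import Counter


def tag_accuracies(gold, pred):
    """Return ({tag: accuracy %}, overall accuracy %) for aligned tag lists."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    total = Counter()
    correct = Counter()
    for g, p in zip(gold, pred):
        total[g] += 1
        if g == p:
            correct[g] += 1
    per_tag = {tag: 100.0 * correct[tag] / total[tag] for tag in total}
    overall = 100.0 * sum(correct.values()) / len(gold)
    return per_tag, overall


gold = ["NN", "NN", "VB", "PN"]
pred = ["NN", "VB", "VB", "PN"]
per_tag, overall = tag_accuracies(gold, pred)
print(per_tag, overall)
```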

References

Buoy, R., Taing, N., & Kor, S. (2021). Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning. Retrieved from https://arxiv.org/abs/2103.16801
Loem, M. (2021, May 4). Joint Khmer Word Segmentation and POS tagging. Medium. Retrieved from https://towardsdatascience.com/joint-khmer-word-segmentation-and-pos-tagging-cad650e78d30
Ye, K. T., Vichet, C., & Yoshinori, S. (2017). Comparison of Six POS Tagging Methods on 12K Sentences Khmer Language POS Tagged Corpus. First Regional Conference on Optical character recognition and Natural language processing technologies for ASEAN languages (ONA 2017). Retrieved from https://github.com/ye-kyaw-thu/khPOS/blob/master/khpos.pdf

