Korean-POS-Tagger-LSTM

Korean POS Tagger Using Character-Level Sequence to Sequence Model

[1] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to Sequence Learning with Neural Networks," Advances in NIPS (2014), https://arxiv.org/abs/1409.3215.

[2] K. Cho et al., "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation," https://arxiv.org/pdf/1406.1078.pdf

[3] https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py

The model is the same as in [3], applied to a different task.

Character-Level Sequence-to-Sequence Model:

Input sequences: Korean text and the corresponding Korean POS-tag text.

An encoder LSTM turns the input sequence into 2 state vectors: the last LSTM states are kept and the outputs are discarded.

A decoder LSTM is trained to turn the target POS-tag sequence into the same sequence, offset by one time step into the future (teacher forcing). The decoder uses the encoder's state vectors as its initial state and learns to generate POS-tag [t+1] given POS-tag [t].
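The sketch below shows how such a training model can be built in Keras, following the lstm_seq2seq example in [3]. The values of latent_dim, num_encoder_tokens, and num_decoder_tokens are placeholders, not the settings used in this repository.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

latent_dim = 256            # assumed size of the LSTM state vectors
num_encoder_tokens = 1000   # assumed number of distinct Korean characters
num_decoder_tokens = 100    # assumed number of distinct POS-tag characters

# Encoder: read the Korean text and keep only the final states (h, c),
# discarding the per-timestep outputs.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: initialised with the encoder states, trained to predict
# POS-tag character t+1 from character t (teacher forcing).
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
```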

Inference with the trained model:

Encode the Korean text sequence into the state vectors.

Start with a target sequence of size 1 (the start-of-sequence character).

Feed the state vectors and the current target sequence to the decoder to produce predictions for the next character.

Sample the next character from these predictions (apply argmax).

Append the sampled character to the target sequence.

Repeat until the end-of-sequence character is reached.
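A minimal sketch of this decoding loop, again following [3]. It assumes encoder_model and decoder_model are the separate inference-time models rebuilt from the trained layers, and that the token indices and sequence limits are the objects saved during training; the start and end markers ('\t' and '\n') are taken from the Keras example and may differ here.

```python
import numpy as np

def decode_sequence(input_seq, encoder_model, decoder_model,
                    target_token_index, reverse_target_index,
                    num_decoder_tokens, max_decoder_seq_length):
    # Encode the Korean text sequence into the two state vectors.
    states_value = encoder_model.predict(input_seq)

    # Target sequence of size 1, holding only the start-of-sequence character.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.0

    decoded = ''
    while True:
        # Feed the states and the current target sequence to the decoder.
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample the next character with argmax.
        sampled_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_char = reverse_target_index[sampled_index]

        # Stop at the end-of-sequence character or the length limit.
        if sampled_char == '\n' or len(decoded) > max_decoder_seq_length:
            break
        decoded += sampled_char

        # Append the sampled character and carry the updated states forward.
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_index] = 1.0
        states_value = [h, c]

    return decoded
```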

Prerequisites:

keras

numpy

pickle
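keras and numpy can be installed with pip (Keras also needs a backend such as TensorFlow); pickle ships with the Python standard library:

```
pip install keras numpy
```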

For the training script:

python train.py --train_file train.txt

For the testing script:

python test.py --input_file test.txt --output_file result.txt

Files:

LSTM_KR_PoS.py: character embedding model for Korean part-of-speech tagging

train.py : Training script

test.py: Testing script

For testing with an already trained model, put all of the saved model files in the same directory as test.py (a loading sketch follows the file list below).

Model File List:

decoder_model_pos_kr_v10000.json

decoder_model_weights_pos_kr_v10000.h5

encoder_input_data_v.data

encoder_model_pos_kr_v10000.json

encoder_model_weights_pos_kr_v10000.h5

input_texts_v.data

input_token_index_v.data

max_decoder_seq_length_v.data

num_decoder_tokens_v.data

s2s_pos_kr_v10000.h5

target_token_index_v.data
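A rough sketch of how these artifacts can be restored before inference. The file names match the list above; the .data files are assumed to be pickled Python objects saved at training time, and the actual loading code in test.py may differ.

```python
import pickle
from keras.models import model_from_json

def load_keras_model(json_path, weights_path):
    # Rebuild the architecture from JSON, then load the trained weights.
    with open(json_path) as f:
        model = model_from_json(f.read())
    model.load_weights(weights_path)
    return model

encoder_model = load_keras_model('encoder_model_pos_kr_v10000.json',
                                 'encoder_model_weights_pos_kr_v10000.h5')
decoder_model = load_keras_model('decoder_model_pos_kr_v10000.json',
                                 'decoder_model_weights_pos_kr_v10000.h5')

def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)

input_token_index = load_pickle('input_token_index_v.data')
target_token_index = load_pickle('target_token_index_v.data')
num_decoder_tokens = load_pickle('num_decoder_tokens_v.data')
max_decoder_seq_length = load_pickle('max_decoder_seq_length_v.data')
```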

Train with valid.txt:

(Due to memory constraints on my computer, the current saved model was trained on only the first 10,000 data points from valid.txt.)

Train with train.txt:

"Train_Save_Model_10000" folder contain all learned model using train.txt with a first 10000 data point.

Model file names:

decoder_model_pos_kr_10000.json

decoder_model_weights_pos_kr_10000.h5

encoder_input_data.data

encoder_model_pos_kr_10000.json

encoder_model_weights_pos_kr_10000.h5

input_texts.data

input_token_index.data

max_decoder_seq_length.data

num_decoder_tokens.data

s2s_pos_kr_10000.h5

target_token_index.data