Investigating the self-attention network for Chinese word segmentation.
Models and results are described in our paper, Investigating Self-Attention Network for Chinese Word Segmentation.
Python: 3.6.2
PyTorch: 1.0.1
CoNLL format (the BMES tag scheme is preferred), with one character and its label per line. Sentences are separated by a blank line.
中 B-SEG
国 E-SEG
最 B-SEG
大 E-SEG
氨 B-SEG
纶 M-SEG
丝 E-SEG
生 B-SEG
产 E-SEG
基 B-SEG
地 E-SEG
在 S-SEG
连 B-SEG
云 M-SEG
港 E-SEG
建 B-SEG
成 E-SEG
新 B-SEG
华 M-SEG
社 E-SEG
北 B-SEG
京 E-SEG
十 B-SEG
二 M-SEG
月 E-SEG
二 B-SEG
十 M-SEG
六 M-SEG
日 E-SEG
电 S-SEG
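The example above follows the BMES scheme: B-/M-/E- mark the beginning, middle, and end characters of a multi-character word, and S- marks a single-character word. A minimal Python sketch of converting a whitespace-segmented sentence into this format (the function names here are illustrative, not part of the repo):

```python
def to_bmes(sentence):
    """Convert a whitespace-segmented sentence into (char, tag) pairs
    using the BMES scheme: B-/M-/E- for multi-char words, S- for singletons."""
    pairs = []
    for word in sentence.split():
        if len(word) == 1:
            pairs.append((word, "S-SEG"))
        else:
            pairs.append((word[0], "B-SEG"))
            for ch in word[1:-1]:
                pairs.append((ch, "M-SEG"))
            pairs.append((word[-1], "E-SEG"))
    return pairs

def to_conll(sentences):
    """Render segmented sentences as CoNLL-style text:
    one 'char tag' pair per line, a blank line between sentences."""
    blocks = ["\n".join(f"{c} {t}" for c, t in to_bmes(s)) for s in sentences]
    return "\n\n".join(blocks) + "\n"
```

For example, `to_conll(["中国 最大"])` produces the first four lines of the sample above.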
- Character embeddings: gigaword_chn.all.a2b.uni.ite50.vec
- Character bigram embeddings: gigaword_chn.all.a2b.bi.ite50.vec, in the same folder as the character embeddings.
- Download the character embeddings and character bigram embeddings, and set their paths in main.py.
- Modify run_seg.sh to point to your train/dev/test files, then run: sh run_seg.sh
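The pretrained .vec files are plain text; assuming they follow the common word2vec text layout (one token per line followed by its vector values, which is an assumption about these particular files), they can be loaded with a sketch like:

```python
def load_embeddings(path):
    """Load a plain-text .vec file into a dict of token -> vector (list of
    floats), assuming the word2vec text layout: 'token v1 v2 ... vN' per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 2:
                continue  # skip a blank line or a possible "count dim" header
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors
```

This is only a sketch for inspecting the files; main.py handles the actual loading.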
Cite our paper as:
@article{gan2019investigating,
  title={Investigating Self-Attention Network for Chinese Word Segmentation},
  author={Gan, Leilei and Zhang, Yue},
  journal={arXiv preprint arXiv:1907.11512},
  year={2019}
}