SEFR CUT (Stacked Ensemble Filter and Refine for Word Segmentation)

Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP 2020)
CRF as Stacked Model and DeepCut as Baseline model

Citation

@inproceedings{limkonchotiwat-etal-2020-domain,
    title = "Domain Adaptation of {T}hai Word Segmentation Models using Stacked Ensemble",
    author = "Limkonchotiwat, Peerat  and
      Phatthiyaphaibun, Wannaphong  and
      Sarwar, Raheem  and
      Chuangsuwanich, Ekapol  and
      Nutanong, Sarana",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.315",
}

Install

pip install sefr_cut

How to use

Requirements

python >= 3.6
python-crfsuite >= 0.9.7
pyahocorasick == 1.4.0

Example

Example files are in the SEFR Example notebook.
Try it on Colab

Load Engine & Engine Mode

ws1000, tnhc, and best
- ws1000: the model trained on Wisesight-1000 and tested on Wisesight-160
- tnhc: the model trained on TNHC (80:20 train/test split with random seed 42)
- best: the model trained on the BEST-2010 corpus (NECTEC)
```
sefr_cut.load_model(engine='ws1000')
# OR
sefr_cut.load_model(engine='tnhc')
# OR
sefr_cut.load_model(engine='best')
```
tl-deepcut-XXXX
- We also provide deepcut models adapted by transfer learning to Wisesight (tl-deepcut-ws1000) and TNHC (tl-deepcut-tnhc)
```
sefr_cut.load_model(engine='tl-deepcut-ws1000')
# OR
sefr_cut.load_model(engine='tl-deepcut-tnhc')
```
deepcut
- We also provide the original deepcut
```
sefr_cut.load_model(engine='deepcut')
```
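
The engine names listed above can be kept in one place so that a typo fails early instead of inside the loader. The sketch below is only an illustration: `BUILTIN_ENGINES` and `load_builtin_engine` are hypothetical names that are not part of sefr_cut, and the check would need to be relaxed for a custom re-trained model.

```python
# Hypothetical helper; only sefr_cut.load_model(engine=...) comes from this README.
BUILTIN_ENGINES = (
    "ws1000", "tnhc", "best",                # stacked-ensemble (SEFR) models
    "tl-deepcut-ws1000", "tl-deepcut-tnhc",  # transfer-learned deepcut
    "deepcut",                               # original deepcut
)

def load_builtin_engine(name):
    """Validate an engine name, then load it with sefr_cut."""
    if name not in BUILTIN_ENGINES:
        raise ValueError(f"unknown engine {name!r}; expected one of {BUILTIN_ENGINES}")
    import sefr_cut  # imported lazily so the name check works without the package
    sefr_cut.load_model(engine=name)
    return name
```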

Segment Example

The k value sets the percentage of characters that are refined. See the paper for the motivation behind it.

Tokenize with default k-value

import sefr_cut

sefr_cut.load_model(engine='ws1000')
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ']))
print(sefr_cut.tokenize(['สวัสดีประเทศไทย']))
print(sefr_cut.tokenize('สวัสดีประเทศไทย'))

[['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
[['สวัสดี', 'ประเทศ', 'ไทย']]
[['สวัสดี', 'ประเทศ', 'ไทย']]
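
The Evaluation section of this README works on `|`-delimited strings, while `tokenize` returns lists of tokens. A minimal sketch of the conversion, using a hypothetical helper name (`to_pipe_format`) and the output shown above as input:

```python
def to_pipe_format(tokenized):
    """Join each list of tokens into one '|'-delimited string."""
    return ["|".join(tokens) for tokens in tokenized]

# Token lists copied from the tokenize() output above.
segmented = [['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
print(to_pipe_format(segmented))
# ['สวัสดี|ประเทศ|ไทย', 'ลุง|ตู่|สู้|ๆ']
```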

Tokenize with various k-values

sefr_cut.load_model(engine='ws1000')
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=5)) # refine only 5% of the characters
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=100)) # refine 100% of the characters

[['สวัสดี', 'ประเทศไทย'], ['ลุงตู่', 'สู้', 'ๆ']]
[['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
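
As a rough intuition for `k`, the sketch below assumes that filter-and-refine works along these lines: a baseline model gives each character a word-boundary probability, the k% of characters the baseline is least certain about are selected, and only those are re-labelled by a second model. Every name here (`binary_entropy`, `filter_and_refine`, the toy probabilities) is made up for illustration. This is not the library's implementation, and the paper remains the authoritative description.

```python
import math

def binary_entropy(p):
    """Entropy of a boundary probability; highest when p is near 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def filter_and_refine(base_probs, refined_labels, k):
    """Keep baseline labels, except for the k% most uncertain characters."""
    labels = [1 if p >= 0.5 else 0 for p in base_probs]
    n_refine = int(len(base_probs) * k / 100)
    # Filter: rank character positions by baseline uncertainty.
    ranked = sorted(range(len(base_probs)),
                    key=lambda i: binary_entropy(base_probs[i]),
                    reverse=True)
    # Refine: overwrite only the selected positions.
    for i in ranked[:n_refine]:
        labels[i] = refined_labels[i]
    return labels

# Toy data: the baseline is unsure only about position 3 (p = 0.55).
base_probs = [0.99, 0.02, 0.01, 0.55, 0.03]
refined_labels = [1, 0, 0, 0, 1]
print(filter_and_refine(base_probs, refined_labels, k=20))   # [1, 0, 0, 0, 0]
print(filter_and_refine(base_probs, refined_labels, k=100))  # [1, 0, 0, 0, 1]
```

With a small `k` most baseline decisions survive; with `k=100` every position takes the second model's label.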

Evaluation

We also provide character-level and word-level evaluation through the evaluation() function.

For example:

answer = 'สวัสดี|ประเทศไทย'
pred = 'สวัสดี|ประเทศ|ไทย'
char_score,word_score = sefr_cut.evaluation(answer,pred)
print(f'Word Score: {word_score} Char Score: {char_score}')

Word Score: 0.4 Char Score: 0.8

answer = ['สวัสดี|ประเทศไทย']
pred = ['สวัสดี|ประเทศ|ไทย']
char_score,word_score = sefr_cut.evaluation(answer,pred)
print(f'Word Score: {word_score} Char Score: {char_score}')

Word Score: 0.4 Char Score: 0.8


answer = [['สวัสดี|'],['ประเทศไทย']]
pred = [['สวัสดี|'],['ประเทศ|ไทย']]
char_score,word_score = sefr_cut.evaluation(answer,pred)
print(f'Word Score: {word_score} Char Score: {char_score}')

Word Score: 0.4 Char Score: 0.8
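
The scores above are consistent with F1 computed at two granularities: over word-start positions for the character score, and over exact word spans for the word score. The sketch below reproduces 0.8 and 0.4 for the first example under that assumption. `f1`, `word_spans`, and `char_and_word_f1` are illustrative names, and the library's own `evaluation()` may be implemented differently.

```python
def f1(n_correct, n_pred, n_true):
    """Harmonic mean of precision and recall."""
    if n_correct == 0:
        return 0.0
    precision, recall = n_correct / n_pred, n_correct / n_true
    return 2 * precision * recall / (precision + recall)

def word_spans(segmented):
    """Turn 'ab|cd' into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in segmented.split("|"):
        if word:  # skip empty pieces, e.g. from a trailing '|'
            spans.add((start, start + len(word)))
            start += len(word)
    return spans

def char_and_word_f1(answer, pred):
    true_spans, pred_spans = word_spans(answer), word_spans(pred)
    # Character level: compare the positions where a word starts.
    true_starts = {s for s, _ in true_spans}
    pred_starts = {s for s, _ in pred_spans}
    char_score = f1(len(true_starts & pred_starts), len(pred_starts), len(true_starts))
    # Word level: a word counts only if both of its ends match.
    word_score = f1(len(true_spans & pred_spans), len(pred_spans), len(true_spans))
    return char_score, word_score

char_score, word_score = char_and_word_f1('สวัสดี|ประเทศไทย', 'สวัสดี|ประเทศ|ไทย')
print(round(word_score, 2), round(char_score, 2))  # 0.4 0.8
```

The predicted split finds both true word starts plus one extra (precision 2/3, recall 1), but only one of its three words matches one of the two true words (precision 1/3, recall 1/2).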

Performance

How to re-train the model?

You can re-train the model yourself. The examples are in the Notebooks folder, which contains everything you need.
Re-train Model
- Run notebook file #2. The corpus inside 'Notebooks/corpus/' is Wisesight-1000; you can also try BEST, TNHC, and LST20.
- Set the variable CRF_model_name to the name of your model.
- Link:HERE
Filter and Refine Example
- Set the variable CRF_model_name to the same value as in notebook file #2.
- To see why we use filter-and-refine, uncomment these 3 lines in the score_() function:
```
#answer = scoring_function(y_true,cp.deepcopy(y_pred),entropy_index_og)
#f1_hypothesis.append(eval_function(y_true,answer))
#ax.plot(range(start,K_num,step),f1_hypothesis,c="r",marker='o',label='Best case')
```
- Link:HERE
Use your trained model?
- Move your model from 'Notebooks/model/' to 'sefr_cut/model/', then load it in one line.
```
sefr_cut.load_model(engine='my_model')
```
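
The move step can also be scripted. The sketch below assumes the trained model is saved as one or more files whose names start with the model name; the file layout is not documented here, so treat `install_model` and its glob pattern as hypothetical. For a pip-installed package, the destination would be the `model` folder inside the installed sefr_cut directory rather than the repository path.

```python
import shutil
from pathlib import Path

def install_model(model_name, src_dir="Notebooks/model", dst_dir="sefr_cut/model"):
    """Copy files whose names start with model_name from src_dir to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob(f"{model_name}*")):
        if path.is_file():
            shutil.copy2(path, dst / path.name)  # copy keeps the original in place
            copied.append(path.name)
    if not copied:
        raise FileNotFoundError(f"no files starting with {model_name!r} in {src}")
    return copied
```

After copying, the model should be loadable by name with `load_model(engine=...)` as shown above.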

Thanks for the code from

Deepcut (Baseline Model): we used some code from Deepcut to perform transfer learning.
@bact (CRF training code): we used some code from https://github.com/bact/nlp-thai
