CRF-Cut: Sentence Segmentation

The objective of CRF-Cut (Conditional Random Fields - Cut) is to segment text into sentences so that these sentences can be used in downstream tasks.

Training works by taking a corpus of sentences, tokenizing each sentence into words, and assigning each word a label: I (inside of sentence) or E (end of sentence).
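
A minimal sketch of this labelling step, assuming a corpus that is already split into sentences and using pythainlp's `word_tokenize`; the helper name `label_sentences` is illustrative, not part of CRF-Cut's API:

```python
from pythainlp.tokenize import word_tokenize

def label_sentences(sentences):
    """Tokenize each sentence and tag every token:
    I = inside of sentence, E = end of sentence."""
    tokens, labels = [], []
    for sent in sentences:
        words = word_tokenize(sent)
        if not words:
            continue
        tokens.extend(words)
        # every token is "I" except the last one, which closes the sentence
        labels.extend(["I"] * (len(words) - 1) + ["E"])
    return tokens, labels
```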

The results of CRF-Cut trained and validated on different datasets are as follows:

| dataset-train | dataset-validate | I-precision | I-recall | I-fscore | E-precision | E-recall | E-fscore | space-correct |
|---|---|---|---|---|---|---|---|---|
| Ted | Ted | 0.99 | 0.99 | 0.99 | 0.74 | 0.70 | 0.72 | 0.82 |
| Ted | Orchid | 0.95 | 0.99 | 0.97 | 0.73 | 0.24 | 0.36 | 0.73 |
| Ted | Fake review | 0.98 | 0.99 | 0.98 | 0.86 | 0.70 | 0.77 | 0.78 |
| Orchid | Ted | 0.98 | 0.98 | 0.98 | 0.56 | 0.59 | 0.58 | 0.71 |
| Orchid | Orchid | 0.98 | 0.99 | 0.99 | 0.85 | 0.71 | 0.77 | 0.87 |
| Orchid | Fake review | 0.97 | 0.99 | 0.98 | 0.77 | 0.63 | 0.69 | 0.70 |
| Fake review | Ted | 0.99 | 0.95 | 0.97 | 0.42 | 0.85 | 0.56 | 0.56 |
| Fake review | Orchid | 0.97 | 0.96 | 0.96 | 0.48 | 0.59 | 0.53 | 0.67 |
| Fake review | Fake review | 1 | 1 | 1 | 0.98 | 0.96 | 0.97 | 0.97 |
| Ted + Orchid + Fake review | Ted | 0.99 | 0.98 | 0.99 | 0.66 | 0.77 | 0.71 | 0.78 |
| Ted + Orchid + Fake review | Orchid | 0.98 | 0.98 | 0.98 | 0.73 | 0.66 | 0.69 | 0.82 |
| Ted + Orchid + Fake review | Fake review | 1 | 1 | 1 | 0.98 | 0.95 | 0.96 | 0.96 |

Google colab:

Sentence Breaking Journal

What doesn't work

  • POS-perceptron
  • Larger features than window = 2, max_n_gram = 3
  • Number of verbs to the left and right
  • Rule-based override
  • L2 regularization - also not practical
  • POS-artagger - not really; too slow
  • ORCHID - different domains get totally different results

What to try

  • TNC

What worked

  • Fake "convolutions" of window = 2, max_n_gram = 3 (see the feature sketch after this list)
  • L1 regularization of 1
  • Predict end of sentence (space) instead of beginning of sentence
  • Custom POS - only faster convergence
  • Try with ORCHID to compare performance more fairly - 87% vs 95% SOTA
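
The "fake convolutions" item above could work roughly as follows: for each token, emit one feature per n-gram (n ≤ max_n_gram) that fits entirely inside a ±window neighbourhood of that token. This is only a sketch of the idea; the feature names and exact window handling in CRF-Cut may differ.

```python
def token_features(tokens, i, window=2, max_n_gram=3):
    """Features for token i: all n-grams (n <= max_n_gram) fully contained
    in the window [i - window, i + window], keyed by offset and size."""
    feats = {"bias": 1.0}
    lo, hi = i - window, i + window
    for start in range(max(lo, 0), hi + 1):
        for n in range(1, max_n_gram + 1):
            end = start + n
            if end > len(tokens) or end - 1 > hi:
                continue
            gram = "|".join(tokens[start:end])
            feats[f"{start - i}:{n}gram={gram}"] = 1.0
    return feats

def sentence_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]
```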

Requirements

  • pythainlp
  • python-crfsuite
  • pandas
  • numpy
  • scikit-learn
  • tqdm
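
To show how these requirements fit together, here is a hedged end-to-end sketch using python-crfsuite's documented `Trainer`/`Tagger` API. The helpers `label_sentences` and `sentence_features` are the illustrative ones from the sketches above, and the model file name is made up; this is not CRF-Cut's actual training script.

```python
import pycrfsuite
from pythainlp.tokenize import word_tokenize

def train(corpus_sentences, model_path="sent_seg.crfsuite"):
    """Train an I/E tagger from a list of gold sentences."""
    tokens, labels = label_sentences(corpus_sentences)
    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.append(sentence_features(tokens), labels)
    # L1 regularization of 1, as noted under "What worked"
    trainer.set_params({"c1": 1.0, "c2": 0.0, "max_iterations": 100})
    trainer.train(model_path)

def cut(text, model_path="sent_seg.crfsuite"):
    """Cut raw text into sentences with a trained model."""
    tokens = word_tokenize(text)
    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    tags = tagger.tag(sentence_features(tokens))
    sentences, current = [], []
    for tok, tag in zip(tokens, tags):
        current.append(tok)
        if tag == "E":
            sentences.append("".join(current))
            current = []
    if current:
        sentences.append("".join(current))
    return sentences
```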