Tokenizer

This repository includes some demo tokenizers (especially for Chinese).

Note: this project draws on jieba, both for its code structure and for its data files.

Methods:

  • method1 (DONE): Maximum Matching (a minimal sketch follows this list)
  • method2 (DONE): UniGram
  • method3 (DONE): HMM
  • method4 (TODO): CRF

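Below is a minimal sketch of the forward Maximum Matching idea, not the repository's actual module code: scan the sentence left to right, greedily take the longest dictionary word starting at the current position, and fall back to a single character when nothing matches. The names `max_match`, `word_dict`, and `max_len` are introduced here purely for illustration.

```python
def max_match(sentence, word_dict, max_len=5):
    """Segment `sentence` greedily with forward maximum matching."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking until a dictionary hit
        # (or a single-character fallback when no word matches).
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in word_dict or j == i + 1:
                tokens.append(sentence[i:j])
                i = j
                break
    return tokens

# Example usage with a toy dictionary.
print(max_match("研究生命的起源", {"研究", "研究生", "生命", "起源", "的"}))
# -> ['研究生', '命', '的', '起源']  (a classic failure case of greedy matching)
```
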
Catalog Description

+ datas
    + data1
        - dict.txt # Synchronized with `jieba/dict.txt`
    + data2
        - prob_emit.py  # Synchronized with `jieba/finalseg/prob_emit.py`
        - prob_start.py # Synchronized with `jieba/finalseg/prob_start.py`
        - prob_trans.py # Synchronized with `jieba/finalseg/prob_trans.py`
+ modules
    - module1
    - module2 # Referring to `jieba/__init__.py`
    - module3 # Referring to `jieba/finalseg/__init__.py`
Supplement: for an explanation of how the data files were obtained, see the relevant jieba issue. A short sketch of how the data2 tables are used follows below.
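
The files under `datas/data2` are the standard HMM parameter tables from jieba's finalseg: initial-state log probabilities (`prob_start.py`), state-transition log probabilities (`prob_trans.py`), and per-character emission log probabilities (`prob_emit.py`) over the B/M/E/S tagging scheme. As a rough, simplified sketch (not the repository's module3 code, and omitting jieba's restriction on which previous states are allowed), segmentation with these tables amounts to Viterbi decoding followed by cutting words at E/S tags; all names below are illustrative.

```python
MIN_LOG = -3.14e100  # very small log probability standing in for "impossible"

def viterbi(text, states, start_p, trans_p, emit_p):
    V = [{}]     # V[t][s] = best log probability of a path ending in state s at position t
    path = {}
    for s in states:
        V[0][s] = start_p[s] + emit_p[s].get(text[0], MIN_LOG)
        path[s] = [s]
    for t in range(1, len(text)):
        V.append({})
        new_path = {}
        for s in states:
            emit = emit_p[s].get(text[t], MIN_LOG)
            prob, prev = max(
                (V[t - 1][p] + trans_p[p].get(s, MIN_LOG) + emit, p)
                for p in states
            )
            V[t][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best, last = max((V[-1][s], s) for s in states)
    return path[last]

def hmm_cut(text, start_p, trans_p, emit_p):
    tags = viterbi(text, "BMES", start_p, trans_p, emit_p)
    words, begin = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":              # a word ends on an E (end) or S (single) tag
            words.append(text[begin:i + 1])
            begin = i + 1
    return words
```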

Start

PYTHONIOENCODING=utf-8 PYTHONPATH=. python main.py
