u8621011/pyVitk

pyVitk

A Python version of the Vietnamese Text Processing Toolkit (vn.vitk).

API

  • tokenizeLine: tokenizes a line of Vietnamese text into tokens.
    This Vietnamese tokenizer is ported from vn.vitk by Lê Hồng Phương.
    The original vn.vitk project is here.

Usage

from pyVitk import Tokenizer

t = Tokenizer()
sentence = "bài viết chọn lọc alt hình ảnh chọn lọc"
tokens = t.tokenizeLine(sentence, concat=True)

print("tokenize result: {}".format(str(tokens)))

# Serialize the tokenizer's lexicon to an XML file.
t.to_lexicon_xml_file('xml_filename_to_serialize_lexicons')
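
To tokenize several lines at once, the tokenizeLine call from the usage above can be wrapped in a small helper. The following is a minimal sketch, not part of pyVitk: tokenize_lines and WhitespaceTokenizer are hypothetical names, and the stand-in class only splits on whitespace so that the snippet runs without pyVitk installed. The only pyVitk call assumed is tokenizeLine(line, concat=True), exactly as shown above.

```python
from typing import Any, Iterable, Iterator, Tuple


class WhitespaceTokenizer:
    """Hypothetical stand-in exposing the same tokenizeLine call as above.

    It only splits on whitespace; it does NOT perform Vietnamese
    word segmentation the way the real tokenizer is meant to.
    """

    def tokenizeLine(self, line: str, concat: bool = False) -> list:
        return line.split()


def tokenize_lines(tokenizer: Any, lines: Iterable[str]) -> Iterator[Tuple[str, Any]]:
    """Yield (line, tokens) for every non-blank input line."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines instead of tokenizing them
        yield line, tokenizer.tokenizeLine(line, concat=True)


# Demo with the stand-in tokenizer (made-up input lines).
demo_lines = ["bài viết chọn lọc", "", "  hình ảnh chọn lọc  "]
demo_result = list(tokenize_lines(WhitespaceTokenizer(), demo_lines))
for line, tokens in demo_result:
    print("{} -> {}".format(line, tokens))
```

With pyVitk installed, passing Tokenizer() in place of WhitespaceTokenizer() should work the same way, since the helper only relies on the tokenizeLine method.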

  • crawler: sample crawlers for online dictionary lookups

Usage

from pyVitk import crawler
import json

# Only zh-TW to vi-VN is supported currently. Returns a DictionaryLexicon structure.
results = crawler.parse_vdict('zh-TW', 'vi-VN', '中文')
results_y2k = crawler.parse_vny2k('中文')

print(json.dumps(results.__dict__))
print(json.dumps(results_y2k.__dict__))
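
Dumping __dict__ directly works only when every attribute is itself JSON-serializable, and by default json.dumps escapes non-ASCII text such as Chinese or Vietnamese. The sketch below shows one way to handle both cases with the standard json module. FakeLexicon and FakeMeaning are hypothetical stand-ins with made-up fields and values; the real DictionaryLexicon layout is not documented here and may differ.

```python
import json


class FakeMeaning:
    """Hypothetical nested object; not a pyVitk class."""

    def __init__(self, text):
        self.text = text


class FakeLexicon:
    """Hypothetical stand-in for a crawler result; fields are assumptions."""

    def __init__(self, word, meanings):
        self.word = word
        self.meanings = meanings


def to_json(obj) -> str:
    """Serialize an object tree to JSON.

    default=vars converts any object json cannot handle into its __dict__,
    and ensure_ascii=False keeps non-ASCII characters readable.
    """
    return json.dumps(obj, default=vars, ensure_ascii=False, indent=2)


# Made-up demo data, only to exercise the helper.
sample = FakeLexicon("中文", [FakeMeaning("tiếng Trung")])
demo_json = to_json(sample)
print(demo_json)
```

If the crawler results contain only plain strings, numbers, lists, and dicts, the original json.dumps(results.__dict__) call is sufficient; the default=vars fallback matters only for nested objects.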
