VieTokenizer

The model is a simple Bi-LSTM network trained on a large pre-segmented dataset. For each syllable it predicts a binary label: 1 if the syllable continues the word begun by the preceding syllable, and 0 otherwise. For example, "Tôi tên là Nguyễn Tiến Đạt" maps to the label sequence [0, 0, 0, 0, 1, 1], where the two 1s attach "Tiến" and "Đạt" to "Nguyễn".
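The labeling scheme can be illustrated with a small sketch. This is not VieTokenizer's internal code: the helper names `labels_to_text` and `text_to_labels` are hypothetical, and the sketch only assumes the convention described above (1 joins a syllable to the previous one, 0 starts a new word).

```python
from typing import List


def labels_to_text(syllables: List[str], labels: List[int]) -> str:
    """Join syllables into words; label 1 attaches a syllable to the previous one."""
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")
    words: List[str] = []
    for syllable, label in zip(syllables, labels):
        if label == 1 and words:
            # Continue the current word, using "_" as the joiner.
            words[-1] += "_" + syllable
        else:
            # Label 0 (or a leading 1 with nothing to attach to) starts a new word.
            words.append(syllable)
    return " ".join(words)


def text_to_labels(segmented: str) -> List[int]:
    """Recover the 0/1 labels from text whose multi-syllable words use "_"."""
    labels: List[int] = []
    for word in segmented.split():
        parts = word.split("_")
        labels.append(0)                       # first syllable of each word
        labels.extend([1] * (len(parts) - 1))  # remaining syllables continue it
    return labels


syllables = "Tôi tên là Nguyễn Tiến Đạt".split()
print(labels_to_text(syllables, [0, 0, 0, 0, 1, 1]))
print(text_to_labels("Tôi tên là Nguyễn_Tiến_Đạt"))
```

Under this convention, a pre-segmented corpus can be turned into label sequences with something like `text_to_labels`, and predicted labels can be turned back into segmented text with something like `labels_to_text`.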

Installation 🎉

This repository is tested on Python 3.7+ and TensorFlow 2.8+.
VieTokenizer can be installed using pip as follows:

pip install vietokenizer

VieTokenizer can also be installed from source with the following commands:

git clone https://github.com/Nguyendat-bit/VieTokenizer
cd VieTokenizer
pip install -e .

Usage 🔥

>>> import vietokenizer
>>> tokenizer = vietokenizer.vntokenizer()
>>> tokenizer('Tôi tên là Nguyễn Tiến Đạt, hiện là sinh viên Đại học CN GTVT tại Hà Nội.')
'Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội .'
>>> tokenizer('Kim loại nặng thường được định nghĩa là kim loại có khối lượng riêng, khối lượng nguyên tử hoặc số hiệu nguyên tử lớn.')
'Kim_loại nặng thường được định_nghĩa là kim_loại có khối_lượng riêng , khối_lượng nguyên_tử hoặc số_hiệu nguyên_tử lớn .'
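Since the tokenizer returns a single string, downstream code often wants a list of words instead. The sketch below shows one possible way to post-process the output; `segmented_to_words` is a hypothetical helper, not part of the vietokenizer package, and it assumes only the output format shown above (words separated by spaces, syllables within a word joined by underscores).

```python
from typing import List


def segmented_to_words(segmented: str) -> List[str]:
    """Split tokenizer output on whitespace and restore spaces inside each word."""
    return [token.replace("_", " ") for token in segmented.split()]


# Input copied from the first usage example above.
output = "Tôi tên là Nguyễn_Tiến_Đạt , hiện là sinh_viên Đại_học CN GTVT tại Hà_Nội ."
for word in segmented_to_words(output):
    print(word)
```

Punctuation marks come back as separate tokens because the output separates them with spaces. Note that a literal underscore in the input text would also be turned into a space by this helper.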
