Tokenizer

This tool provides several methods for tokenizing and detokenizing a sentence, covering the tokenizers listed in the requirements below: SentencePiece, Moses (via Sacremoses), BPE (via subword-nmt), Mecab and the Stanford Word Segmenter.

Requirements

SentencePiece, Sacremoses and BPE

pip install sentencepiece sacremoses subword-nmt
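
These packages also expose command-line entry points and Python APIs of their own. As a rough sketch of what they do (the file names input.txt, input.tok, bpe.codes and the vocabulary sizes are placeholders, not part of this repository):

# Moses-style tokenization with the sacremoses CLI (hypothetical file names):
sacremoses -l en tokenize < input.txt > input.tok
# Learn 10k BPE merge operations and apply them with subword-nmt:
subword-nmt learn-bpe -s 10000 < input.tok > bpe.codes
subword-nmt apply-bpe -c bpe.codes < input.tok > input.bpe
# SentencePiece is usually driven through its Python API, e.g. training a model:
python -c "import sentencepiece as spm; spm.SentencePieceTrainer.train(input='input.txt', model_prefix='spm', vocab_size=8000)"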

Mecab

mkdir -p tokenizers
git clone https://github.com/midobal/mecab.git
mkdir tokenizers/mecab
export PTH=$(pwd)
cd mecab/mecab
./configure --prefix="$PTH"/tokenizers/mecab --with-charset=utf8
make install
cd ../mecab-ipadic
./configure --with-mecab-config=../mecab/mecab-config --prefix="$PTH"/tokenizers/mecab --with-charset=utf8
make install
cd "$PTH"
rm -rf mecab
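
Once installed, the freshly built binary can be smoke-tested from the shell. The -Owakati flag prints the input as space-separated tokens; the example sentence below is arbitrary:

echo "すもももももももものうち" | tokenizers/mecab/bin/mecab -Owakati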

Stanford Word Segmenter

mkdir -p tokenizers
wget https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip
unzip stanford-segmenter-2018-10-16.zip
mv stanford-segmenter-2018-10-16 tokenizers/stanford_segmenter
rm stanford-segmenter-2018-10-16.zip
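
The segmenter ships with a segment.sh driver script and requires Java. As a sketch with a hypothetical UTF-8 input file input.zh, the CTB (Penn Chinese Treebank) model can be run like this; the trailing 0 requests only the single best segmentation:

tokenizers/stanford_segmenter/segment.sh ctb input.zh UTF-8 0 > input.zh.seg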
