Releases: mingruimingrui/fast-mosestokenizer
Version 0.0.8.2
Bug fixes
- Fixed underflow error in detokenization
- Fixed underflow error in trim function
Version 0.0.8.1
Changes
- other_letters option exposed in the Python API.
Version 0.0.8
Changes
- Segmentation by \p{So} is not automatically enabled.
- The performance of \p{So} segmentation is drastically improved.
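To illustrate what segmentation by \p{So} means, here is a minimal Python sketch that isolates every character whose Unicode general category is So ("Symbol, other", which covers emoji and signs such as ©). It is not the library's implementation: the function name split_on_other_symbols is hypothetical, and the whitespace handling is a simplifying assumption.

```python
import unicodedata


def split_on_other_symbols(text):
    """Split on whitespace and emit each \\p{So} character as its own token.

    Hypothetical helper for illustration only; not part of fast-mosestokenizer.
    """
    tokens = []
    current = []

    def flush():
        # Emit the pending run of ordinary characters, if any.
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if unicodedata.category(ch) == "So":
            flush()
            tokens.append(ch)  # the symbol always stands alone
        elif ch.isspace():
            flush()
        else:
            current.append(ch)
    flush()
    return tokens
```

Under these assumptions, calling split_on_other_symbols("hello©world") separates the copyright sign from the words on either side.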
Version 0.0.7.2
Hotfix
Fixed regex.
Version 0.0.7.1
Hotfix
Hotfix for other_letters, since they might contain nonspacing marks.
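The release note does not say how the hotfix works. As a rough illustration of the underlying issue, the Python sketch below shows why nonspacing marks matter: in scripts such as Thai, a base letter (category Lo) can be followed by a combining mark (category Mn), and splitting character by character would detach the mark from its base. The function name group_letters_with_marks is hypothetical, and the grouping rule is an assumption, not the library's logic.

```python
import unicodedata


def group_letters_with_marks(text):
    """Split text into single characters, keeping nonspacing marks (Mn)
    attached to the character that precedes them.

    Hypothetical helper for illustration only.
    """
    groups = []
    for ch in text:
        if unicodedata.category(ch) == "Mn" and groups:
            groups[-1] += ch  # attach the mark to its base character
        else:
            groups.append(ch)
    return groups
```

For example, the Thai string "\u0e01\u0e34\u0e19" consists of a letter, a nonspacing vowel sign, and another letter; the sketch yields two groups rather than three.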
Version 0.0.6
Features
Improved tokenization rules for logographic languages
Version 0.0.5
Features
- Installation of the C++ library and command-line tools can finally be done using make install
- make build-cli has been changed to make build
Bug fixes
- Capture case where in_num_p is not switched off.
  - Before: "文字123汉语" -> ["文字", "123", "汉", "语"]
  - After: "文字123汉语" -> ["文字", "123", "汉语"]
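The Before/After pair above suggests a state flag that stayed set after a number ended. A minimal Python sketch of the intended behavior, assuming a simple scanner that groups digits and non-digits into runs, might look like this. The function split_number_runs and its in_num flag are hypothetical and only loosely mirror in_num_p; the real tokenizer applies many more rules.

```python
def split_number_runs(text):
    """Group consecutive digits and consecutive non-digits into tokens.

    Hypothetical helper for illustration only; whitespace is not handled.
    """
    tokens = []
    current = ""
    in_num = False  # hypothetical flag, loosely mirroring in_num_p
    for ch in text:
        is_digit = ch.isdigit()
        # A change between digit and non-digit closes the current token.
        if current and is_digit != in_num:
            tokens.append(current)
            current = ""
        current += ch
        # Update the flag on every character so it is switched off
        # as soon as the number ends.
        in_num = is_digit
    if current:
        tokens.append(current)
    return tokens
```

With the flag updated on every character, split_number_runs("文字123汉语") keeps "汉语" together, matching the After line.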
Todo
- Determine how characters belonging to the "other letters" category should be handled by the tokenizer.
- Reduce the number of flags.
  - Remove those out of the scope of this package, e.g. lowercase.
  - Or those that add unnecessary bloat to the logic, e.g. URL handling.
Version 0.0.4
- Fixed detokenization for "@-@"
- Now builds Linux images using base ubuntu:16.04
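For context on the "@-@" fix in this release: in Moses-style tokenization with aggressive hyphen splitting, a hyphen between words is written as the placeholder token "@-@", and detokenization is expected to restore the original hyphen. The Python sketch below shows that round trip in its simplest form. The function name detokenize_hyphens is hypothetical, and this is not the library's code.

```python
import re


def detokenize_hyphens(text):
    """Replace the Moses hyphen placeholder " @-@ " with a plain hyphen.

    Hypothetical helper for illustration only.
    """
    return re.sub(r" @-@ ", "-", text)
```

For example, detokenize_hyphens("state @-@ of @-@ the @-@ art") restores the hyphenated phrase.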
Version 0.0.3
Fix for GitHub workflow.
Version 0.0.2
- Build static libs locally
- Build Python packages with static lib