mingruimingrui / fast-mosestokenizer Public

Releases: mingruimingrui/fast-mosestokenizer

Version 0.0.8.2

29 Oct 11:15

Latest

Bugfixes

Fixed underflow error in detokenization
Fixed underflow error in trim function

Version 0.0.8.1

14 Aug 17:19

Pre-release

Changes

Exposed the other_letters option in the Python API.

Version 0.0.8

13 Aug 12:45

Pre-release

Changes

Segmentation by \p{So} is no longer enabled automatically.
The performance of \p{So} segmentation has been drastically improved.
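For readers unfamiliar with the notation, \p{So} is the Unicode general category "Symbol, other" (emoji, ©, and similar). The sketch below only illustrates what segmenting on that category could look like using Python's standard library; the function name split_on_other_symbols is hypothetical and this is not the library's implementation.

```python
import unicodedata


def split_on_other_symbols(text):
    """Put spaces around every character whose category is "So".

    Hypothetical helper for illustration only; it is not part of
    fast-mosestokenizer.
    """
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Isolate the symbol so it becomes its own token.
            pieces.append(" " + ch + " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced above.
    return " ".join("".join(pieces).split())


print(split_on_other_symbols("hello😀world"))
```

Characters in other symbol categories, such as "+" (Sm, math symbol), are left untouched by this sketch.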

Version 0.0.7.2

06 Aug 13:24

Pre-release

Hotfix

Fixed regex.

Version 0.0.7.1

06 Aug 13:19

Pre-release

Hotfix

Hotfix for other_letters handling, since these characters might be followed by nonspacing marks.
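To see why nonspacing marks matter here, consider a script whose letters fall in the Unicode category Lo ("Letter, other") and carry combining marks in category Mn ("Mark, nonspacing"). The following is a minimal sketch, assuming the goal is to keep each mark attached to the letter before it; group_with_marks is a hypothetical name and the logic is not taken from the library.

```python
import unicodedata


def group_with_marks(text):
    """Attach each nonspacing mark (Mn) to the preceding character.

    Hypothetical helper for illustration only.
    """
    groups = []
    for ch in text:
        if unicodedata.category(ch) == "Mn" and groups:
            # A nonspacing mark never stands alone; glue it to its base.
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# Thai: two Lo letters, the first carrying an Mn vowel mark.
print([unicodedata.category(c) for c in "กัน"])
print(group_with_marks("กัน"))
```

Splitting such text one code point at a time would separate the mark from its base letter, which is the kind of case the hotfix appears to address.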

Version 0.0.6

06 Aug 09:41

Pre-release

Features

Improved tokenization rules for logographic languages.

Version 0.0.5

01 Aug 10:07

Pre-release

Features

The C++ library and command-line tools can now be installed using make install.
make build-cli has been renamed to make build.

Bug fixes

Fixed a case where in_num_p was not switched off after a number.
- Before: "文字123汉语" -> ["文字", "123", "汉", "语"]
- After: "文字123汉语" -> ["文字", "123", "汉语"]
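The before/after pair above is consistent with a state flag that stays set once a number has been seen. The toy segmenter below reproduces that behaviour under this assumption; it is a simplified sketch, not the library's C++ code, and the names segment, in_num, and reset_flag are hypothetical.

```python
def segment(text, reset_flag=True):
    """Split text into runs of digits and runs of non-digits.

    reset_flag=False simulates the bug: the "inside a number" flag is
    never cleared, so every character after a number starts a new token.
    """
    tokens = []
    in_num = False  # loosely analogous to in_num_p in the release note
    for ch in text:
        if ch.isdigit():
            if not in_num or not tokens:
                tokens.append("")  # a number run begins
            in_num = True
        else:
            if in_num or not tokens:
                tokens.append("")  # leaving a number (or first token)
            if reset_flag:
                in_num = False  # the step the buggy path omits
        tokens[-1] += ch
    return tokens


print(segment("文字123汉语", reset_flag=False))  # buggy path
print(segment("文字123汉语"))                    # fixed path
```

With the flag left on, the two characters after "123" are emitted separately; clearing it lets them stay together as one token.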

Todo

Determine how characters belonging to the "other letters" category
should be handled by the tokenizer.
Reduce the number of flags.
- Remove those outside the scope of this package, e.g. lowercase.
- Remove those that add unnecessary bloat to the logic, e.g. URL handling.

Version 0.0.4

17 Jul 19:46

Pre-release

Fixed detokenization for "@-@"
Now builds Linux images using base ubuntu:16.04

Version 0.0.3

15 Jul 04:06

Pre-release

Fix for GitHub workflow.

Version 0.0.2

14 Jul 18:02

Pre-release

Build static libs locally.
Build Python packages with the static lib.
