Wordseg

Wicked fast word segmenter with a focus on splitting #hashtags.

Example

from wordseg import segment

segment('mannequinchallenge')
# => (['mannequin', 'challenge'], 5.996932418552515e-11)

More Info

Because the "training" data was harvested from social media websites, this word segmenter is especially good as a hashtag splitter. It's also about 10x faster than wordsegment.

The speed comes from an implementation of the Viterbi algorithm that I found posted on Stack Overflow. The built-in dictionary was built from about 6 GB of English-only social media posts. Tools for building your own dictionary are included in the bin folder.
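
For readers unfamiliar with the technique, here is a minimal sketch of Viterbi-style segmentation over a unigram dictionary. It is not the code shipped in wordseg: the function name `viterbi_segment`, the `unknown_char_prob` penalty, and the toy probabilities are all assumptions made for illustration.

```python
import math


def viterbi_segment(text, unigram_probs, max_word_len=20, unknown_char_prob=1e-12):
    """Return (words, probability) for the most probable split of `text`.

    Hypothetical helper: assumes `unigram_probs` maps word -> probability.
    Unknown substrings are penalised per character so that a split always exists.
    """
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            prob = unigram_probs.get(word)
            if prob is None:
                # Assumed fallback: a fixed penalty for every unknown character.
                word_score = len(word) * math.log(unknown_char_prob)
            else:
                word_score = math.log(prob)
            score = best[start][0] + word_score
            if score > best[end][0]:
                best[end] = (score, start)

    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words, math.exp(best[n][0])


# Toy dictionary with made-up probabilities, for demonstration only.
toy_probs = {"mannequin": 2e-6, "challenge": 3e-5, "man": 1e-4, "a": 1e-2}
words, prob = viterbi_segment("mannequinchallenge", toy_probs)
print(words)
```

Working in log space avoids underflow on long inputs, and the `max_word_len` cap keeps the inner loop bounded, so the running time grows linearly with the length of the input.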

Roadmap

- Improve the data set by including posts from a broader range of time and with more unique unigrams.
- Include common bigrams, or even trigrams, to make segmentation context-aware.
- Beef up the very minimal Viterbi implementation.
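
As a rough illustration of why bigrams would help, the sketch below scores candidate segmentations with a conditional probability P(word | previous word) when one is known and falls back to the unigram probability otherwise. Nothing here exists in wordseg today: `sequence_log_prob`, the naive fallback scheme, and every number in the toy tables are assumptions.

```python
import math


def sequence_log_prob(words, unigram_probs, bigram_probs, floor=1e-12):
    """Log-probability of a word sequence under a naive bigram-with-fallback model.

    Hypothetical helper: `bigram_probs` maps (previous, word) -> P(word | previous).
    """
    total = 0.0
    prev = "<s>"  # assumed start-of-sequence marker
    for word in words:
        prob = bigram_probs.get((prev, word))
        if prob is None:
            # No bigram evidence: fall back to the unigram probability, or a small floor.
            prob = unigram_probs.get(word, floor)
        total += math.log(prob)
        prev = word
    return total


# Made-up numbers chosen so that both splits tie under unigrams alone.
toy_unigrams = {"choose": 1e-4, "spain": 1e-5, "chooses": 1e-5, "pain": 1e-4}
toy_bigrams = {("choose", "spain"): 1e-2}

candidates = [["choose", "spain"], ["chooses", "pain"]]
best = max(candidates, key=lambda ws: sequence_log_prob(ws, toy_unigrams, toy_bigrams))
print(best)
```

With unigrams only, the two candidate splits of "choosespain" score identically in this toy table. The single bigram entry is what breaks the tie.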
