Tokenizing 一人 (ひとり, hitori) splits it into 一 (いち, ichi) and 人 (ひと, hito) #125

Open
andy840119 opened this issue Feb 28, 2018 · 1 comment

Comments

@andy840119

andy840119 commented Feb 28, 2018

This project is great and very useful to me : )
I have a small question, though.

I'm not sure whether this should be considered a bug, but some words such as 一人 (ひとり, hitori) are split into two tokens: 一 (いち, ichi) and 人 (ひと, hito).

@akkikiki
Contributor

Hi,

I recommend looking at the output of the "Viterbi" option available at https://www.atilika.com/en/kuromoji/ to see what is going on. It seems that IPADic (the default dictionary) has a connection weight that strongly favors the connection between 数 (number) and 接尾 (suffix), so 一人 is analyzed as a number followed by the suffix 人.
With UniDic, the output is the single token "一人" (ひとり), so the simplest solution is to switch to UniDic.
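
To make the connection-weight explanation concrete, here is a minimal toy sketch of Viterbi-style segmentation in Python. It is not Kuromoji's implementation: the function name `best_segmentation`, the three-entry lexicon, and every cost value are made up for illustration and are not taken from IPADic or UniDic. It only shows how a strongly favorable (very low) connection cost between 数 and 接尾 can make the two-token path cheaper than the single-token path.

```python
# Toy lexicon: (surface, part of speech, word cost). All costs are hypothetical.
LEXICON = [
    ("一", "数", 3000),
    ("人", "接尾", 4000),
    ("一人", "名詞", 5000),
]

# Hypothetical connection costs keyed by (left POS, right POS); missing pairs cost 0.
IPADIC_LIKE = {("数", "接尾"): -3000}  # number + suffix is strongly favored
NEUTRAL = {}                            # no pair is favored


def best_segmentation(text, lexicon, connection_cost):
    """Return (total cost, tokens) for the cheapest path through the lattice."""
    # best[i] maps the POS of the last token ending at offset i to (cost, tokens).
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, [])
    for start in range(len(text)):
        for prev_pos, (prev_cost, prev_tokens) in best[start].items():
            for surface, pos, word_cost in lexicon:
                if not text.startswith(surface, start):
                    continue
                end = start + len(surface)
                cost = prev_cost + connection_cost.get((prev_pos, pos), 0) + word_cost
                # Keep only the cheapest way to reach `end` with this POS.
                if pos not in best[end] or cost < best[end][pos][0]:
                    best[end][pos] = (cost, prev_tokens + [surface])
    # Raises ValueError if the lexicon cannot cover the whole text.
    return min(best[len(text)].values(), key=lambda entry: entry[0])


print(best_segmentation("一人", LEXICON, IPADIC_LIKE))
print(best_segmentation("一人", LEXICON, NEUTRAL))
```

With the favorable 数 + 接尾 connection cost, the split path 一 / 人 is the cheaper one; with neutral connection costs, the single token 一人 wins. The real dictionaries use far larger cost matrices, so the Viterbi view on the demo page remains the way to inspect the actual values.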
