Chinese segmentation not correct #226

Open
sivdead opened this issue Jul 5, 2023 · 2 comments

sivdead commented Jul 5, 2023

I noticed that this program uses jieba.cut to segment Chinese text, but it does not always work well.
For example, for the Chinese phrase 永永远远是龙的传人, jieba.cut returns 永永远远/是/龙的传人, whereas jieba.cut_for_search returns 永远/远远/永永远远/是/传人/龙的传人. I think the latter is better for index search.
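
To make the difference between the two modes concrete, here is a minimal sketch that does not call jieba at all. It takes the precise-mode result quoted above and adds every shorter dictionary word found inside each token, which is roughly what a search-oriented mode produces. `TOY_DICT` and `search_mode` are hypothetical names invented for this illustration, and the dictionary holds only the words from this example.

```python
# Hypothetical toy dictionary: just the words that appear in this example.
# A real segmenter ships a far larger word list.
TOY_DICT = {"永远", "远远", "永永远远", "是", "传人", "龙的传人"}


def search_mode(words, dictionary):
    """Expand each precise-mode word with the 2- and 3-character
    dictionary words it contains, then emit the word itself."""
    out = []
    for w in words:
        for n in (2, 3):
            if len(w) > n:
                for i in range(len(w) - n + 1):
                    gram = w[i:i + n]
                    if gram in dictionary:
                        out.append(gram)
        out.append(w)
    return out


# Precise-mode segmentation as reported in this issue (hard-coded, not computed).
precise = ["永永远远", "是", "龙的传人"]
expanded = search_mode(precise, TOY_DICT)
print("/".join(expanded))
```

With this toy dictionary the expansion reproduces the search-mode output quoted above. Note that the result is a flat list: nothing in it records which sub-words belong to which original word.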

sivdead commented Jul 5, 2023

I can open a PR to solve this if you think it should be fixed.

curquiza added and then removed the support label (Issues related to support questions) on Jul 11, 2023
ManyTheFish (Member) commented

Hello @sivdead,
you're right, using cut_for_search would increase the recall of Meilisearch by splitting words in different ways.
However, Meilisearch relies on word positions for queries, and Jieba's cut_for_search doesn't give any clue about the position of each token. Moreover, charabia does not support shifting tokens.
To support this kind of position-shifting behavior, the charabia output would have to change to a tree shape. For instance, 永永远远是龙的传人 would be shaped as:

永永远远 ──┬─► 是 ─┬─► 龙的传人
永远 ─────┤       └─► 传人
远远 ─────┘

This is not possible without a huge amount of work,
but I have to admit that it would significantly enhance search recall.
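
The position problem can be sketched in a few lines. The code below is only an illustration under stated assumptions: it is not charabia's API or Meilisearch's internals, and `TOY_DICT`, `alternatives`, `flat_positions` and `shared_positions` are hypothetical names. It compares a flat token stream, where every extra sub-word pushes later words to higher positions, with a grouped layout in which alternatives share the position of the word they came from, as in the tree above.

```python
# Hypothetical list of sub-words a search-oriented segmenter might add.
TOY_DICT = {"永远", "远远", "传人"}


def alternatives(word, dictionary):
    """Return the word followed by every shorter dictionary word inside it."""
    subs = [
        word[i:j]
        for i in range(len(word))
        for j in range(i + 2, len(word) + 1)
        if word[i:j] != word and word[i:j] in dictionary
    ]
    return [word] + subs


# Precise-mode segmentation quoted in this issue (hard-coded).
precise = ["永永远远", "是", "龙的传人"]

# Flat output: each token takes the next position, so later words drift.
flat = [tok for w in precise for tok in alternatives(w, TOY_DICT)]
flat_positions = {tok: pos for pos, tok in enumerate(flat)}

# Tree-like output: alternatives share the position of their source word.
shared_positions = {
    tok: pos
    for pos, w in enumerate(precise)
    for tok in alternatives(w, TOY_DICT)
}

print(flat_positions["是"], shared_positions["是"])
```

In the flat layout 是 lands at position 3 even though it is the second word of the phrase. In the grouped layout it stays at position 1, and 传人 shares position 2 with 龙的传人.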

Thank you for your report, and sorry for the delay in answering.

See you!
