seg single word and fix lucene error #836

Open
wants to merge 2 commits into
base: master

Conversation

SophieMay

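1. When the dictionary contains no single characters but lexemes conflict, split out the single characters from the earlier of the intersecting lexemes (done by uncommenting the existing code).
2. Fix the low-level Lucene error introduced by change 1: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards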

@medcl
Member

medcl commented Nov 24, 2020

Could you please provide a test? Thanks.

@SophieMay
Author

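Case: 螺丝批及批头 ("screwdriver and bits"), with a user-defined dictionary in use.

After uncommenting the code, the segmentation result is: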

螺丝批及批头 0-6 CN_WORD
螺丝批 0-3 CN_WORD
螺丝 0-2 CN_WORD
及 3-4 CN_CHAR
批 2-3 CN_CHAR
及 3-4 CN_CHAR
批头 4-6 CN_WORD

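When the segmentation result is passed down to Lucene, the startOffset of each token is validated. The first 及 has startOffset 3, which is greater than the startOffset (2) of the 批 that follows it, so Lucene reports the error: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards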
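To make the failing condition concrete, here is a minimal sketch of the rule stated in the error message, applied to the token list above. It is an approximation written for illustration, not Lucene's actual code; the `Token` class and the `firstViolation` helper are hypothetical names that do not exist in Lucene or IK.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetCheck {
    // Hypothetical token holder for illustration; not IK's or Lucene's own class.
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Returns the index of the first token that breaks the rule quoted in the
    // error message (negative start, end < start, or start going backwards),
    // or -1 if every token passes.
    static int firstViolation(List<Token> tokens) {
        int lastStart = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.start < 0 || t.end < t.start || t.start < lastStart) {
                return i;
            }
            lastStart = t.start;
        }
        return -1;
    }

    // The unsorted segmentation result from this comment.
    static List<Token> unsortedExample() {
        return Arrays.asList(
            new Token("螺丝批及批头", 0, 6),
            new Token("螺丝批", 0, 3),
            new Token("螺丝", 0, 2),
            new Token("及", 3, 4),
            new Token("批", 2, 3),
            new Token("及", 3, 4),
            new Token("批头", 4, 6));
    }

    public static void main(String[] args) {
        // Prints 4: the token 批 (2-3) comes after 及 (3-4), so its start goes backwards.
        System.out.println(firstViolation(unsortedExample()));
    }
}
```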

The fix mainly adjusts the order of the segmentation results by sorting them (collection.sort):

螺丝批及批头 0-6 CN_WORD
螺丝批 0-3 CN_WORD
螺丝 0-2 CN_WORD
批 2-3 CN_CHAR
及 3-4 CN_CHAR
及 3-4 CN_CHAR
批头 4-6 CN_WORD
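The reordering can be sketched as a stable sort on startOffset. This is a minimal illustration under the assumption that the comparator looks only at the start offset; the comparator actually used in the PR is not shown in this thread, and the `Token` class and `sortedByStart` helper are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByStartOffset {
    // Hypothetical token holder for illustration; not IK's Lexeme class.
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Returns a copy ordered by start offset. List.sort is stable, so tokens
    // that share a start offset keep their original relative order.
    static List<Token> sortedByStart(List<Token> tokens) {
        List<Token> copy = new ArrayList<>(tokens);
        copy.sort(Comparator.comparingInt(t -> t.start));
        return copy;
    }

    // The segmentation result before sorting, as reported in this thread.
    static List<Token> unsortedExample() {
        return Arrays.asList(
            new Token("螺丝批及批头", 0, 6),
            new Token("螺丝批", 0, 3),
            new Token("螺丝", 0, 2),
            new Token("及", 3, 4),
            new Token("批", 2, 3),
            new Token("及", 3, 4),
            new Token("批头", 4, 6));
    }

    public static void main(String[] args) {
        // Prints one token per line as "text start-end".
        for (Token t : sortedByStart(unsortedExample())) {
            System.out.println(t.text + " " + t.start + "-" + t.end);
        }
    }
}
```

Because the sort is stable, the three tokens starting at offset 0 stay in their original order and only 批 (2-3) moves ahead of the first 及 (3-4), so start offsets no longer go backwards.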

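The character 及 is emitted twice. This duplicate token does not affect downstream processing and can be optimized further in a follow-up.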

The same kind of case reported in the issues:
得饶人处且饶人
黎明前的黑暗
宝剑锋从磨砺出
