English numerals (digits) are tagged as CN_WORD #1046

Open
stormyi opened this issue Feb 27, 2024 · 3 comments

Comments

stormyi commented Feb 27, 2024

Description

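With ik_smart (v7.10.0), the tokenization of an English numeral (digit) followed by a Chinese measure word does not match what I expect. What I especially cannot understand is why the 7 in "7天" is tagged CN_WORD.
(PS: in an otherwise identical environment, ES 7.10.2 + IK 7.10.2 tags the numeral + Chinese measure word combination as TYPE_CQUAN.)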

Steps to reproduce

POST /_analyze
'{"field":"content","analyzer":"ik_smart","text":"7天 44天 55天"}'

Expected behavior

"7天" should be a single token of type TYPE_CQUAN:
{
"token": "7天",
"start_offset": 0,
"end_offset": 2,
"type": "TYPE_CQUAN",
"position": 0
}

Actual behavior

{
"token": "7",
"start_offset": 0,
"end_offset": 1,
"type": "CN_WORD",
"position": 0
},
{
"token": "天",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
}
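
For anyone comparing IK versions, the mismatch above can be checked mechanically. The following is a minimal sketch, not part of IK or Elasticsearch: the helper name `find_mistagged_numerals` is hypothetical, and it assumes the tokens have already been taken from the `tokens` array of an `_analyze` response with the same fields as the output quoted above.

```python
import re

# Matches tokens that consist only of ASCII digits, e.g. "7" or "44".
DIGITS_ONLY = re.compile(r"[0-9]+")


def find_mistagged_numerals(tokens):
    """Return the digit-only tokens whose type is CN_WORD.

    `tokens` is a list of dicts shaped like the entries of an
    `_analyze` response (hypothetical helper, for illustration only).
    """
    return [
        t for t in tokens
        if DIGITS_ONLY.fullmatch(t["token"]) and t["type"] == "CN_WORD"
    ]


# Actual output reported in this issue for the text "7天".
actual_tokens = [
    {"token": "7", "start_offset": 0, "end_offset": 1, "type": "CN_WORD", "position": 0},
    {"token": "天", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1},
]

mistagged = find_mistagged_numerals(actual_tokens)
print([t["token"] for t in mistagged])  # prints ['7']
```

An empty result would mean the analyzer no longer tags bare digits as CN_WORD for the given text.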

Environment

  • Versions: Elasticsearch 7.10.0

stormyi commented Feb 28, 2024

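My expectation is that "7天" should be a single token rather than being split into "7" and "天". It looks like "7" is treated as a CN_WORD and is therefore not combined with "天". Can anyone explain this?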

stormyi commented Feb 28, 2024

@medcl

medcl commented Feb 29, 2024

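Yes, this area does need improvement. The IK project will be sorting through its backlog of legacy bugs soon, and we will keep improving it.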
