
How can the tokenization difference between ik_smart and ik_max_word be resolved? #992

Open

whldoudou opened this issue Jan 11, 2023 · 6 comments

@whldoudou

ik_max_word analysis result:

```
GET _analyze
{
  "text": ["52周"],
  "analyzer": "ik_max_word"
}

{
  "tokens" : [
    {
      "token" : "52",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "ARABIC",
      "position" : 0
    },
    {
      "token" : "周",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "COUNT",
      "position" : 1
    }
  ]
}
```

ik_smart analysis result:

```
GET _analyze
{
  "text": ["52周"],
  "analyzer": "ik_smart"
}

{
  "tokens" : [
    {
      "token" : "52周",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "TYPE_CQUAN",
      "position" : 0
    }
  ]
}
```
The problem: ik_max_word never produces a token of type TYPE_CQUAN for input like this. Is there a way to work around it?
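A possible workaround (just a sketch, not part of the original report; the index and field names below are placeholders) is to expose both tokenizations through a multi-field, so a query can target whichever one matches:

```
PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "smart": {
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}
```

A query that needs the merged number+count token (e.g. 52周) could then run against `title.smart`, while `title` keeps the fine-grained ik_max_word tokens.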

@whldoudou
Author

```java
/**
 * Compound lexemes (merge number and count-word tokens).
 */
private void compound(Lexeme result){

	if(!this.cfg.isUseSmart()){
		// Compounding only happens in smart mode; ik_max_word returns here.
		return;
	}
	// Merge number and count-word lexemes.
	if(!this.results.isEmpty()){

		if(Lexeme.TYPE_ARABIC == result.getLexemeType()){
			Lexeme nextLexeme = this.results.peekFirst();
			boolean appendOk = false;
			if(Lexeme.TYPE_CNUM == nextLexeme.getLexemeType()){
				// Merge Arabic numeral + Chinese numeral.
				appendOk = result.append(nextLexeme, Lexeme.TYPE_CNUM);
			}else if(Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()){
				// Merge Arabic numeral + Chinese count word.
				appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
			}
			if(appendOk){
				// Pop the merged lexeme.
				this.results.pollFirst();
			}
		}

		// A second round of merging may be possible.
		if(Lexeme.TYPE_CNUM == result.getLexemeType() && !this.results.isEmpty()){
			Lexeme nextLexeme = this.results.peekFirst();
			boolean appendOk = false;
			if(Lexeme.TYPE_COUNT == nextLexeme.getLexemeType()){
				// Merge Chinese numeral + Chinese count word.
				appendOk = result.append(nextLexeme, Lexeme.TYPE_CQUAN);
			}
			if(appendOk){
				// Pop the merged lexeme.
				this.results.pollFirst();
			}
		}

	}
}
```

The issue comes from the number/count-word merging inside compound(): why does ik_max_word skip this merging entirely? Is there a design consideration behind that?

@whldoudou
Author

@medcl

@hongyan1110

@medcl I'm hitting the same problem. In theory, the ik_smart tokens should be a subset of the ik_max_word tokens.

@crossmaya

I've run into this as well and was just about to open an issue. Have you found a solution? It's exactly the same problem.

@medcl
Member

medcl commented Nov 23, 2023

ik_smart uses a different algorithm; its output is not necessarily a subset of ik_max_word's.

@hongyan1110

> ik_smart uses a different algorithm; its output is not necessarily a subset of ik_max_word's.

That means if I index data with ik_max_word and search it with the ik_smart analyzer, some matches may be missed.
But for ES, the usual best practice is to index with a fine-grained analyzer and search with a coarse-grained one. Is there a pair of IK analyzers that can be combined this way? @medcl
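For context, the pairing described above would look roughly like this mapping sketch (the index and field names are placeholders):

```
PUT my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```

Given the reply above that ik_smart is not guaranteed to be a subset of ik_max_word, a search-time token such as 52周 may fail to match the separate 52 / 周 tokens stored in the index, which is exactly the concern here.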
