Paragraph SimHash has only 42 significant bits no matter how many bits are configured; all remaining bits are 0 #39

Open
lty2008one opened this issue Apr 10, 2023 · 3 comments

Comments

@lty2008one

image

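I found this by setting a breakpoint in the middle of the similarity computation. The hash computed for each character in simHash has at most 42 bits, so only the first 42 bits of the vector are meaningful. Perhaps the hash algorithm needs to be replaced?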
@shibing624
Owner

It is capped at 128 bits. Short text uses only the leading bits; longer text keeps adding more, up to a maximum of 128 bits.

@lty2008one
Author

lty2008one commented Apr 14, 2023

It is capped at 128 bits. Short text uses only the leading bits; longer text keeps adding more, up to a maximum of 128 bits.

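What I see in my testing is this: the text is tokenized and a hash value is computed for each token. For each bit position, a weight is accumulated by subtracting 1 when the token's hash has a 0 in that position and adding 1 when it has a 1 (the weights all appear to be 1). Each bit of the final result is then set to 0 or 1 according to the sign of the accumulated value.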
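The procedure described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the name `simhash_sketch`, the `token_hash` parameter, the toy hash table, and the unit weights are all assumptions made for the example.

```python
from typing import Callable, Iterable


def simhash_sketch(tokens: Iterable[str],
                   token_hash: Callable[[str], int],
                   bits: int = 128) -> int:
    """Combine per-token hashes into one fingerprint (unit weights assumed)."""
    scores = [0] * bits
    for token in tokens:
        h = token_hash(token)
        for i in range(bits):
            # A 1 bit adds 1 to that position, a 0 bit subtracts 1.
            scores[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, score in enumerate(scores):
        # A positive total gives a 1 bit; zero or negative gives a 0 bit.
        if score > 0:
            fingerprint |= 1 << i
    return fingerprint


# Hypothetical 3-bit token hashes, chosen only to make the arithmetic visible.
toy_hashes = {"a": 0b011, "b": 0b001, "c": 0b101}
fp = simhash_sketch(["a", "b", "c"], toy_hashes.__getitem__, bits=3)
print(format(fp, "03b"))  # 001
```

In the toy run, bit 0 is set in all three hashes (total +3), while bits 1 and 2 are each set in only one hash (total -1), so only bit 0 survives in the fingerprint.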

But no token's hash value ever exceeds 42 bits, so the final result can never exceed 42 bits either.
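The argument can be checked with a small experiment. The sketch below is an assumption-laden illustration rather than the library's code: `narrow_hash` is a hypothetical stand-in for a token hash limited to 42 bits, and `wide_hash` uses MD5 from `hashlib` only as one convenient source of 128 hash bits. Bit positions are counted from the least significant bit here, so the unused positions appear as the high bits of the integer.

```python
import hashlib

MASK_42 = (1 << 42) - 1


def simhash_fp(tokens, token_hash, bits=128):
    """Unit-weight SimHash: a bit is 1 when more tokens set it than clear it."""
    fp = 0
    for i in range(bits):
        score = sum(1 if (token_hash(t) >> i) & 1 else -1 for t in tokens)
        if score > 0:
            fp |= 1 << i
    return fp


def wide_hash(token):
    # Full 128-bit MD5 digest as an integer (an assumed choice of hash).
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")


def narrow_hash(token):
    # Hypothetical hash that only ever yields 42 bits.
    return wide_hash(token) & MASK_42


tokens = ["计算", "段落", "simhash", "有效", "位数"]
narrow = simhash_fp(tokens, narrow_hash)
wide = simhash_fp(tokens, wide_hash)

print(narrow.bit_length())  # at most 42
print(wide.bit_length())    # can reach 128
```

With the narrow hash, every position from 42 upward gets a score of minus the token count, so those bits are always 0 regardless of the `bits` setting. Switching to a token hash that fills all 128 bits is one way to make the remaining positions carry information.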

image

@shibing624

@shibing624
Owner

OK, so you feel that 42 bits gives poor results and would like to change it to 128 bits or longer?
