Tokenizer emits spaces as tokens #19

Open
mhfc007 opened this issue Nov 26, 2016 · 3 comments
mhfc007 commented Nov 26, 2016

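I configured everything according to Readme.md.

However, the tokenizer emits " " (a space) as a token, as well as "的" and punctuation marks. How can I filter these tokens out?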

hankcs (Owner) commented Nov 26, 2016

xuxucode commented Dec 28, 2017

After setting stopWordDictionaryPath to stopwords_hanlp.txt, only a single space is filtered out. With two consecutive spaces, [2020] appears. The configuration is as follows:

  <analyzer type="index">
    <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" 
      enableIndexMode="true" 
      stopWordDictionaryPath="/var/solr/stopwords_hanlp.txt"
    />
  </analyzer>

The incorrect result is shown in the screenshot, with [2020] appearing in the middle. What character is "[2020]"?
[Screenshot: hanlp_stopwords]

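I also tried filtering these characters with the solr.StopFilterFactory filter, but the problem remains: neither [20] nor [2020] is filtered out. The configuration is as follows: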

  <analyzer type="index">
    <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory"
      enableIndexMode="true"
    />
    <filter class="solr.StopFilterFactory" 
      ignoreCase="true" 
      words="/var/solr/stopwords_hanlp.txt"
    />
  </analyzer>

As a result, the "space" ends up being the most frequently indexed term:

[Screenshot: hanlp_stopwords_term]

hankcs (Owner) commented Dec 28, 2017

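  1. Tokenization is defined as splitting the original text into segments; it is not responsible for preprocessing.
  2. The tokenizer must emit spaces, otherwise highlighting will be misaligned. The same rule applies to other characters such as tabs and newlines.
  3. If you do not want a particular segment to appear in the index, use the stop word mechanism.
  4. 20 is the hexadecimal code for the space character. To filter it, type an actual space in the stop word dictionary, not the characters 20.
  5. These symbols are generally tagged with the part of speech w, so you can write your own code to filter them. Configurable filtering by part of speech may be supported in the future, but the feature is so simple that there is little motivation to build it.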
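
Points 4 and 5 can be illustrated with a minimal sketch that uses only the Java standard library. `Token` and `TokenFilterSketch` are hypothetical names made up for this example, not HanLP or Lucene classes. The sketch assumes each token carries its text and a part-of-speech tag, and that punctuation tags start with `w`; the tags in the sample data are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one segmenter output item: the surface text plus
// a part-of-speech tag. This is NOT HanLP's own term class.
final class Token {
    final String word;
    final String nature;

    Token(String word, String nature) {
        this.word = word;
        this.nature = nature;
    }
}

final class TokenFilterSketch {

    // Renders the UTF-8 bytes of a string as lowercase hex.
    // One space gives "20"; two consecutive spaces give "2020".
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // True when the text is empty or consists only of whitespace
    // (spaces, tabs, newlines, and so on).
    static boolean isBlank(String s) {
        return s.chars().allMatch(Character::isWhitespace);
    }

    // Keeps a token only if it has visible text and its tag does not start
    // with "w" (assumed here to mark punctuation and symbols).
    static boolean keep(Token t) {
        boolean punctuation = t.nature != null && t.nature.startsWith("w");
        return !isBlank(t.word) && !punctuation;
    }

    // Returns the text of every token that passes keep(), in order.
    static List<String> filter(List<Token> tokens) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (keep(t)) {
                kept.add(t.word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("分词", "v"));
        tokens.add(new Token("  ", "w"));   // two consecutive spaces
        tokens.add(new Token(",", "w"));
        tokens.add(new Token("结果", "n"));

        System.out.println(hex("  "));      // prints 2020
        System.out.println(filter(tokens)); // prints [分词, 结果]
    }
}
```

The whitespace check works on the token text itself, so a run of several spaces is dropped even when the stop word dictionary only lists a single space. Filtering after tokenization also leaves the tokenizer's offsets untouched, which is what point 2 needs for highlighting.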
