hl_title and hl_subtitle fields defined without stopword query preprocessor #95

Open
MikeYalter opened this issue May 28, 2019 · 0 comments

Fields hl_title and hl_subtitle are defined as stringTokens, a custom field type based on solr.TextField.
Its query analyzer consists of only a tokenizer and a lowercase filter:

    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>

The content_citation field is defined as type text, which is similar to stringTokens but has additional index and query analysis steps.
Its query analyzer is:

      <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.LengthFilterFactory" min="2" max="100"/>
      </analyzer>

Most notably, it includes the <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> filter, which stringTokens lacks. The missing filter seems to cause performance issues: running the same query against the two fields shows a roughly 4.5x difference in QTime (1491 vs. 6675).
/select?q=(content_citation:(Technology development for identification of citrus Citrus spp rootstocks based on Sequence Tagged Microsatellite marker))
results in:

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1491</int>
</lst>
<result name="response" numFound="7235278" start="0">

whereas the hl_title query:
/select?q=(hl_title:(Technology development for identification of citrus Citrus spp rootstocks based on Sequence Tagged Microsatellite marker))
results in:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">6675</int>
</lst>
<result name="response" numFound="55123772" start="0">

Removing the stopwords ("for", "of", "on") from the hl_title query manually:
/select?q=(hl_title:(Technology development identification citrus Citrus spp rootstocks based Sequence Tagged Microsatellite marker))
results in:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1419</int>
</lst>
<result name="response" numFound="4798657" start="0">

Adding the stopwords query filter to the stringTokens definition might be worthwhile.
Notes:
Adding the query filter does not remove the stopwords from the field or from quoted queries.
Modifying the field type definition will affect other fields of that type, so the impact may need to be assessed.
Changing the field definition to type text would likely require a reindex.
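
As a rough sketch of the proposed change, the stringTokens query analyzer could look something like the fragment below. It assumes that the same stopwords.txt file used by the text type is appropriate for stringTokens, and that the index analyzer (not shown in this issue) is left untouched. This is untested against the actual schema.

```xml
<!-- Proposed query analyzer for the stringTokens field type (sketch only). -->
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Assumption: reuse the stopword list already referenced by the text type. -->
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
```

Placing the stop filter after the lowercase filter mirrors the order used in the text type's query analyzer. Re-running the three queries above after the change would be a reasonable way to confirm whether the hl_title QTime drops to the level of the manually filtered query.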
