Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sub-iterators of ConjunctionDISI are not on the same document #13616

Open
ChapterSevenSeeds opened this issue May 9, 2024 · 13 comments
Open
Labels
bug Something isn't working Search Search query, autocomplete ...etc

Comments

@ChapterSevenSeeds
Copy link

Describe the bug

When performing an intervals query on a field with a custom mapping and analyzer, I get an illegal argument exception with the following reason: Sub-iterators of ConjunctionDISI are not on the same document!. I am not sure if the error is due to an issue with our custom mapping or analyzers, or if it caused by a bug somewhere. Any insight is greatly appreciated.

Related component

Search

To Reproduce

  1. Run OpenSearch in a new Docker container
docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=password_here" -e 'DISABLE_SECURITY_PLUGIN=true' opensearchproject/opensearch:latest
  1. Create a new index with the custom mapping and analyzer
PUT http://localhost:9200/my_index
{
    "mappings": {
        "dynamic_templates": [
            {
                "SearchableFilter": {
                    "match": "*_SearchableFilter",
                    "match_mapping_type": "string",
                    "mapping": {
                        "fields": {
                            "keyword": {
                                "ignore_above": 256,
                                "type": "keyword"
                            }
                        },
                        "search_analyzer": "text_general_search",
                        "type": "text"
                    }
                }
            }
        ]
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "text_general_search": {
                    "filter": [
                        "stop",
                        "lowercase"
                    ],
                    "type": "custom",
                    "tokenizer": "standard"
                }
            }
        }
    }
}
  1. Create a new document
POST http://localhost:9200/my_index/_doc
{
    "siblings_SearchableFilter": [
        "a Sister"
    ]
}
  1. Perform the following intervals query
POST http://localhost:9200/my_index/_search
{
    "query": {
        "intervals": {
            "siblings_SearchableFilter": {
                "all_of": {
                    "intervals": [
                        {
                            "any_of": {
                                "intervals": [
                                    {
                                        "match": {
                                            "query": "a"
                                        }
                                    },
                                    {
                                        "match": {
                                            "query": "b"
                                        }
                                    }
                                ]
                            }
                        },
                        {
                            "match": {
                                "query": "sister"
                            }
                        }
                    ],
                    "max_gaps": 30,
                    "ordered": true
                }
            }
        }
    }
}
  1. Notice you get a 400 response code with the error mentioned above.

Expected behavior

I would expect the intervals query to succeed and return the document created in the reproduction steps.

Additional Details

  • The OpenSearch version is the latest since we are running the Docker image opensearchproject/opensearch:latest.
  • In my local testing, I ran the image in Docker on WSL running on Windows 11. However, this error also occurs on machines running Ubuntu 22 whose GET response is the following:
{
    "name": "pss-cluster-coordinating-01",
    "cluster_name": "prod_cluster",
    "cluster_uuid": "FFQa8G3RRwqNdct-KMih4g",
    "version": {
        "distribution": "opensearch",
        "number": "2.11.0",
        "build_type": "tar",
        "build_hash": "4dcad6dd1fd45b6bd91f041a041829c8687278fa",
        "build_date": "2023-10-13T02:55:55.511945994Z",
        "build_snapshot": false,
        "lucene_version": "9.7.0",
        "minimum_wire_compatibility_version": "7.10.0",
        "minimum_index_compatibility_version": "7.0.0"
    },
    "tagline": "The OpenSearch Project: https://opensearch.org/"
}
  • In my testing, there are a few things that remove the error, each of which also prevent the query from functioning properly and returning the desired document:
    1. Removing max_gaps from the query or setting it to -1.
    2. Removing the custom text_general_search analyzer from the mapping.
    3. Removing either stop or lowercase from the analyzer filter.
@ChapterSevenSeeds ChapterSevenSeeds added bug Something isn't working untriaged labels May 9, 2024
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label May 9, 2024
@expani
Copy link

expani commented May 15, 2024

Thanks for filing this issue @ChapterSevenSeeds

I am able to reproduce the issue with the steps you mentioned. Will take a stab at it and update as I get time.

@getsaurabh02
Copy link
Member

Thanks @expani . Removing untriaged label.

@expani
Copy link

expani commented May 15, 2024

The match rule

"match": {
    "query": "a"
}

gets converted into an empty query ( No Match ) due to the custom analyser removing stop words.

The final IntervalsSource created for the IntervalQuery here is a DisjunctionIntervalsSource and looks like below

or( 
 MAXGAPS/30(ORDERED(b,sister)),
 MAXGAPS/30(ORDERED(no_match,sister)) 
)

OpenSearch represents the IntervalsIterator for no_match using NO_INTERVALS which represents un-positioned docId using NO_MORE_DOCS -> 2147483647 here

The intervals iterator for the rule

"match":  {
    "query": "sister"
}

is a based on PostingsEnum which represents an un-positioned docId using -1

When both these iterators are matched, they differ which leads to the error reported being thrown here

Lucene also represents an IntervalsIterator for no_match called NO_INTERVALS but it represents un-positioned docIds using -1 Reference This was fixed a few years back in Lucene by this PR

I made the same changes to the NO_INTERVALS iterator used by OpenSearch and the query mentioned above worked as expected.

The diff is

diff --git a/server/src/main/java/org/opensearch/index/query/IntervalBuilder.java b/server/src/main/java/org/opensearch/index/query/IntervalBui
lder.java
index 0e42e79f67d..fa21e6fbd82 100644
--- a/server/src/main/java/org/opensearch/index/query/IntervalBuilder.java
+++ b/server/src/main/java/org/opensearch/index/query/IntervalBuilder.java
@@ -266,6 +266,8 @@ public class IntervalBuilder {
         @Override
         public IntervalIterator intervals(String field, LeafReaderContext ctx) {
             return new IntervalIterator() {
+
+                boolean exhausted = false;
                 @Override
                 public int start() {
                     return NO_MORE_INTERVALS;
@@ -293,16 +295,18 @@ public class IntervalBuilder {

                 @Override
                 public int docID() {
-                    return NO_MORE_DOCS;
+                    return exhausted ? NO_MORE_DOCS : -1;
                 }

                 @Override
                 public int nextDoc() {
+                    exhausted = true;
                     return NO_MORE_DOCS;
                 }

                 @Override
                 public int advance(int target) {
+                    exhausted = true;
                     return NO_MORE_DOCS;

We should make the NO_INTERVALS used by Lucene public and use the same in OpenSearch instead of creating a copy to avoid any such issues in future.

@jainankitk
Copy link
Contributor

@expani - Thank you root causing this issue. Completely agree on reusing the NO_INTERVALS iterator in Lucene instead of duplicating the code in OpenSearch. Have you created issue/PR with lucene for addressing this?

@expani
Copy link

expani commented May 20, 2024

@jainankitk Yes opened this PR in Lucene

@jainankitk
Copy link
Contributor

@jainankitk Yes opened this apache/lucene#13385 in Lucene

Thank you for the lucene PR!

@uschindler
Copy link
Contributor

This problem is often seen with patent search. I opened apache/lucene#13388

There are more issues, but making that constant publicly available is wrong. There should be a method to create an empty interval with a reason.

@uschindler
Copy link
Contributor

P.S.: I tend to revert above PR. It does not fit the Lucene API patterns.

@uschindler
Copy link
Contributor

P.S.: The workaround that can be used already in OpenSearch code is the following factory method: Intervals.atLeast(1)

This requires at least one match for an empty list of intervals, so it rewrites to a NoMatchIntervalsSource which is a duplicate of above constant. The given issue will also merge those implementations.

@expani
Copy link

expani commented May 20, 2024

Agree on putting this behind an API.

Returning a null iterator might not be properly checked at all places of usage so having NO_INTERVALS iterator returning NO_MORE_DOCS seems more safe of a change.

@uschindler
Copy link
Contributor

I made a comment because of that in the PR. We need more tests, I agree.

Up to now this atLeast(1) has always worked for me.

Null is always used in lucene for iterators that fo not match anything.

@uschindler
Copy link
Contributor

@expani
Copy link

expani commented May 20, 2024

Yes returning null worked for this case in OpenSearch as well, but NO_INTERVAL is being used extensively across Lucene and not all the usage seem to be covered by UTs.

I added a comment to support both style of iterators null and NO_INTERVAL in the PR to make sure the changes are safe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search Search query, autocomplete ...etc
Projects
Status: 🆕 New
Development

No branches or pull requests

5 participants