
[BUG] function score query returned an invalid (negative) score with multi match cross fields query #7860

Open
snikoyo opened this issue Jun 1, 2023 · 14 comments · May be fixed by #13829
Labels
bug Something isn't working Search:Relevance

Comments

@snikoyo

snikoyo commented Jun 1, 2023

Describe the bug

The cross_fields scoring type can produce negative scores when some documents are missing fields.

This bug was already reported for Elasticsearch and fixed there (via the linked MR) for Elasticsearch >= 8.4.

We encounter it in production for certain queries that search a field which does not exist in all documents; for those queries it reproduces deterministically.

Error message:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "function score query returned an invalid score: -0.05202394 for doc: 1123896"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my_index",
        "node": "my_node",
        "reason": {
          "type": "exception",
          "reason": "function score query returned an invalid score: -0.05202394 for doc: 1123896"
        }
      }
    ]
  },
  "status": 500
}

The document number mentioned in the error message does not exist.

To Reproduce

The steps to reproduce are in the above-mentioned bug report for Elasticsearch.

Expected behavior

The function score should never return a negative value.

Plugins

None.

Host/Environment (please complete the following information):

OpenSearch version: 2.5

@snikoyo snikoyo added bug Something isn't working untriaged labels Jun 1, 2023
@andrross andrross added the Search Search query, autocomplete ...etc label Jun 1, 2023
@sejli sejli removed the untriaged label Jun 7, 2023
@macohen
Contributor

macohen commented Jun 29, 2023

In this case, the term "red" appears in the color field of every document, and it also appears in the field "shape", which only exists in one doc.

Given n = 4 (number of documents containing the term) and N = 3 (total number of documents with the field), the IDF calculation log(1 + (N - n + 0.5) / (n + 0.5)) gives log(1 + (3 - 4 + 0.5) / (4 + 0.5)) = -0.11778303 with the natural log. I also plugged this into another calculator and got -0.0511525224, which is the same value in base-10 log. We have two things to investigate:

  • How can we correct the negative value and still maintain the right relative scores?
  • Is the IDF calculation described in the explanation actually what's happening in the code?
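The two calculator results above differ only in the log base. A quick sketch (the helper name `bm25_idf` is mine, not from the codebase) confirms both values and shows that any n > N makes the idf negative:

```python
import math

def bm25_idf(n: int, N: int, log=math.log) -> float:
    """BM25 idf as reported by explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return log(1 + (N - n + 0.5) / (n + 0.5))

# n = 4 documents containing the term, N = 3 documents with the field
print(bm25_idf(4, 3))              # natural log: -0.117783...
print(bm25_idf(4, 3, math.log10))  # base-10 log: -0.051152...

# Lucene uses the natural log; whenever n > N the argument drops below 1,
# so the idf (and hence the score contribution) goes negative.
```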
DELETE shape_index 

PUT shape_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "thing_id": {
        "type": "integer"
      },
      "shape": {
        "type": "text"
      },
      "color": {
        "type": "text", 
        "fields": {
          "standard": { "type": "text" },  
          "raw": { "type": "keyword" }     
        }
      }
    }
  }
}


PUT /shape_index/_doc/1
{
  "thing_id": 1,
  "color": "orange red yellow"
}



PUT /shape_index/_doc/2
{
  "thing_id": 2,
  "shape": "red square",
  "color": "orange red yellow"
}


PUT /shape_index/_doc/3
{
  "thing_id": 3,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/4
{
  "thing_id": 4,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/5
{
  "thing_id": 5,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/6
{
  "thing_id": 6,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/7
{
  "thing_id": 7,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/8
{
  "thing_id": 8,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/9
{
  "thing_id": 9,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/10
{
  "thing_id": 10,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/11
{
  "thing_id": 11,
  "color": "orange red yellow"
}


GET /shape_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "red",
                "type": "cross_fields",
                "fields": [
                  "color^5",
                  "shape^2"                  
                ],
                "slop": 10
              }
            }
          ]
        }
      },
      "functions": []
    }
  },
  "from": 0,
  "size": 10,
  "explain": true
}

extract from results with explain:

{
              "value": -0.4462871,
              "description": "weight(shape:red in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": -0.4462871,
                  "description": "score(freq=1.0), computed as boost * idf * tf from:",
                  "details": [
                    {
                      "value": 4.4,
                      "description": "boost",
                      "details": []
                    },
                    {
                      "value": -0.22314355,
                      "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details": [
                        {
                          "value": 2,
                          "description": "n, number of documents containing term",
                          "details": []
                        },
                        {
                          "value": 1,
                          "description": "N, total number of documents with field",
                          "details": []
                        }
                      ]
                    },
                    {
                      "value": 0.45454544,
                      "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details": [
                        {
                          "value": 1,
                          "description": "freq, occurrences of term within document",
                          "details": []
                        },
                        {
                          "value": 1.2,
                          "description": "k1, term saturation parameter",
                          "details": []
                        },
                        {
                          "value": 0.75,
                          "description": "b, length normalization parameter",
                          "details": []
                        },
                        {
                          "value": 2,
                          "description": "dl, length of field",
                          "details": []
                        },
                        {
                          "value": 2,
                          "description": "avgdl, average length of field",
                          "details": []
                        }
                      ]
                    }
                  ]
                }
              ]
            }
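The negative weight in the explain output above can be verified by hand. A sketch (variable names are mine) multiplying boost * idf * tf with the values from the explanation; the 4.4 boost appears to be the field boost of 2 scaled by k1 + 1 = 2.2, which is an assumption about how the similarity folds that factor in, but the arithmetic below only uses the numbers explain reports:

```python
import math

boost = 4.4                     # as reported by explain (field boost 2 x 2.2)
n, N = 2, 1                     # docs containing the term vs. docs with the field
freq, k1, b, dl, avgdl = 1.0, 1.2, 0.75, 2.0, 2.0

idf = math.log(1 + (N - n + 0.5) / (n + 0.5))       # -0.22314...
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))  # 0.45454...
score = boost * idf * tf

print(score)  # ≈ -0.4462871, matching weight(shape:red) above
```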

@dblock
Member

dblock commented Jul 6, 2023

@snikoyo please do help us with a fix if you have time, making sure we're not copying any non-APLv2 code of course

@msfroh
Collaborator

msfroh commented Sep 13, 2023

{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1,
"description": "N, total number of documents with field",
"details": []
}

How do more documents have the term than have the field?! head-asplode

@mingshl
Contributor

mingshl commented Sep 18, 2023

Looking closely at the shape_index example above,

I got some findings on the above two questions by validating the numbers from explain. The explain function does its work pretty nicely. (I am using 2 decimals below to keep the numbers short.)

  • How can we correct the negative value and still maintain the right relative scores?

    First of all, the function score takes the maximum score calculated per document per field.

    For example, the sample shape_index above has 11 docs; the query red matches the color field in all 11 docs and the shape field in 1 doc.
    Among all fields, color has the same score of 3.46 in all 11 docs, and shape has the score -0.44.

    As a result, this query's function score takes the maximum over the color and shape fields, which is 3.46 instead of -0.44.

  • Is the IDF calculation described in the explanation actually what's happening in the code?

    YES. The description of IDF matches the code exactly.

    Let's break down the negative idf score of the shape field, -0.22:
    by the description, "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))", where "n, number of documents containing term" is 2, and "N, total number of documents with field" is 1,
    so idf_shape = log(1 + (1 - 2 + 0.5) / (2 + 0.5)) = log(1 + (-0.2)) = log(0.8) = -0.22

So in this bug, the negative number is caused by N < n: the total number of documents with the field is smaller than the number of documents counted as containing the term, which happens when a document has more than one field containing the query term.

Both N and n are not populated correctly.

This is how idf is currently calculated:

  • field shape: n = 2 documents counted as containing the term red, but only N = 1 document has the field 'shape';
  • field color: n = 2 documents counted as containing the term red, and N = 4 documents have the field 'color'.

The fix for the idf calculated when matching multiple fields is to have logic that does the idf calculation separately per field:

  • field shape: n = 1 document containing the term red, and N = 1 document with the field 'shape';
  • field color: n = 11 documents containing the term red, and N = 11 documents with the field 'color'.

In the same example of shape's idf score of -0.22, the proposed fix would calculate N = 1, n = 1: idf_shape = log(1 + (1 - 1 + 0.5) / (1 + 0.5)) = log(1 + 0.5/1.5) = log(1.33) = 0.29.
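The before/after idf values for the shape field can be checked with a quick script (the helper name is mine):

```python
import math

def bm25_idf(n: int, N: int) -> float:
    """BM25 idf as reported by explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Current blended counts for the shape field: n = 2, but only N = 1 doc has the field
print(bm25_idf(2, 1))  # ≈ -0.2231, negative because n > N

# Proposed per-field counts for shape: n = 1, N = 1
print(bm25_idf(1, 1))  # ≈ 0.2877, positive again
```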

I'd like some opinions on the proposed fix logic above...

@anasalkouz anasalkouz added Search:Relevance and removed Search Search query, autocomplete ...etc labels Sep 20, 2023
@mingshl
Contributor

mingshl commented Sep 21, 2023

I am thinking more deeply about the most appropriate way to calculate and show the idf score per doc, and per doc per field.

In the response, the structure is: per doc_id, the score is the max of the per-field scores.

hits: {
  "max_score": 3.465736,
  "hits": [
    {
      "_id": 2,
      "_score": 3.465736,
      "_explanation": {
        "value": 3.465736,
        "description": "max of:",
        "details": [
          {
            "value": 3.465736,
            "description": "weight(field_1) [PerFieldSimilarity], result of:",
            "details": [ ... "score", "idf", "tf" ]
          },
          {
            "value": 1.234334243,
            "description": "weight(field_2) [PerFieldSimilarity], result of:",
            "details": [ ... "score", "idf", "tf" ]
          }
        ]
      }
    },
    { "_id": 3, ... }
  ]
}

The above example has overlap in the matching documents, so the idf calculation makes sense. But if the cross fields have no overlap in the matching documents, the max-per-field method computes a larger idf than the real idf.

Let's see another example of Index heart_diseases

mappings: 
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "smoking": {
        "type": "text"
      },
      "diabetes": {
        "type": "text"
      }
    }
  }
}

In calculating idf for cross fields, both 'smoking' and 'diabetes' are searched to match the term 'positive'.

Currently, explain reports idf per shard, per field. Let's restrict the discussion to one shard to simplify the problem and explain each field's idf.

  • Method 1: calculate idf per field, get the max of idf per fields (Current method)
  • Method 2: calculate idf per field, get the average of idf per fields
  • Method 3: calculate idf per document, all fields sharing the same idf

For calculating idf there are two variables:
n: the number of documents containing the term
N: the total number of documents containing the field

Case 1: one patient has both attributes positive. The two matches overlap in one document, so only one document matches.

Search hits = 1. Let's calculate the idf with these three methods, to show that the idfs are the same when the matching documents completely overlap.

Method 1: calculate idf per field, get the max of idf per fields

for doc_0, taking the maximum per field: n=1, N=1
for field "smoking": n =1, N =1,
for field "diabetes" n =1, N=1

Method 2: calculate idf per field, get the average of idf per fields

for doc_0, taking the average per field: n=1, N=1
for field "smoking": n =1, N =1,
for field "diabetes" n =1, N=1

Method 3: calculate idf per document, all fields sharing the same idf

there is only doc_0 matching the term "positive"
for doc_0, n=1, N=1
for field "smoking": n =1, N =1,
for field "diabetes" n =1, N=1

Case 2: two patients have distinct positive attributes. The matching documents do not overlap, and both documents match.

Search hits = 2. Let's calculate the idf with these three methods:

Method 1: calculate idf per field, get the max of idf per fields

for doc_0, taking the maximum per field: n=1, N=2
for field "smoking": n =1, N =2,
for field "diabetes" n =0, N =2
for doc_1, taking the maximum per field: n=1, N=2
for field "smoking": n =0, N =2,
for field "diabetes" n =1, N =2

Method 2: calculate idf per field, get the average of idf per fields

for doc_0, taking the average per field: n=0.5, N=2
for field "smoking": n =1, N =2,
for field "diabetes" n =0, N =2

for doc_1, taking the average per field: n=0.5, N=2
for field "smoking": n =0, N =2,
for field "diabetes" n =1, N =2

Method 3: calculate idf per document, all fields sharing the same idf

there are 2 docs matching the term "positive"

for doc_0, n=2, N=2: two documents contain the term "positive" and two documents contain the fields
for field "smoking": n =1, N =2,
for field "diabetes" n =0, N=2

for doc_1, n=2, N=2: two documents contain the term "positive" and two documents contain the fields
for field "smoking": n = 0, N =2,
for field "diabetes" n =1, N=2

So in case 2, the current method 1 computes a larger idf than the real one (it uses n = 1 instead of 2, treating the term as rarer than it is), while method 3 gives the real idf by definition.
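The case 2 comparison can be checked numerically; a sketch (helper name mine) using the n/N values listed above:

```python
import math

def bm25_idf(n: int, N: int) -> float:
    """BM25 idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

N = 2  # both documents have both fields

# Method 1: per matching field, n = 1 for each doc's matching field
method1 = bm25_idf(1, N)  # ln(2) ≈ 0.6931

# Method 3: n counts documents matching the term in any field, so n = 2
method3 = bm25_idf(2, N)  # ln(1.2) ≈ 0.1823

print(method1, method3)  # method 1 yields the larger idf
```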

@mingshl
Contributor

mingshl commented Oct 20, 2023

To simplify this bug, I tried getting rid of the function score query and using the multi-match query with cross_fields directly, and I can still reproduce the negative score, so the function score query is not related to this bug.

I found that the negative score arises because n, the number of documents containing the term, is not counted separately per field.

Here I reuse @macohen's shape_index documents, but with one shard. (Note that with multiple shards the document counts and document frequencies become complicated because the documents can be spread across shards; dfs_query_then_fetch can avoid the complication, but I will use one shard to make it easier to follow.)

curl -XPUT localhost:9200/shape_index -d ' {
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "thing_id": {
        "type": "integer"
      },
      "shape": {
        "type": "text"
      },
      "color": {
        "type": "text", 
        "fields": {
          "standard": { "type": "text" },  
          "raw": { "type": "keyword" }     
        }
      }
    }
  }
}' -H "Content-Type:Application/json"

Use the multi-match query with cross_fields, getting rid of the function score query:


curl -XGET 'localhost:9200/shape_index/_search?pretty=true&search_type=dfs_query_then_fetch' -d ' {
      "query": {
        "bool": {
          "must": [
            {"multi_match": {
                "query": "red",
                "type": "cross_fields",
                "fields": [
                  "color^5",
                  "shape^2"                  
                ],
                "slop": 10
              }}
          ]
        }
      },
      "explain": true
}' -H "Content-Type:Application/json"

In the response for document id 2.

{
"thing_id": 2,
  "shape": "red square",
  "color": "orange red yellow"
  }

the explain is

{
              "value" : -0.4462871,
              "description" : "weight(shape:red in 1) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : -0.4462871,
                  "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                  "details" : [
                    {
                      "value" : 4.4,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : -0.22314355,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 1,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
            ...
            {
              "value" : 7.8430796,
              "description" : "weight(color:red in 1) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 7.8430796,
                  "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                  "details" : [
                    {
                      "value" : 11.0,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.5686159,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 11,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },

Both fields, color and shape, report n=2; however, from the documents we can tell that the shape field has only 1 document containing the term 'red'.

Looking at the DisjunctionMaxQuery, the cross_fields query is rewritten as ((color:red)^5 | (shape:red)^2); if the logic worked as expected, the document frequency should be calculated per field.

Will look further..

@msfroh
Collaborator

msfroh commented Feb 27, 2024

I think "Method 3: calculate idf per document, all fields sharing the same idf" makes the most sense.

The BM25 calculation assumes that it is applied to terms from a single field. If a single term matches across multiple fields, the document and term frequencies should be calculated across those fields too, IMO. (Of course, if another term only matches on one of the fields, should we take the IDF of that term across all fields? Probably.)

@mingshl mingshl removed their assignment May 7, 2024
@ylwu-amzn
Contributor

ylwu-amzn commented May 9, 2024

@msfroh This is an interesting problem which needs careful design of the algorithm. We need to consider various edge cases, and it may impact existing users if the algorithm changes.
How about we ask the scientist team to take a look and finalize the algorithm first? Then we would be in a better position to evaluate the required effort and make an effective plan.

@mingshl
Contributor

mingshl commented May 9, 2024

One of the biggest cons of "calculating idf per document, all fields sharing the same idf" is that the per-field scores will be skewed because of the weights.

In the above query, the weights are different:

"fields": [
  "color^5",
  "shape^2"
],

If we apply the same idf to every field,

in this document,
"color" : "red",
"shape": "red red red red red red red red red red red red red red red red red red red red"

in PerFieldSimilarity the color field will have a higher score than the shape field, because the two fields share the same idf even though shape has a much higher term frequency.
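This skew can be illustrated with BM25's tf saturation. A sketch with assumed defaults (k1 = 1.2, dl = avgdl so tf = freq / (freq + k1)); the scores are illustrative, not taken from a real index:

```python
# With a shared idf, the per-field score is boost * idf * tf, and tf saturates:
# 20 occurrences of "red" barely beat 1 occurrence.
k1 = 1.2
idf = 1.0  # shared idf under "Method 3" (arbitrary positive value)

def tf(freq: float) -> float:
    return freq / (freq + k1)  # dl == avgdl simplification

color_score = 5 * idf * tf(1)   # boost 5, "red" appears once   -> ≈ 2.27
shape_score = 2 * idf * tf(20)  # boost 2, "red" appears 20x    -> ≈ 1.89

print(color_score > shape_score)  # color outranks shape despite the tf gap
```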

@msfroh
Collaborator

msfroh commented May 9, 2024

How about we ask scientist team to take a look and finalize the algorithms first. Then we would be in a better position to evaluate the required effort and make an effective plan.

As far as I know, the real problem is less about the scores being "wrong" and more about the negative scores causing an exception.

How about if somewhere around here, we say:

if (subQueryScore < 0) {
  subQueryScore = 0.0f;
}

We should still offer a better cross-field score, but this would at least address the immediate pain point.

The "real" solution is probably something like BM25F, as discussed in #3996

@getsaurabh02
Member

the real problem is less about the scores being "wrong" and more about the negative scores causing an exception

Agree with @msfroh on this: addressing the negative scores and absorbing the exception could be the first step to smooth out the experience and unblock users. This can be followed by better scoring implementations as a follow-up step.

@msfroh
Collaborator

msfroh commented May 10, 2024

It took a little while to produce a failing test, because the examples given by @macohen and @mingshl above don't return negative scores: the dismax query that the multi_match query uses takes the max of a positive and a negative number, so it returns the positive number. The trick is to set a nonzero tie_breaker so the negative value is added to the positive value. Then we just need to boost the field that produces the negative value enough that it pulls the positive value into negative territory.
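The dismax-plus-tie_breaker mechanics can be sketched like this (the clause scores are illustrative, not taken from a real query):

```python
def dismax(scores: list[float], tie_breaker: float = 0.0) -> float:
    """Disjunction-max scoring: best clause plus tie_breaker times the rest."""
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

# tiny positive color^0.1 clause, heavily boosted negative shape^100 clause
clause_scores = [0.05, -3.0]

print(dismax(clause_scores))       # 0.05: plain dismax hides the negative clause
print(dismax(clause_scores, 0.1))  # ≈ -0.25: tie_breaker pulls the total negative
```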

Here's the YAML REST test that fails:

---
"Do not throw exception if input score is negative":
  - do:
      index:
        index: test
        id: 1
        body: { "color" : "orange red yellow" }
  - do:
      index:
        index: test
        id: 2
        body: { "color": "orange red purple", "shape": "red square" }
  - do:
      index:
        index: test
        id: 3
        body: { "color" : "orange red yellow purple" }
  - do:
      indices.refresh: { }
  - do:
      search:
        index: test
        body:
          query:
            function_score:
              query:
                multi_match:
                  query: "red"
                  type: "cross_fields"
                  fields: [ "color^0.1", "shape^100"]
                  tie_breaker: 0.1
              functions: [{
                "script_score": {
                  "script": {
                    "lang": "painless",
                    "source": "_score"
                  }
                }
              }]
          explain: true
  - match: { hits.total.value: 3 }
  - match: { hits.hits.1._score: 0.0 }

I'll post the cheap workaround (set the value to zero if negative) to make it succeed.

@ylwu-amzn
Contributor

Thanks @msfroh, sounds like a plan: fix the negative value first if there are no concerns, then we can figure out a better way to calculate the score.

@msfroh
Collaborator

msfroh commented May 25, 2024

I actually found what I believe is a pretty safe fix for the underlying issue that should have no effect on any "normal" index, where the fields being searched are present in most documents: #13829

It only impacts scoring in the specific cases that may produce negative scores (where the maximum document frequency of any term exceeds the minimum number of docs containing a field, across all fields).
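Whatever the final scoring change looks like, the invariant it needs is easy to state: once n <= N, this idf formula is always positive, since the numerator N - n + 0.5 stays at least 0.5. A sketch of that clamp (my own illustration of the invariant, not the code in #13829):

```python
import math

def clamped_idf(n: int, N: int) -> float:
    # never let the blended term doc frequency exceed the field's doc count
    n = min(n, N)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# With n <= N the log argument exceeds 1, so the idf is always positive.
print(clamped_idf(2, 1))  # clamps to n = 1: ln(1 + 0.5/1.5) ≈ 0.2877
```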

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment