
[BUG] function score query returned an invalid (negative) score with multi match cross fields query #7860

Open
snikoyo opened this issue Jun 1, 2023 · 14 comments · May be fixed by #13829
Labels
bug Something isn't working Search:Relevance

Comments

@snikoyo

snikoyo commented Jun 1, 2023

Describe the bug

The cross_fields scoring type can produce negative scores when some documents are missing fields.

This bug was already reported for Elasticsearch and fixed there (via the linked MR) for Elasticsearch >= 8.4.

We encounter it in production for certain queries that search a field which does not exist in all documents; for those queries it reproduces deterministically.

Error message:

{
  "error": {
    "root_cause": [
      {
        "type": "exception",
        "reason": "function score query returned an invalid score: -0.05202394 for doc: 1123896"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my_index",
        "node": "my_node",
        "reason": {
          "type": "exception",
          "reason": "function score query returned an invalid score: -0.05202394 for doc: 1123896"
        }
      }
    ]
  },
  "status": 500
}

The document number mentioned in the error message does not exist.

To Reproduce

The steps to reproduce are in the above-mentioned bug report for Elasticsearch.

Expected behavior

The function score should never return a negative value.

Plugins

None.

Host/Environment (please complete the following information):

OpenSearch version: 2.5

@snikoyo snikoyo added bug Something isn't working untriaged labels Jun 1, 2023
@andrross andrross added the Search Search query, autocomplete ...etc label Jun 1, 2023
@sejli sejli removed the untriaged label Jun 7, 2023
@macohen
Contributor

macohen commented Jun 29, 2023

In this case, the term "red" appears in the color field of every document, and it also appears in the field "shape", which only exists in one doc.

Given n = 4 (number of documents containing the term) and N = 3 (total number of documents with the field), the IDF calculation log(1 + (N - n + 0.5) / (n + 0.5)) gives log(1 + (3 - 4 + 0.5) / (4 + 0.5)) = -0.11778303 with the natural log. I also plugged this into another calculator and got -0.0511525224, which is the same value in base-10 log. We have two things to investigate:

  • How can we correct the negative value and still maintain the right relative scores?
  • Is the IDF calculation described in the explanation actually what's happening in the code?
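The two calculator results above differ only in the log base. A quick sketch (the helper name `bm25_idf` is mine, not from the codebase) confirms both values and shows that any n > N makes the idf negative:

```python
import math

def bm25_idf(n: int, N: int, log=math.log) -> float:
    """BM25 idf as reported by explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return log(1 + (N - n + 0.5) / (n + 0.5))

# n = 4 documents containing the term, N = 3 documents with the field
print(bm25_idf(4, 3))              # natural log: -0.117783...
print(bm25_idf(4, 3, math.log10))  # base-10 log: -0.051152...

# Lucene uses the natural log; whenever n > N the argument drops below 1,
# so the idf (and hence the score contribution) goes negative.
```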
DELETE shape_index 

PUT shape_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "thing_id": {
        "type": "integer"
      },
      "shape": {
        "type": "text"
      },
      "color": {
        "type": "text", 
        "fields": {
          "standard": { "type": "text" },  
          "raw": { "type": "keyword" }     
        }
      }
    }
  }
}


PUT /shape_index/_doc/1
{
  "thing_id": 1,
  "color": "orange red yellow"
}



PUT /shape_index/_doc/2
{
  "thing_id": 2,
  "shape": "red square",
  "color": "orange red yellow"
}


PUT /shape_index/_doc/3
{
  "thing_id": 3,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/4
{
  "thing_id": 4,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/5
{
  "thing_id": 5,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/6
{
  "thing_id": 6,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/7
{
  "thing_id": 7,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/8
{
  "thing_id": 8,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/9
{
  "thing_id": 9,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/10
{
  "thing_id": 10,
  "color": "orange red yellow"
}


PUT /shape_index/_doc/11
{
  "thing_id": 11,
  "color": "orange red yellow"
}


GET /shape_index/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "red",
                "type": "cross_fields",
                "fields": [
                  "color^5",
                  "shape^2"                  
                ],
                "slop": 10
              }
            }
          ]
        }
      },
      "functions": []
    }
  },
  "from": 0,
  "size": 10,
  "explain": true
}

extract from results with explain:

{
              "value": -0.4462871,
              "description": "weight(shape:red in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": -0.4462871,
                  "description": "score(freq=1.0), computed as boost * idf * tf from:",
                  "details": [
                    {
                      "value": 4.4,
                      "description": "boost",
                      "details": []
                    },
                    {
                      "value": -0.22314355,
                      "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details": [
                        {
                          "value": 2,
                          "description": "n, number of documents containing term",
                          "details": []
                        },
                        {
                          "value": 1,
                          "description": "N, total number of documents with field",
                          "details": []
                        }
                      ]
                    },
                    {
                      "value": 0.45454544,
                      "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details": [
                        {
                          "value": 1,
                          "description": "freq, occurrences of term within document",
                          "details": []
                        },
                        {
                          "value": 1.2,
                          "description": "k1, term saturation parameter",
                          "details": []
                        },
                        {
                          "value": 0.75,
                          "description": "b, length normalization parameter",
                          "details": []
                        },
                        {
                          "value": 2,
                          "description": "dl, length of field",
                          "details": []
                        },
                        {
                          "value": 2,
                          "description": "avgdl, average length of field",
                          "details": []
                        }
                      ]
                    }
                  ]
                }
              ]
            }
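The negative weight in the explain output above can be verified by hand. A sketch (variable names are mine) multiplying boost * idf * tf with the values from the explanation; the 4.4 boost appears to be the field boost of 2 scaled by k1 + 1 = 2.2, which is an assumption about how the similarity folds that factor in, but the arithmetic below only uses the numbers explain reports:

```python
import math

boost = 4.4                     # as reported by explain (field boost 2 x 2.2)
n, N = 2, 1                     # docs containing the term vs. docs with the field
freq, k1, b, dl, avgdl = 1.0, 1.2, 0.75, 2.0, 2.0

idf = math.log(1 + (N - n + 0.5) / (n + 0.5))       # -0.22314...
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))  # 0.45454...
score = boost * idf * tf

print(score)  # ≈ -0.4462871, matching weight(shape:red) above
```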

@dblock
Member

dblock commented Jul 6, 2023

@snikoyo please do help us with a fix if you have time, making sure we're not copying any non-APLv2 code of course

@msfroh
Collaborator

msfroh commented Sep 13, 2023

{
"value": 2,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 1,
"description": "N, total number of documents with field",
"details": []
}

How do more documents have the term than have the field?! head-asplode

@mingshl
Contributor

mingshl commented Sep 18, 2023

Looking closely at the shape_index example above,

I got some findings on the above two questions by validating the numbers from explain. The explain function does its work pretty nicely. (I am using 2 decimals below to keep the numbers short.)

  • How can we correct the negative value and still maintain the right relative scores?

    First of all, the function score takes the maximum score calculated per document per field.

    For example, the sample shape_index above has 11 docs; the query red matches the color field in all 11 docs and the shape field in 1 doc.
    Among all fields, color has the same score of 3.46 in all 11 docs, and shape has the score -0.44.

    As a result, this query's function score takes the maximum over the color and shape fields, which is 3.46 instead of -0.44.

  • Is the IDF calculation described in the explanation actually what's happening in the code?

    YES. The description of IDF matches the code exactly.

    Let's break down the negative idf score of the shape field, -0.22:
    by the description, "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))", where "n, number of documents containing term" is 2, and "N, total number of documents with field" is 1,
    so idf_shape = log(1 + (1 - 2 + 0.5) / (2 + 0.5)) = log(1 + (-0.2)) = log(0.8) = -0.22

So in this bug, the negative number is caused by N < n: the total number of documents with the field is smaller than the number of documents counted as containing the term, which happens when a document has more than one field containing the query term.

Both N and n are not populated correctly.

This is how idf is currently calculated:

  • field shape: n = 2 documents counted as containing the term red, but only N = 1 document has the field 'shape';
  • field color: n = 2 documents counted as containing the term red, and N = 4 documents have the field 'color'.

The fix for the idf calculated when matching multiple fields is to have logic that does the idf calculation separately per field:

  • field shape: n = 1 document containing the term red, and N = 1 document with the field 'shape';
  • field color: n = 11 documents containing the term red, and N = 11 documents with the field 'color'.

In the same example of shape's idf score of -0.22, the proposed fix would calculate N = 1, n = 1: idf_shape = log(1 + (1 - 1 + 0.5) / (1 + 0.5)) = log(1 + 0.5/1.5) = log(1.33) = 0.29.
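The before/after idf values for the shape field can be checked with a quick script (the helper name is mine):

```python
import math

def bm25_idf(n: int, N: int) -> float:
    """BM25 idf as reported by explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Current blended counts for the shape field: n = 2, but only N = 1 doc has the field
print(bm25_idf(2, 1))  # ≈ -0.2231, negative because n > N

# Proposed per-field counts for shape: n = 1, N = 1
print(bm25_idf(1, 1))  # ≈ 0.2877, positive again
```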

I'd like some opinions on the proposed fix logic above...

@anasalkouz anasalkouz added Search:Relevance and removed Search Search query, autocomplete ...etc labels Sep 20, 2023
@mingshl
Contributor

mingshl commented Sep 21, 2023

I am thinking more deeply about the most appropriate way to calculate and show the idf score per doc, and per doc per field.

In the response, the structure is: per doc_id, the score is the max of the per-field scores.

hits: {
  "max_score": 3.465736,
  "hits": [
    {
      "_id": 2,
      "_score": 3.465736,
      "_explanation": {
        "value": 3.465736,
        "description": "max of:",
        "details": [
          {
            "value": 3.465736,
            "description": "weight(field_1) [PerFieldSimilarity], result of:",
            "details": [ ... "score", "idf", "tf" ]
          },
          {
            "value": 1.234334243,
            "description": "weight(field_2) [PerFieldSimilarity], result of:",
            "details": [ ... "score", "idf", "tf" ]
          }
        ]
      }
    },
    { "_id": 3, ... }
  ]
}

The above example has overlap in the matching documents, so the idf calculation makes sense. But if the cross fields have no overlap in the matching documents, the max-per-field method computes a larger idf than the real idf.

Let's see another example of Index heart_diseases

mappings: 
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "smoking": {
        "type": "text"
      },
      "diabetes": {
        "type": "text"
      }
    }
  }
}

In calculating idf for cross fields, both 'smoking' and 'diabetes' are searched to match the term 'positive'.

Currently, explain reports idf per shard, per field. Let's restrict the discussion to one shard to simplify the problem and explain each field's idf.

  • Method 1: calculate idf per field, get the max of idf per fields (Current method)
  • Method 2: calculate idf per field, get the average of idf per fields
  • Method 3: calculate idf per document, all fields sharing the same idf

For calculating idf there are two variables:
n: the number of documents containing the term
N: the total number of documents containing the field

Case 1: one patient has both attributes positive. The two matches overlap in one document, so only one document matches.

Search hits = 1. Let's calculate the idf with these three methods, to show that the idfs are the same when the matching documents completely overlap.

Method 1: calculate idf per field, get the max of idf per fields

for doc_0, taking the maximum per field: n=1, N=1
for field "smoking": n =1, N =1,
for field "diabetes" n =1, N=1

Method 2: calculate idf per field, get the average of idf per fields

for doc_0, taking the average per field: n=1, N=1
for field "smoking": n =1, N =1,
for field "diabetes" n =1, N=1

Method 3: calculate idf per document, all fields sharing the same idf

there is only doc_0 matching the term "positive"
for doc_0, n=1, N=1
for field "smoking": n =1, N =1,
for field "diabetes" n =1, N=1

Case 2: two patients have distinct positive attributes. The matching documents do not overlap, and both documents match.

Search hits = 2. Let's calculate the idf with these three methods:

Method 1: calculate idf per field, get the max of idf per fields

for doc_0, taking the maximum per field: n=1, N=2
for field "smoking": n =1, N =2,
for field "diabetes" n =0, N =2
for doc_1, taking the maximum per field: n=1, N=2
for field "smoking": n =0, N =2,
for field "diabetes" n =1, N =2

Method 2: calculate idf per field, get the average of idf per fields

for doc_0, taking the average per field: n=0.5, N=2
for field "smoking": n =1, N =2,
for field "diabetes" n =0, N =2

for doc_1, taking the average per field: n=0.5, N=2
for field "smoking": n =0, N =2,
for field "diabetes" n =1, N =2

Method 3: calculate idf per document, all fields sharing the same idf

there are 2 docs matching the term "positive"

for doc_0, n=2, N=2: two documents contain the term "positive" and two documents contain the fields
for field "smoking": n =1, N =2,
for field "diabetes" n =0, N=2

for doc_1, n=2, N=2: two documents contain the term "positive" and two documents contain the fields
for field "smoking": n = 0, N =2,
for field "diabetes" n =1, N=2

So in case 2, the current method 1 computes a larger idf than the real one (it uses n = 1 instead of 2, treating the term as rarer than it is), while method 3 gives the real idf by definition.
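The case 2 comparison can be checked numerically; a sketch (helper name mine) using the n/N values listed above:

```python
import math

def bm25_idf(n: int, N: int) -> float:
    """BM25 idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

N = 2  # both documents have both fields

# Method 1: per matching field, n = 1 for each doc's matching field
method1 = bm25_idf(1, N)  # ln(2) ≈ 0.6931

# Method 3: n counts documents matching the term in any field, so n = 2
method3 = bm25_idf(2, N)  # ln(1.2) ≈ 0.1823

print(method1, method3)  # method 1 yields the larger idf
```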

@mingshl
Contributor

mingshl commented Oct 20, 2023

To simplify this bug, I tried getting rid of the function score query and using the multi-match query with cross_fields directly, and I can still reproduce the negative score, so the function score query is not related to this bug.

I found that the negative score arises because n, the number of documents containing the term, is not counted separately per field.

Here I reuse @macohen's shape_index documents, but with one shard. (Note that with multiple shards the document counts and document frequencies become complicated because the documents can be spread across shards; dfs_query_then_fetch can avoid the complication, but I will use one shard to make it easier to follow.)

curl -XPUT localhost:9200/shape_index -d ' {
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "thing_id": {
        "type": "integer"
      },
      "shape": {
        "type": "text"
      },
      "color": {
        "type": "text", 
        "fields": {
          "standard": { "type": "text" },  
          "raw": { "type": "keyword" }     
        }
      }
    }
  }
}' -H "Content-Type:Application/json"

Use the multi-match query with cross_fields, getting rid of the function score query:


curl -XGET 'localhost:9200/shape_index/_search?pretty=true&search_type=dfs_query_then_fetch' -d ' {
      "query": {
        "bool": {
          "must": [
            {"multi_match": {
                "query": "red",
                "type": "cross_fields",
                "fields": [
                  "color^5",
                  "shape^2"                  
                ],
                "slop": 10
              }}
          ]
        }
      },
      "explain": true
}' -H "Content-Type:Application/json"

In the response for document id 2.

{
"thing_id": 2,
  "shape": "red square",
  "color": "orange red yellow"
  }

the explain is

{
              "value" : -0.4462871,
              "description" : "weight(shape:red in 1) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : -0.4462871,
                  "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                  "details" : [
                    {
                      "value" : 4.4,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : -0.22314355,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 1,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
            ...
            {
              "value" : 7.8430796,
              "description" : "weight(color:red in 1) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 7.8430796,
                  "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                  "details" : [
                    {
                      "value" : 11.0,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.5686159,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 2,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 11,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },

Both fields, color and shape, report n=2; however, from the documents we can tell that the shape field has only 1 document containing the term 'red'.

Looking at the DisjunctionMaxQuery, the cross_fields query is rewritten as ((color:red)^5 | (shape:red)^2); if the logic worked as expected, the document frequency should be calculated per field.

Will look further..

@msfroh
Collaborator

msfroh commented Feb 27, 2024

I think "Method 3: calculate idf per document, all fields sharing the same idf" makes the most sense.

The BM25 calculation assumes that it is applied to terms from a single field. If a single term matches across multiple fields, the document and term frequencies should be calculated across those fields too, IMO. (Of course, if another term only matches on one of the fields, should we take the IDF of that term across all fields? Probably.)

@mingshl mingshl removed their assignment May 7, 2024
@ylwu-amzn
Contributor

ylwu-amzn commented May 9, 2024

@msfroh This is an interesting problem which needs careful design of the algorithm. We need to consider various edge cases, and it may impact existing users if the algorithm changes.
How about we ask the scientist team to take a look and finalize the algorithm first? Then we would be in a better position to evaluate the required effort and make an effective plan.

@mingshl
Contributor

mingshl commented May 9, 2024

One of the biggest cons of "calculating idf per document, all fields sharing the same idf" is that the per-field scores will be skewed because of the weights.

In the above query, the weights are different:

"fields": [
  "color^5",
  "shape^2"
],

If we apply the same idf to every field,

in this document,
"color" : "red",
"shape": "red red red red red red red red red red red red red red red red red red red red"

in PerFieldSimilarity the color field will have a higher score than the shape field, because the two fields share the same idf even though shape has a much higher term frequency.
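This skew can be illustrated with BM25's tf saturation. A sketch with assumed defaults (k1 = 1.2, dl = avgdl so tf = freq / (freq + k1)); the scores are illustrative, not taken from a real index:

```python
# With a shared idf, the per-field score is boost * idf * tf, and tf saturates:
# 20 occurrences of "red" barely beat 1 occurrence.
k1 = 1.2
idf = 1.0  # shared idf under "Method 3" (arbitrary positive value)

def tf(freq: float) -> float:
    return freq / (freq + k1)  # dl == avgdl simplification

color_score = 5 * idf * tf(1)   # boost 5, "red" appears once   -> ≈ 2.27
shape_score = 2 * idf * tf(20)  # boost 2, "red" appears 20x    -> ≈ 1.89

print(color_score > shape_score)  # color outranks shape despite the tf gap
```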

@msfroh
Collaborator

msfroh commented May 9, 2024

How about we ask scientist team to take a look and finalize the algorithms first. Then we would be in a better position to evaluate the required effort and make an effective plan.

As far as I know, the real problem is less about the scores being "wrong" and more about the negative scores causing an exception.

How about if somewhere around here, we say:

if (subQueryScore < 0) {
  subQueryScore = 0.0f;
}

We should still offer a better cross-field score, but this would at least address the immediate pain point.

The "real" solution is probably something like BM25F, as discussed in #3996

@getsaurabh02
Member

the real problem is less about the scores being "wrong" and more about the negative scores causing an exception

Agree with @msfroh on this: addressing the negative scores and absorbing the exception could be the first step to smooth out the experience and unblock users. This can be followed by better scoring implementations as a follow-up step.

@msfroh
Collaborator

msfroh commented May 10, 2024

It took a little while to produce a failing test, because the examples given by @macohen and @mingshl above don't return negative scores: the dismax query that the multi_match query uses takes the max of a positive and a negative number, so it returns the positive number. The trick is to set a nonzero tie_breaker so the negative value is added to the positive value. Then we just need to boost the field that produces the negative value enough that it pulls the positive value into negative territory.
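The dismax-plus-tie_breaker mechanics can be sketched like this (the clause scores are illustrative, not taken from a real query):

```python
def dismax(scores: list[float], tie_breaker: float = 0.0) -> float:
    """Disjunction-max scoring: best clause plus tie_breaker times the rest."""
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

# tiny positive color^0.1 clause, heavily boosted negative shape^100 clause
clause_scores = [0.05, -3.0]

print(dismax(clause_scores))       # 0.05: plain dismax hides the negative clause
print(dismax(clause_scores, 0.1))  # ≈ -0.25: tie_breaker pulls the total negative
```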

Here's the YAML REST test that fails:

---
"Do not throw exception if input score is negative":
  - do:
      index:
        index: test
        id: 1
        body: { "color" : "orange red yellow" }
  - do:
      index:
        index: test
        id: 2
        body: { "color": "orange red purple", "shape": "red square" }
  - do:
      index:
        index: test
        id: 3
        body: { "color" : "orange red yellow purple" }
  - do:
      indices.refresh: { }
  - do:
      search:
        index: test
        body:
          query:
            function_score:
              query:
                multi_match:
                  query: "red"
                  type: "cross_fields"
                  fields: [ "color^0.1", "shape^100"]
                  tie_breaker: 0.1
              functions: [{
                "script_score": {
                  "script": {
                    "lang": "painless",
                    "source": "_score"
                  }
                }
              }]
          explain: true
  - match: { hits.total.value: 3 }
  - match: { hits.hits.1._score: 0.0 }

I'll post the cheap workaround (set the value to zero if negative) to make it succeed.

@ylwu-amzn
Contributor

Thanks @msfroh, sounds like a plan: fix the negative value first if there are no concerns, then we can figure out a better way to calculate the score.

@msfroh
Collaborator

msfroh commented May 25, 2024

I actually found what I believe is a pretty safe fix for the underlying issue that should have no effect on any "normal" index, where the fields being searched are present in most documents: #13829

It only impacts scoring in the specific cases that may produce negative scores (where the maximum document frequency of any term exceeds the minimum number of docs containing a field, across all fields).
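Whatever the final scoring change looks like, the invariant it needs is easy to state: once n <= N, this idf formula is always positive, since the numerator N - n + 0.5 stays at least 0.5. A sketch of that clamp (my own illustration of the invariant, not the code in #13829):

```python
import math

def clamped_idf(n: int, N: int) -> float:
    # never let the blended term doc frequency exceed the field's doc count
    n = min(n, N)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# With n <= N the log argument exceeds 1, so the idf is always positive.
print(clamped_idf(2, 1))  # clamps to n = 1: ln(1 + 0.5/1.5) ≈ 0.2877
```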

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment