Adding Function Score Query #2395

Open
alex-au-922 opened this issue May 12, 2024 · 4 comments · May be fixed by #2396

alex-au-922 commented May 12, 2024

Is your feature request related to a problem? Please describe.

Tantivy currently lacks a flexible scoring mechanism for retrieving relevant documents, something that is common in other search engines. Users can adjust scores through TopDocs' tweak_score method, but that method evaluates the score only at the final level, which makes it difficult to customize scoring per search branch (Query).

For example, when combining several queries with Disjunction Max, passing a closure that defines the score would give fine-grained control over the scoring and offset of each query. The existing solution relies solely on boosting, which can require several levels of nesting and becomes hard to read, and boosting does not allow offsets either.
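To make the difference concrete, here is a minimal sketch in plain Python. It does not use tantivy at all: `dis_max`, the branch names, and the raw scores are hypothetical, and the combination rule assumed is the usual disjunction-max one (best branch score plus a tie breaker times the remaining scores). It only illustrates why a per-branch function can express something that a single final-level tweak cannot.

```python
from typing import Callable, Dict, List

ScoreFn = Callable[[float], float]


def dis_max(branch_scores: List[float], tie_breaker: float = 0.0) -> float:
    """Best branch score plus tie_breaker times the sum of the other scores."""
    best = max(branch_scores)
    return best + tie_breaker * (sum(branch_scores) - best)


# Hypothetical raw scores of one document for two query branches.
raw: Dict[str, float] = {"title": 2.0, "body": 3.0}

# Query-level functions: each branch gets its own scaling and offset.
per_branch: Dict[str, ScoreFn] = {
    "title": lambda s: 2.0 * s + 1.0,  # scale and offset, not just a boost
    "body": lambda s: 0.5 * s,
}

adjusted = [per_branch[name](score) for name, score in raw.items()]
query_level = dis_max(adjusted)  # title wins: max(5.0, 1.5) = 5.0

# Final-level tweak: the function only sees the already combined score,
# so it cannot treat the title and body branches differently.
final_level = 2.0 * dis_max(list(raw.values())) + 1.0  # 2.0 * 3.0 + 1.0 = 7.0
```

In the query-level variant the title branch ends up deciding the score, whereas the final-level tweak can only rescale whatever branch already won.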


Describe the solution you'd like

By introducing a FunctionScoreQuery, users can define their own closure for the score modification algorithm at the query level. The score tweak happens before the final TopDocs tweak_score step, which gives users greater flexibility.

Introducing the FunctionScoreQuery brings several benefits:

  • Rather than nesting several queries and recursively applying the score method on the scorer when requirements are complicated, users can define a single function that is clean and readable. For simpler use cases, native query types like BoostQuery are preferred.
  • Other language bindings (like tantivy-py) can hardly inherit tantivy's Query struct because of differences between language implementations. For example, in tantivy-py, although PyO3 can turn Python objects into Rust structs, all the utility classes have to be defined first before tantivy-py can consume them.
  • Lucene and Elasticsearch (OpenSearch) offer similar functionality, but Lucene's implementation is very verbose, and Elasticsearch's implementation restricts the scope to the server, which is not flexible. With Rust closures, existing Lucene and Elasticsearch users could migrate their workloads to tantivy with simpler syntax than their current tools offer.

adamreichold linked a pull request May 12, 2024 that will close this issue

@adamreichold
Contributor

> Other language bindings (like tantivy-py) can hardly inherit tantivy's Query struct because of differences between language implementations. For example, in tantivy-py, although PyO3 can turn Python objects into Rust structs, all the utility classes have to be defined first before tantivy-py can consume them.

Out of curiosity, do you plan to pass a Python callable to this eventually? If so, I fear from personal experience that this might be prohibitively slow due to the overhead of acquiring the GIL and crossing the Python-Rust boundary.

@alex-au-922
Author

alex-au-922 commented May 12, 2024

> Out of curiosity, do you plan to pass a Python callable to this eventually? If so, I fear from personal experience that this might be prohibitively slow due to the overhead of acquiring the GIL and crossing the Python-Rust boundary.

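Yes, that's my ultimate goal. Actually, I have tried compiling the source in tantivy-py as follows:

import numba

# JIT-compile the scoring function so its body runs as native code.
@numba.njit
def score_add_10(score: float) -> float:
    return score + 10

# Query.function_score_query comes from my local tantivy-py build of this
# proposal, not from a released API; const_score_query is built beforehand.
function_score_query = Query.function_score_query(
    const_score_query, lambda _, score, __: score_add_10(score)
)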

and the numba trick works. I understand that PyO3 currently requires acquiring the GIL to call into Python, but I'm not sure whether this will still be the case in the future. If this PR is accepted, the performance of calling Python from Rust with JIT or other compiled code should be investigated further. Other than that, I think this feature should exist on its own merits; providing templates for other languages is just an extra benefit.

Also, as Elasticsearch's documentation says (and the same applies to Lucene), the function score query should be applied only after the majority of documents have been filtered out. I also expect users to apply the score mapping only after the retrieval stage, since iterating through all the documents is slow.
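As a rough illustration of that ordering, here is a sketch in plain Python with no tantivy involved. `rescore_top`, the `keep` predicate, and the hit list are all hypothetical. The point is only that the (possibly expensive) score function is invoked for the surviving hits, not for every document.

```python
from typing import Callable, List, Tuple

Hit = Tuple[int, float]  # (doc_id, score)


def rescore_top(
    hits: List[Hit],
    keep: Callable[[int, float], bool],
    score_fn: Callable[[float], float],
) -> List[Hit]:
    """Filter first, then apply score_fn only to the hits that survive."""
    survivors = [(doc_id, score) for doc_id, score in hits if keep(doc_id, score)]
    rescored = [(doc_id, score_fn(score)) for doc_id, score in survivors]
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)


seen: List[float] = []  # records every score the function was called with


def counted_add_10(score: float) -> float:
    seen.append(score)
    return score + 10.0


hits = [(1, 0.5), (2, 3.0), (3, 1.5), (4, 0.25)]
ranked = rescore_top(hits, keep=lambda _doc_id, s: s >= 1.0, score_fn=counted_add_10)
# Only two of the four hits reach the score function.
```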

@adamreichold
Contributor

> I understand that PyO3 currently requires acquiring the GIL to call into Python, but I'm not sure whether this will still be the case in the future.

It certainly still does, and even though we are aware of nogil CPython builds, many issues around them remain unresolved, so I wouldn't hold my breath. This is particularly problematic because tantivy-py explicitly releases the GIL during search (to allow multi-threaded servers in a Python application), which means that invoking a callback involves not just checking that the GIL is held but actually acquiring the lock, bouncing it around all search threads in the worst case.

> Other than that, I think this feature should exist on its own merits; providing templates for other languages is just an extra benefit.

I did not add this to argue against the feature itself, just wanted to share some unhappy experiences trying to inject behaviour as Python code into Rust code.

@alex-au-922
Author

I think there could be some alternatives for integrating Python code. The first thing that comes to mind is a set of pre-built 'function factories' that perform function currying, so users just plug in their parameters and the function is executed in Rust. Say users want a y = m * pow(score, n) + C function.
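The currying idea can be sketched as follows. This is plain Python and only shows the shape of such a factory; `power_score` is a hypothetical name. In the alternative described above the returned function would be implemented on the Rust side, so that no Python callable has to be invoked per document.

```python
from typing import Callable


def power_score(m: float, n: float, c: float) -> Callable[[float], float]:
    """Fix the parameters once and return y = m * pow(score, n) + c."""

    def apply(score: float) -> float:
        return m * pow(score, n) + c

    return apply


# The user supplies parameters only; the returned function takes the score.
boost = power_score(m=2.0, n=2.0, c=1.0)
result = boost(3.0)  # 2.0 * 3.0**2 + 1.0 = 19.0
```

Because the user passes only numbers, a binding could hand those parameters to native code once instead of calling back into Python for each scored document.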

For more complicated use cases, they might build their own PyO3 distribution with an additional function signature that suits their case. Although this seems quite similar to implementing their own Query struct, it is much less work, because they don't need to figure out the whole querying logic such as Weight and Scorer.
