NucliaDB Query Grammar proposal #1345
Conversation
Codecov Report: patch and project coverage have no change.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main    #1345   +/-  ##
=======================================
  Coverage   85.54%   85.54%
=======================================
  Files         431      431
  Lines       23629    23629
=======================================
  Hits        20214    20214
  Misses       3415     3415
=======================================
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
### Solution 2: Rust query parser

Implementing a query parser in Rust replacing tantivy's one will solve the
would that be implementing the same DSL or something new?
We could implement something smaller: exact terms, phrases ("exact text quoted"), and prefixes.
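A grammar that small needs very little machinery. As a rough illustration, here is a minimal Python sketch of such a parser; the AST class names and the exact tokenization rules are hypothetical, not NucliaDB's or tantivy's:

```python
# Hypothetical mini-grammar: bare terms, quoted phrases, and trailing-`*`
# prefix queries. An illustration only, not the actual NucliaDB parser.
import re
from dataclasses import dataclass


@dataclass
class Term:
    text: str


@dataclass
class Phrase:
    words: list


@dataclass
class Prefix:
    stem: str


# Either a double-quoted phrase or a run of non-whitespace characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse(query: str) -> list:
    nodes = []
    for quoted, bare in TOKEN.findall(query):
        if quoted:
            nodes.append(Phrase(quoted.split()))
        elif bare.endswith("*"):
            nodes.append(Prefix(bare[:-1]))
        else:
            nodes.append(Term(bare))
    return nodes
```

For example, `parse('big "bad wolf" wo*')` yields a term, a phrase, and a prefix node. A real implementation would also have to decide how to handle edge cases such as unbalanced quotes or a lone `*`.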
Implementing the query AST in protobuffers would give a simple mapping on the Rust side from the protobuf grammar to the tantivy or unified index one. However, we'll need to move query parsing logic to the Python side.
I think we should implement a query language.

Using protobuffers to define the AST may be a problem: either the protobuffer types will spread all over our codebase, making language modifications a nightmare, or sooner or later we will end up implementing a lowering pass and a new IR just to get rid of the protobuf (which is basically implementing a new parser). We also need an ergonomic AST so we don't end up suffering every time we define a pass. I don't think protobuf can give us that.

What I don't like about option 2 is that it misses the point of defining a language. If we are defining our own query language, having different parsers based on some client logic seems doomed to code repetition and, as always, time-consuming and error-prone language modifications. I think it would be better to implement this feature inside the language, maybe through something like tags (but we need to give it a thought). I would avoid as much as possible implementing different parsers for the same language inside the very project for which we defined the language.
I don't think protobuf is that bad for defining the AST. We could implement roughly the same types tantivy does (BooleanQuery, TermQuery, PhraseQuery...). Totally agree that if we are not careful we'll end up with even more protobuffer infection, though.

Could you explain this tags idea in more detail?

The problem with defining the language and only passing it to the node is the inability to know where it comes from (users or Python). If we want to do something special on the Python side (as with /catalog and /search), we'll need to parse the user query to build the Python one. So Python will need to know about the query language (either directly or through a binding).
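To make the tantivy-style typed AST mentioned above concrete, here is a small Python sketch of what those types could look like. The names loosely mirror tantivy's query types, but the shapes and fields are assumptions for illustration:

```python
# Sketch of a tantivy-like query AST: a boolean node composing term and
# phrase leaves. Names mirror tantivy loosely; fields are illustrative.
from dataclasses import dataclass
from enum import Enum


class Occur(Enum):
    MUST = "must"
    SHOULD = "should"
    MUST_NOT = "must_not"


@dataclass
class TermQuery:
    field: str
    term: str


@dataclass
class PhraseQuery:
    field: str
    terms: list


@dataclass
class BooleanQuery:
    clauses: list  # list of (Occur, subquery) pairs


# Example: documents that must contain "wolf" and should contain
# the phrase "big bad" in the (hypothetical) "body" field.
query = BooleanQuery([
    (Occur.MUST, TermQuery("body", "wolf")),
    (Occur.SHOULD, PhraseQuery("body", ["big", "bad"])),
])
```

Expressed as protobuf, each of these would become a message with a `oneof` over the query variants; the concern raised above is that the generated types are much less ergonomic to traverse than hand-written ones like these.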
Protobuffer infection is a problem, but there are other reasons why that is a bad road to take. Even though protobuf allows defining recursive types, we cannot control how they are represented, which is a crucial part of dealing with ASTs. We would be forced into however the library decides to implement this (most likely a nested Box), which may be far from the best fit at some point in time. Another problem with using protobuf is how difficult defining utilities like a visitor trait is going to be, since the actual Rust code of the AST is only available after compilation.

If what we want is to make it really easy to modify the query based on some business logic, there are better approaches. In my opinion our life would be much easier if:

- We implement a parser from the textual representation of the query language to JSON and vice versa. We can do this in Rust and provide bindings. This way we have a representation of the language that is easy to modify and in a format that most languages understand.
- The unified index can then use the JSON representation directly or define its own internal representation of the query language (probably optimised for the kind of passes it needs to do). But the definition of the language is centralised in one crate (the one that can produce JSON).

The Python side then just needs to translate the query to its JSON representation, make some changes if it needs to, and send it to the unified index.
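The Python-side step described above could look something like the following sketch. The JSON shape and the helper name are assumptions, not an agreed format:

```python
# Sketch: a JSON representation of the query AST that the Python side
# can rewrite for business logic before sending it to the index.
import json


def add_filter(query: dict, field: str, value: str) -> dict:
    """Wrap an existing query in a boolean node that additionally
    requires field:value -- the kind of rewrite discussed above."""
    return {
        "boolean": {
            "must": [query, {"term": {"field": field, "value": value}}],
        }
    }


# A query as the (hypothetical) Rust parser bindings might emit it:
parsed = {"phrase": {"field": "body", "terms": ["big", "bad", "wolf"]}}

rewritten = add_filter(parsed, "labels", "public")
payload = json.dumps(rewritten)  # what Python would send to the index
```

The point of the design is visible here: Python never parses the textual query language itself; it only manipulates a data structure whose definition lives in one Rust crate.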
Wow, sounds really cool and powerful. But maybe it's overkill for the requested story (which was basically to provide a prefix match on certain nucliadb_texts searches).

Do you think we should go directly for your implementation idea, or do a simpler approach? If we are going to develop this, we'll need to prioritize it first.
Sorry, I thought we already had a solution for the feature that is needed right now, and this was just for discussing a query language implementation for the near future.

For prefix match in nucliadb_texts, let's go for the quick and fast solution; the query language is more of a unified index topic IMO.
Right now I see in the [Tantivy documentation](https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html) that there is an easy path:

> Phrase terms also support the `*` prefix operator which switches the phrase's matching to consider all documents which contain the last term as a prefix, e.g. `"big bad wo"*` will match "big bad wolf".

Since we can add prefix match to phrase queries using Tantivy's query language, if you need an endpoint that runs a prefix query for a given word A, just send the query `"A"*` to the node. That should do it, right?
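Under that approach, the Python side only needs to build the phrase-prefix query string. A tiny hedged helper (the quote-escaping rule here is a guess; check how tantivy's parser actually handles embedded quotes before relying on it):

```python
# Build a tantivy phrase-prefix query string: "wo"* matches documents
# containing a term with prefix "wo" in the last phrase position.
def prefix_query(word: str) -> str:
    escaped = word.replace('"', '\\"')  # assumption: backslash-escaping
    return f'"{escaped}"*'
```

For example, `prefix_query("wo")` produces the string `"wo"*`, which is what would be sent to the node as the user query.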