Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NucliaDB Query Grammar proposal #1345

Closed
wants to merge 1 commit into from
Closed

Conversation

jotare
Copy link
Contributor

@jotare jotare commented Sep 18, 2023

Description

Describe the proposed changes made in this PR.

How was this PR tested?

Describe how you tested this PR.

@jotare jotare requested a review from a team as a code owner September 18, 2023 09:51
@codecov
Copy link

codecov bot commented Sep 18, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (feeaeb7) 85.54% compared to head (e246029) 85.54%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1345   +/-   ##
=======================================
  Coverage   85.54%   85.54%           
=======================================
  Files         431      431           
  Lines       23629    23629           
=======================================
  Hits        20214    20214           
  Misses       3415     3415           
Flag Coverage Δ
ingest 75.06% <ø> (ø)
nucliadb 67.16% <ø> (ø)
reader 74.65% <ø> (ø)
sdk 42.51% <ø> (ø)
train 60.25% <ø> (ø)
utils 84.83% <ø> (ø)
writer 84.98% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


### Solution 2: Rust query parser

Implementing a query parser in Rust replacing tantivy's one will solve the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would that be implementing the same DSL or something new?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could implement something smaller. Exact terms, phrase ("exact text quoted") and prefix

Comment on lines +65 to +68
Implementing the query AST in protobuffers would have a simple mapping in the
Rust side from the protobuf grammar to tantivy or unified index one. However,
we'll need to move query parsing logic to Python side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should implement a query language.
Using protobuffers for defining the AST may be a problem, either the protobuffer types will be spread all over our codebase, making language modifications a nightmare. Or sooner than later we will end up implementing a lowering pass and a new IR just to get rid of the protobuf (which is basically implementing a new parser). We also need an ergonomic AST to not end up suffering every time we define a pass. I dont think Protobuf can give us that.
In option 2 what I dont like is that it misses the point of defining a language. If we are defining our own query language having different parsers based on some client logic seems like is doomed for code repetition and, as always, time consuming and error prone language modifications. I think it would be better to implement this feature inside the language, maybe through something like tags (but we need to give it a though). I would avoid as much as possible implementing different parsers for the same language inside the project, project for which we have defined the language.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think protobuf is that bad defining the AST. We could implement kind of the same tantivy does with types (BooleanQuery, TermQuery, PraseQuery...). Totally agree if we are not careful we'll end up with even more protobuffer infection though.

Could you explain more this tags idea?

The problem of defining the language and only pass this to the node is the inability to know where it comes from (users or python). If we want to do something special in the Python side (as with /catalog and /seach), we'll need to parse the user query to build the Python one. So Pyhon will need to know about the query language (either directly or with a binding)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Protobuffer infection is a problem, but there are other reasons why that is a bad road to take. Even though Protobuf allows defining recursive types, we can not control how they are represented which is a crucial part when dealing with AST's. We will be forced to use however the library decides to implement this (Most likely a nested Box), which may be far from the best fit at some point in time. Another problem with using protobuf is how difficult defining utils like the visitor trait is going to be, the actual rust code of the AST will only be available after compilation.
If want we want is to make really easy to modify the query based on some business logic there are better approaches, in my opinion our life would be much easier if:

  1. We implement a parser from the textual representation of the query language to json and viceversa, we can do this in Rust and provide bindings. This way we have an representation of the language that is easy to modify and in a format that most languages understand.
  2. The unified index can then use the json representation directly or define its own internal representation of the query language (probably optimised for the kind of passes it needs to do). But the definition of the language is centralised in one crate (the one that can produce json)
    The Python side then just needs to translate the query to its json representation, do some changes if it needs to do so and send it to the unified index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, sounds really cool and powerful. But maybe is an overkill for the requested story (that was basically provide a prefix match on certain nucliadb_texts searches).

Do you think we should go directly for your implementation idea or maybe do a simpler approach. If we are going to develop this, we'll need to prioritize first

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I thought we already had a solution for the feature that is needed right now and this was just for discussing a query language implementation for the near future.
For prefix match in nucliadb_texts lets go for the quick and fast solution, the query language is more of a unified index topic IMO.
Right now I see in the [Tantivy documentation] (https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html) that there is an easy path:

Phrase terms also support the * prefix operator which switches the phrase’s matching to consider all documents which contain the last term as a prefix, e.g. "big bad wo"* will match "big bad wolf".

Since we can add prefix match to frase queries using Tantivy's query language, if you need an endpoint that needs to run a prefix query for a given word A, just send the query "A"* to the node. That will do right?

@javitonino javitonino closed this May 24, 2024
@javitonino javitonino deleted the nucliadb-query-grammar-proposal branch May 24, 2024 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants