NucliaDB Query Grammar proposal #1345

jotare · 2023-09-18T09:51:10Z

Description

Describe the proposed changes made in this PR.

How was this PR tested?

Describe how you tested this PR.

codecov · 2023-09-18T09:54:43Z

Codecov Report

Patch and project coverage have no change.

Comparison is base (feeaeb7) 85.54% compared to head (e246029) 85.54%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1345   +/-   ##
=======================================
  Coverage   85.54%   85.54%           
=======================================
  Files         431      431           
  Lines       23629    23629           
=======================================
  Hits        20214    20214           
  Misses       3415     3415

Flag	Coverage Δ
ingest	`75.06% <ø> (ø)`
nucliadb	`67.16% <ø> (ø)`
reader	`74.65% <ø> (ø)`
sdk	`42.51% <ø> (ø)`
train	`60.25% <ø> (ø)`
utils	`84.83% <ø> (ø)`
writer	`84.98% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

tarekziade · 2023-09-18T09:56:49Z

proposals/records/006-nucliadb-query-grammar.md

+
+### Solution 2: Rust query parser
+
+Implementing a query parser in Rust replacing tantivy's one will solve the


would that be implementing the same DSL or something new?

We could implement something smaller. Exact terms, phrase ("exact text quoted") and prefix

hermeGarcia · 2023-09-19T07:23:01Z

proposals/records/006-nucliadb-query-grammar.md

+Implementing the query AST in protobuffers would have a simple mapping in the
+Rust side from the protobuf grammar to tantivy or unified index one. However,
+we'll need to move query parsing logic to Python side.
+


I think we should implement a query language.
Using protobuffers for defining the AST may be a problem, either the protobuffer types will be spread all over our codebase, making language modifications a nightmare. Or sooner than later we will end up implementing a lowering pass and a new IR just to get rid of the protobuf (which is basically implementing a new parser). We also need an ergonomic AST to not end up suffering every time we define a pass. I dont think Protobuf can give us that.
In option 2 what I dont like is that it misses the point of defining a language. If we are defining our own query language having different parsers based on some client logic seems like is doomed for code repetition and, as always, time consuming and error prone language modifications. I think it would be better to implement this feature inside the language, maybe through something like tags (but we need to give it a though). I would avoid as much as possible implementing different parsers for the same language inside the project, project for which we have defined the language.

I don't think protobuf is that bad defining the AST. We could implement kind of the same tantivy does with types (BooleanQuery, TermQuery, PraseQuery...). Totally agree if we are not careful we'll end up with even more protobuffer infection though.

Could you explain more this tags idea?

The problem of defining the language and only pass this to the node is the inability to know where it comes from (users or python). If we want to do something special in the Python side (as with /catalog and /seach), we'll need to parse the user query to build the Python one. So Pyhon will need to know about the query language (either directly or with a binding)

Protobuffer infection is a problem, but there are other reasons why that is a bad road to take. Even though Protobuf allows defining recursive types, we can not control how they are represented which is a crucial part when dealing with AST's. We will be forced to use however the library decides to implement this (Most likely a nested Box), which may be far from the best fit at some point in time. Another problem with using protobuf is how difficult defining utils like the visitor trait is going to be, the actual rust code of the AST will only be available after compilation.
If want we want is to make really easy to modify the query based on some business logic there are better approaches, in my opinion our life would be much easier if:

We implement a parser from the textual representation of the query language to json and viceversa, we can do this in Rust and provide bindings. This way we have an representation of the language that is easy to modify and in a format that most languages understand.

The unified index can then use the json representation directly or define its own internal representation of the query language (probably optimised for the kind of passes it needs to do). But the definition of the language is centralised in one crate (the one that can produce json)
The Python side then just needs to translate the query to its json representation, do some changes if it needs to do so and send it to the unified index.

Wow, sounds really cool and powerful. But maybe is an overkill for the requested story (that was basically provide a prefix match on certain nucliadb_texts searches).

Do you think we should go directly for your implementation idea or maybe do a simpler approach. If we are going to develop this, we'll need to prioritize first

Sorry I thought we already had a solution for the feature that is needed right now and this was just for discussing a query language implementation for the near future.
For prefix match in nucliadb_texts lets go for the quick and fast solution, the query language is more of a unified index topic IMO.
Right now I see in the [Tantivy documentation] (https://docs.rs/tantivy/latest/tantivy/query/struct.QueryParser.html) that there is an easy path:

Phrase terms also support the * prefix operator which switches the phrase’s matching to consider all documents which contain the last term as a prefix, e.g. "big bad wo"* will match "big bad wolf".

Since we can add prefix match to frase queries using Tantivy's query language, if you need an endpoint that needs to run a prefix query for a given word A, just send the query "A"* to the node. That will do right?

NucliaDB Query Grammar proposal

e246029

jotare requested a review from a team as a code owner September 18, 2023 09:51

tarekziade reviewed Sep 18, 2023

View reviewed changes

hermeGarcia reviewed Sep 19, 2023

View reviewed changes

javitonino closed this May 24, 2024

javitonino deleted the nucliadb-query-grammar-proposal branch May 24, 2024 13:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NucliaDB Query Grammar proposal #1345

NucliaDB Query Grammar proposal #1345

jotare commented Sep 18, 2023

codecov bot commented Sep 18, 2023 •

edited

tarekziade Sep 18, 2023

jotare Sep 18, 2023

hermeGarcia Sep 19, 2023 •

edited

jotare Sep 19, 2023

hermeGarcia Sep 19, 2023 •

edited

jotare Sep 19, 2023

hermeGarcia Sep 20, 2023 •

edited


		### Solution 2: Rust query parser

		Implementing a query parser in Rust replacing tantivy's one will solve the

NucliaDB Query Grammar proposal #1345

NucliaDB Query Grammar proposal #1345

Conversation

jotare commented Sep 18, 2023

Description

How was this PR tested?

codecov bot commented Sep 18, 2023 • edited

Codecov Report

tarekziade Sep 18, 2023

Choose a reason for hiding this comment

jotare Sep 18, 2023

Choose a reason for hiding this comment

hermeGarcia Sep 19, 2023 • edited

Choose a reason for hiding this comment

jotare Sep 19, 2023

Choose a reason for hiding this comment

hermeGarcia Sep 19, 2023 • edited

Choose a reason for hiding this comment

jotare Sep 19, 2023

Choose a reason for hiding this comment

hermeGarcia Sep 20, 2023 • edited

Choose a reason for hiding this comment

codecov bot commented Sep 18, 2023 •

edited

hermeGarcia Sep 19, 2023 •

edited

hermeGarcia Sep 19, 2023 •

edited

hermeGarcia Sep 20, 2023 •

edited