
Structured Search Pipeline #55

Open
ddematheu opened this issue Jan 2, 2024 · 3 comments

@ddematheu
Contributor

Querying requirements in RAG fall not only on unstructured data that has been embedded and added to a vector database, but also on structured data sources where semantic search doesn't really make sense.

Goal: Provide a pipeline interface that connects to a structured data source and generates structured queries in real time from incoming search queries.

Implementation:

  • Pseudo Pipeline without an embed or sink connector, just a data source.
  • Data source connector is configured and an initial pull from the database is done to examine the fields available and their types.
  • Search generates a query using an LLM based on the fields available in the database.
  • The Pipeline can be used as part of a PipelineCollection and supported by smart_route in order for the model to decide when to use it.
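The first two steps above could be sketched roughly as follows. This is a minimal, self-contained sketch using SQLite as the structured source; `inspect_schema` and `build_prompt` are hypothetical helpers (not part of any existing connector API), and the resulting prompt would be passed to an LLM for query generation.

```python
# Sketch of the "initial pull" step: connect to a structured source, inspect
# the available fields and their types, and build an LLM prompt that pairs the
# schema with the user's natural-language search.
import sqlite3

def inspect_schema(conn: sqlite3.Connection) -> dict:
    """Return {table: [(column, type), ...]} for every table in the database."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    return {t: [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({t})")]
            for t in tables}

def build_prompt(schema: dict, user_query: str) -> str:
    """Format the schema and the natural-language query into an LLM prompt."""
    lines = [f"{t}({', '.join(f'{col} {typ}' for col, typ in cols)})"
             for t, cols in schema.items()]
    return "tables:\n" + "\n".join(lines) + f"\nquery for: {user_query}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE engineers (id VARCHAR, name TEXT, age INT)")
prompt = build_prompt(inspect_schema(conn), "Group by the age column")
print(prompt)
# tables:
# engineers(id VARCHAR, name TEXT, age INT)
# query for: Group by the age column
```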

Alternative implementation:

  • In order to reduce the latency of the 2-3 back-to-back LLM calls needed to generate a query and validate it, what if the query generation were done pre-emptively and cached in a vector database?
  • Using an LLM, we would try to predict the top sets of queries one might expect from the database and its permutations. (This might limit the complexity of the queries, but could answer for 80% of use cases.)
  • At search time we would run a similarity search of the incoming query against the descriptions of the "cached" queries. We can then run the top query against the database.
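The cached-query lookup in the steps above could look something like this. A toy bag-of-words embedding and an in-memory list stand in for a real embedding model and vector database; the `cache` entries are made-up examples, not generated output.

```python
# Sketch of the alternative implementation: embed the descriptions of
# pre-generated (query, description) pairs once, then at search time return
# the cached SQL whose description is most similar to the incoming search.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Pre-generated pairs, as proposed above (hypothetical examples).
cache = [
    {"query": "SELECT name, age FROM engineers GROUP BY age",
     "description": "group engineers by age"},
    {"query": "SELECT COUNT(*) FROM engineers",
     "description": "count all engineers"},
]
vectors = [embed(c["description"]) for c in cache]

def search(user_query: str) -> str:
    """Return the cached SQL whose description best matches the search."""
    q = embed(user_query)
    best = max(range(len(cache)), key=lambda i: cosine(q, vectors[i]))
    return cache[best]["query"]

print(search("how many engineers are there, count them"))
# SELECT COUNT(*) FROM engineers
```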
@ddematheu ddematheu added enhancement New feature or request open for discussion labels Jan 2, 2024
@sky-2002
Contributor

@ddematheu Haven't yet fully understood this, but the alternatives sound similar to the internals of this project - aidb. Can you please give an example to elaborate?
As far as I understood, we have some structured data sources, and we want to map a natural language query to an appropriate SQL query (or any structured query) using an LLM.

@ddematheu
Contributor Author

The thought process was: given a database, generate a set of common queries for it (based on the schema) using an LLM. From there, take the queries and their descriptions and embed them (embed the description). Then at runtime when someone searches, we take the search, compare it against the embeddings, and use the stored query to query the database (or pass it into a database for fine-tuning based on the search).

It is a bit more similar to this https://github.com/vanna-ai/vanna.

@sky-2002
Contributor

sky-2002 commented Jan 20, 2024

@ddematheu Okay, so I understood it like this and tried it with the t5-small-text-2-sql model:

input_prompt = '''
tables:\n CREATE TABLE engineers (id: VARCHAR, name: TEXT, age: INT); \n 
query for: Group by the age 'column' 
'''
print("Generated SQL:")
# generate_sql wraps a call to the t5-small-text-2-sql model
generate_sql(input_prompt=input_prompt)

Output:

Generated SQL:
'SELECT name, age FROM engineers GROUP BY age'

So we would create pairs and embed the description,

{'query': 'SELECT name, age FROM engineers GROUP BY age', 'description': 'Group by the age column'}

Is this what you meant?

Update: Also tried with a small CPU-ready LLM.
