
Implement Pipeline Collection smart search #43

Open
ddematheu opened this issue Dec 26, 2023 · 13 comments
@ddematheu
Contributor

Currently we support unified (re-rank results into a single list) and separate (results for each pipeline returned separately) searches for a collection.

Adding smart search, which will route a query intelligently by identifying which collections are worth searching. Using the description of each pipeline, we match it against the query.

@sky-2002
Contributor

Hey @ddematheu, can you elaborate on this? I would like to contribute to it.

@ddematheu
Contributor Author

Sure.

At a high level, we have Pipelines that each have a description associated with them (https://github.com/NeumTry/NeumAI/blob/main/neumai/neumai/Pipelines/Pipeline.py). A pipeline represents a collection of data, as it has both a data source and a vector DB associated with it.

We introduced PipelineCollection (https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neumai_tools/PipelineCollection/PipelineCollection.py) as an easy way to query multiple pipelines at the same time. Ex. I want to query both user records in Postgres and general info from files in S3. This is sort of helpful, but the main piece of feedback we have heard is that users would prefer the system to dynamically decide which data collection to query based on the question. Ex. if I want to know the status of a user, I would query Postgres, whereas if I want information about the mortgage they are getting, I would go to S3, where the mortgage document is stored.

With this in mind, I have stubbed out a search_routed method.

The method is designed to take a Collection of Pipelines (1+) and use the description field to decide which one(s) to search.

For the decision, there are two approaches I have in mind:

  1. Using embeddings to do some basic classification: compare the embedding of the description against the embedding of the query, and use a similarity-score threshold to decide whether a given pipeline should be searched for the query.

  2. Using an LLM with function calling to decide, based on the description of each pipeline in the collection, which one to query.

I was leaning towards option 1 to start, given that it is more lightweight and will provide faster responses, but option 2 might provide better quality.
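To make option 1 concrete, here is a minimal sketch of embedding-based routing. `PipelineStub`, the attribute names, and the 0.5 threshold are illustrative assumptions, not the actual NeumAI API:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PipelineStub:
    """Hypothetical stand-in for a Pipeline with a pre-computed description embedding."""
    name: str
    description_embedding: np.ndarray


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def route_by_description(query_embedding: np.ndarray,
                         pipelines: list[PipelineStub],
                         threshold: float = 0.5) -> list[PipelineStub]:
    """Keep only the pipelines whose description is similar enough to the query."""
    return [p for p in pipelines
            if cosine_similarity(query_embedding, p.description_embedding) >= threshold]
```

Each pipeline that clears the threshold would then be searched, and its results merged or re-ranked as in the existing unified search.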

@sky-2002
Contributor

sky-2002 commented Dec 28, 2023

@ddematheu Thanks, that made it clearer to me. I can also think of an approach like comparing the query with a pre-computed cluster center for each pipeline/sink, something representative of the data in the pipeline/sink, along with the pipeline description. It's the same as your option 1, with the addition of these pipeline representatives. I thought of this because when the data in the pipeline changes, the representative would also update and stay relevant. Would that be useful?

@ddematheu
Contributor Author

ddematheu commented Dec 29, 2023 via email

@sky-2002
Contributor

@ddematheu I will have to check whether each DB offers this; I will get back on that.

Regarding implementation, as an initial idea, I had thought of it like the way you mentioned:

Approach

  • Every SinkConnector would have to define methods compute_cluster_center and update_cluster_center
  • Whenever new data is added to a sink, it would trigger the update_cluster_center method

Doubts

  1. In each sink, a single data unit can have multiple fields, some or all of them vectorized, so which fields should we use for the cluster center calculation?
  2. I am not sure, but it might happen that a user has vectorized data in one sink with 512-dimensional embeddings and in another sink with 1024-dimensional embeddings; what should we do in that case?
  3. What method should we use for the cluster center calculation? Would simple averaging suffice?

Discussion

  • I would love to know if there are more approaches to this. Please share if you come across any, I would also do some research on that.
  • We can also explore simpler approaches instead of embedding similarity, because the semantic route raises multiple questions, such as which model to use, what embedding size to consider, and what the threshold should be, and would clutter the config. Maybe we can discuss this in detail.
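On doubt 3 above: a simple average can be maintained incrementally, so an update hook would not need to re-read every vector in the sink. A minimal sketch of that idea (the class and method names here are illustrative, not NeumAI code):

```python
import numpy as np


class ClusterCenterTracker:
    """Running-mean representative of a sink's vectors (illustrative only).

    `update` is what a hypothetical update_cluster_center hook could call
    whenever new vectors are written to the sink.
    """

    def __init__(self, dim: int):
        self.center = np.zeros(dim)
        self.count = 0

    def update(self, new_vectors: np.ndarray) -> np.ndarray:
        # Incremental mean: new_center = (old_count*old_center + batch_sum) / new_count,
        # rearranged so only the batch needs to be in memory.
        batch_sum = new_vectors.sum(axis=0)
        self.count += len(new_vectors)
        self.center = self.center + (batch_sum - len(new_vectors) * self.center) / self.count
        return self.center
```

Doubt 2 (mismatched dimensions across sinks) is untouched by this: each sink would keep its own tracker in its own embedding space.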

@ddematheu
Contributor Author

ddematheu commented Dec 30, 2023 via email

@sky-2002
Contributor

sky-2002 commented Dec 31, 2023

@ddematheu Okay, I will start with Marqo and try to implement a basic working version.

Update
I am done with implementing a get_representative_vector method for LanceDB and Marqo, using the mean of the vectors as the representative. Next I will write the code for search_routed and see.

@ddematheu
Contributor Author

Sounds good. Feel free to open a PR and I can take a look to provide feedback.

@sky-2002
Contributor

sky-2002 commented Jan 2, 2024

@ddematheu How will the query be vectorized? In the separate search, each time we use the respective pipeline's embed_query method. In this case, what should we use to vectorize the query?

@ddematheu
Contributor Author

This is where it gets hard with the representative vector, as that vector is determined by the embedding model used within each pipeline, so comparing vectors across pipelines is hard. Unless, for the comparison, we embed the query using each pipeline's embed_query and compare it against that pipeline's own representative; we would then just compare the resulting scores.

So yeah, I think using embed_query makes sense.
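That comparison could look like the sketch below. The thread establishes that pipelines expose embed_query, but the `representative_vector` and `name` attributes are hypothetical here. Note that each similarity is computed inside a single pipeline's own embedding space, so differing dimensions across sinks are not a problem:

```python
import numpy as np


def score_pipelines(query: str, pipelines: list) -> dict:
    """Embed the query with each pipeline's own model, then score it against
    that pipeline's representative vector via cosine similarity.

    Assumes each pipeline exposes embed_query(query) -> vector plus hypothetical
    name and representative_vector attributes. Only the final scalar scores are
    ever compared across pipelines, never the raw vectors.
    """
    scores = {}
    for pipeline in pipelines:
        q = np.asarray(pipeline.embed_query(query))
        r = np.asarray(pipeline.representative_vector)
        scores[pipeline.name] = float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))
    return scores
```

search_routed could then search only the pipelines whose score clears a threshold, or just the top-scoring one.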

@sky-2002
Contributor

sky-2002 commented Jan 2, 2024

Okay then, I will go ahead with embed_query for now. I am looking to get an initial version up and running as quickly as possible; you can then get feedback from users and we can develop it further.

@sky-2002
Contributor

sky-2002 commented Jan 7, 2024

@ddematheu I have implemented a first version of smart search. It works well; I tested it using two data sources and two sinks. Excited for this feature and its further improvements!

@ddematheu
Contributor Author

Awesome, taking a look at the PR.
