
[Question]: Storing Duplicates in vector Db #13595

Open
Amiya-Lahiri-AI opened this issue May 20, 2024 · 4 comments
Labels
question Further information is requested

Comments


Amiya-Lahiri-AI commented May 20, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I have created nodes and the nodes are unique. However, when I create a vector store index, it creates duplicate
entries in my vector database. I am using Qdrant, and I have checked the collection: there are duplicate entries.
This is my Qdrant client initialization:

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient(
    url="xxxxx",
    api_key="xxxxx",
)
asynclient = qdrant_client.AsyncQdrantClient(
    url="xxxxxx",
    api_key="xxxxxx",
)
vector_store = QdrantVectorStore(client=client, collection_name="Tender-Db", aclient=asynclient)

After creating the nodes, I build the index with a StorageContext:

from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    nodes,
    storage_context=storage_context,
    embed_model=VoyageEmbeddModel,
)
# index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=VoyageEmbeddModel)

Can anyone please tell me how to resolve this issue?

@Amiya-Lahiri-AI Amiya-Lahiri-AI added the question Further information is requested label May 20, 2024

dosubot bot commented May 20, 2024

To resolve the issue of duplicate entries in your Qdrant database when using QdrantVectorStore, you should implement a mechanism to check for the existence of nodes before adding them. The QdrantVectorStore class itself does not automatically prevent duplicates based on content; it only uses the id field as a unique identifier.

Here's a strategy to ensure you're only adding unique nodes:

  1. Use the get_nodes method to query the database for existing nodes by their IDs.
  2. Filter out any nodes that are already present in the database from your list of nodes to be added.
  3. Add only the new, unique nodes to the database.

This approach involves checking each node's unique identifier (assuming each BaseNode has a unique node_id attribute) against the existing entries in the database and only proceeding with the addition of nodes that are not already present.

Here is a simplified code snippet to illustrate this approach:

from typing import List

from llama_index.core.schema import BaseNode
from llama_index.vector_stores.qdrant import QdrantVectorStore


def add_unique_nodes(vector_store: QdrantVectorStore, nodes: List[BaseNode]) -> None:
    node_ids = [node.node_id for node in nodes]  # Extract node IDs
    existing_nodes = vector_store.get_nodes(node_ids=node_ids)  # Query existing nodes

    existing_node_ids = {node.node_id for node in existing_nodes}  # IDs already stored
    new_nodes = [node for node in nodes if node.node_id not in existing_node_ids]  # Filter new nodes

    if new_nodes:
        vector_store.add(new_nodes)  # Add only the new nodes

By incorporating this method, you can effectively prevent the addition of duplicate entries in your Qdrant database. This ensures that your VectorStoreIndex maintains uniqueness among the nodes it contains.


@logan-markewich
Collaborator

You should use an ingestion pipeline with a docstore and vector store attached.

Assuming each input document has a consistent doc id, it can look up the existing hash of that document (if any) in the docstore and decide to upsert, skip, or insert.

Here's one example (ignore the cache though)
https://docs.llamaindex.ai/en/stable/examples/ingestion/redis_ingestion_pipeline/?h=ingestion+pipeline
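
As a rough illustration of the decision Logan describes (this is not the actual llama_index implementation; the `docstore_hashes` dict is a hypothetical stand-in for the docstore), the upsert/skip/insert logic boils down to comparing a content hash per doc id:

```python
import hashlib


def decide_action(doc_id: str, text: str, docstore_hashes: dict) -> str:
    """Decide what to do with an incoming document, mimicking the
    docstore-based dedup strategy: insert if the doc id is unseen,
    skip if the content hash is unchanged, upsert if it changed."""
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    existing_hash = docstore_hashes.get(doc_id)
    if existing_hash is None:
        docstore_hashes[doc_id] = content_hash
        return "insert"
    if existing_hash == content_hash:
        return "skip"
    docstore_hashes[doc_id] = content_hash
    return "upsert"
```

Running the same document through twice yields "skip" the second time, which is why duplicates never reach the vector store when a docstore is attached.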

@130jd

130jd commented May 20, 2024

Thanks @logan-markewich, would this be your recommended implementation if I wanted to use a Chroma vector_store and a locally stored docstore?

Embed and store in Chroma vector_store and local docstore:

import chromadb
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        embed_model,
    ],
    docstore=SimpleDocumentStore(),
    vector_store=ChromaVectorStore(
        chroma_collection=chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("collection_name")
    ),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

nodes = pipeline.run(documents=documents)

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, embed_model=embed_model
)

pipeline.persist("./pipeline_storage")

Load from Chroma vector_store and local docstore:

documents = SimpleDirectoryReader(
    "./test_redis_data", filename_as_id=True
).load_data()

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ]
)

pipeline.load("./pipeline_storage")

nodes = pipeline.run(documents=documents)

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, embed_model=embed_model
)

@logan-markewich
Collaborator

@130jd not quite -- you should pass in the vector store again when loading. Tbh I would load both the vector store and docstore outside of the pipeline and just pass them in. But that's just me
