
[Question]: Storing Duplicates in vector Db #13595

Open
Amiya-Lahiri-AI opened this issue May 20, 2024 · 4 comments
Labels
question Further information is requested

Comments


Amiya-Lahiri-AI commented May 20, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I have created nodes and the nodes are unique. However, when I create a vector store index, it creates duplicate
entries in my vector database. I am using Qdrant, and I have checked the collection: there are duplicate entries.
This is my Qdrant client initialization:

import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
client = qdrant_client.QdrantClient(
    url="xxxxx",
    api_key="xxxxx",
)
asynclient = qdrant_client.AsyncQdrantClient(
    url="xxxxxx",
    api_key="xxxxxx",
)
vector_store = QdrantVectorStore(client=client, collection_name="Tender-Db", aclient=asynclient)

After creating the nodes, I build the index with a StorageContext:

from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    nodes,
    storage_context=storage_context,
    embed_model=VoyageEmbeddModel,
)
# index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=VoyageEmbeddModel)

Can anyone please tell me how to resolve this issue?

@Amiya-Lahiri-AI Amiya-Lahiri-AI added the question Further information is requested label May 20, 2024

dosubot bot commented May 20, 2024

To resolve the issue of duplicate entries in your Qdrant database when using QdrantVectorStore, you should implement a mechanism to check for the existence of nodes before adding them. The QdrantVectorStore class itself does not automatically prevent duplicates based on content; it only uses the id field as a unique identifier.

Here's a strategy to ensure you're only adding unique nodes:

  1. Use the get_nodes method to query the database for existing nodes by their IDs.
  2. Filter out any nodes that are already present in the database from your list of nodes to be added.
  3. Add only the new, unique nodes to the database.

This approach involves checking each node's unique identifier (assuming each BaseNode has a unique node_id attribute) against the existing entries in the database and only proceeding with the addition of nodes that are not already present.

Here is a simplified code snippet to illustrate this approach:

from typing import List

from llama_index.core.schema import BaseNode
from llama_index.vector_stores.qdrant import QdrantVectorStore


def add_unique_nodes(vector_store: QdrantVectorStore, nodes: List[BaseNode]) -> None:
    node_ids = [node.node_id for node in nodes]  # Extract node IDs
    existing_nodes = vector_store.get_nodes(node_ids=node_ids)  # Query existing nodes

    existing_node_ids = {node.node_id for node in existing_nodes}  # IDs already stored
    new_nodes = [node for node in nodes if node.node_id not in existing_node_ids]  # Filter new nodes

    if new_nodes:
        vector_store.add(new_nodes)  # Add only the new nodes

By incorporating this method, you can effectively prevent the addition of duplicate entries in your Qdrant database. This ensures that your VectorStoreIndex maintains uniqueness among the nodes it contains.


@logan-markewich
Collaborator

You should use an ingestion pipeline with a docstore and vector store attached.

Assuming each input document has a consistent doc id, it can look up the existing hash of that document (if any) in the docstore and decide to upsert, skip, or insert.

Here's one example (ignore the cache though)
https://docs.llamaindex.ai/en/stable/examples/ingestion/redis_ingestion_pipeline/?h=ingestion+pipeline
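
As a rough illustration of the decision Logan describes (this is not the actual llama_index implementation; the `docstore_hashes` dict is a hypothetical stand-in for the docstore), the upsert/skip/insert logic boils down to comparing a content hash per doc id:

```python
import hashlib


def decide_action(doc_id: str, text: str, docstore_hashes: dict) -> str:
    """Decide what to do with an incoming document, mimicking the
    docstore-based dedup strategy: insert if the doc id is unseen,
    skip if the content hash is unchanged, upsert if it changed."""
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    existing_hash = docstore_hashes.get(doc_id)
    if existing_hash is None:
        docstore_hashes[doc_id] = content_hash
        return "insert"
    if existing_hash == content_hash:
        return "skip"
    docstore_hashes[doc_id] = content_hash
    return "upsert"
```

Running the same document through twice yields "skip" the second time, which is why duplicates never reach the vector store when a docstore is attached.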

@130jd

130jd commented May 20, 2024

Thanks @logan-markewich, would this be your recommended implementation if I wanted to use a Chroma vector_store and a locally stored docstore?

Embed and store in Chroma vector_store and local docstore:

import chromadb
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        embed_model,
    ],
    docstore=SimpleDocumentStore(),
    vector_store=ChromaVectorStore(
        chroma_collection=chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("collection_name")
    ),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

nodes = pipeline.run(documents=documents)

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, embed_model=embed_model
)

pipeline.persist("./pipeline_storage")

Load from Chroma vector_store and local docstore:

documents = SimpleDirectoryReader(
    "./test_redis_data", filename_as_id=True
).load_data()

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    ]
)

pipeline.load("./pipeline_storage")

nodes = pipeline.run(documents=documents)

index = VectorStoreIndex.from_vector_store(
    pipeline.vector_store, embed_model=embed_model
)

@logan-markewich
Collaborator

@130jd not quite -- you should pass in the vector store again when loading. Tbh I would load both the vector store and docstore outside of the pipeline and just pass them in. But that's just me
