Multi-collection in a RAG system #32876

wilsoncastiblanco · 2024-05-08T18:08:44Z

wilsoncastiblanco
May 8, 2024

Hey there!!

I'm building a RAG system that uses different kinds of documents. Initially I created a collection with a knowledge base of JSON files, but when I tried to insert individual PDF files into the same collection, I got this error while storing the chunks in the vector DB using from_documents:

Error: pymilvus.exceptions.DataNotMatchException: <DataNotMatchException: (code=1, message=The data don't match with schema fields, expect 12 list, got 3)>

I suspect this error occurs because the collection was created with an initial schema based on the JSON files, so it fails when I try to add the PDF chunks.

Reading the documentation, I found that I can create different collections. But then, how can I search the main JSON collection and the PDF collection at the same time to get an accurate response in my RAG system?

Is there a way to accomplish that? I appreciate your thoughts on this.

yhmo · 2024-05-09T02:40:55Z

yhmo
May 9, 2024
Collaborator

It looks like you are using langchain.
from_documents() is an interface of langchain's VectorStore. It accepts a list of langchain.Document objects and an embeddings callback function:

@classmethod
def from_documents(
    cls: Type[VST],
    documents: List[Document],
    embedding: Embeddings,
    **kwargs: Any,
) -> VST:
    ...

Each langchain.Document has two members: page_content and metadata. page_content is the text content of your document, and metadata is a dict that contains extra properties of the document.

class Document(Serializable):
    page_content: str
    metadata: dict = Field(default_factory=dict)
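To make the mismatch concrete, here is a minimal sketch that uses a plain dataclass as a stand-in for langchain's Document (the real class is a pydantic model). The metadata keys and values below are made up for illustration; the point is only that a JSON-derived document and a PDF chunk can easily carry different metadata keys.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for langchain's Document: text plus free-form metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Hypothetical documents: the metadata keys depend on which loader produced them.
json_doc = Document(
    page_content="How do I reset my password?",
    metadata={"source": "kb.json", "seq_num": 1, "category": "account"},
)
pdf_doc = Document(
    page_content="Chapter 1: Getting started",
    metadata={"source": "manual.pdf", "page": 0},
)

# Keys that appear in one document's metadata but not in the other's.
mismatched_keys = set(json_doc.metadata) ^ set(pdf_doc.metadata)
print(sorted(mismatched_keys))
```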

The first time you call from_documents(), it initializes a Milvus middleware that creates a new collection in Milvus:
https://github.com/langchain-ai/langchain/blob/9992beaff9205825993ca65f589e9661bcadd939/libs/community/langchain_community/vectorstores/milvus.py#L294

By default, it infers the collection schema from the first metadata you pass in:
https://github.com/langchain-ai/langchain/blob/9992beaff9205825993ca65f589e9661bcadd939/libs/community/langchain_community/vectorstores/milvus.py#L323

If a later metadata dict differs from the first, you get the error "The data don't match with schema fields".
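The behaviour described above can be imitated in a few lines of plain Python. This is a simplified stand-in, not langchain or pymilvus code: the function names are hypothetical, and the real schema also contains primary-key, text, and vector fields. It only illustrates why a schema frozen from the first metadata rejects later metadata with different keys.

```python
def infer_field_names(first_metadata: dict) -> list:
    """Freeze the schema: one field per key of the first metadata dict."""
    return list(first_metadata)


def check_row(field_names: list, metadata: dict) -> None:
    """Raise if a later metadata dict does not have exactly the schema's keys."""
    if set(metadata) != set(field_names):
        raise ValueError(
            f"metadata keys {sorted(metadata)} do not match "
            f"schema fields {sorted(field_names)}"
        )


# The first batch (JSON knowledge base) defines the schema.
schema = infer_field_names({"source": "kb.json", "seq_num": 1})
check_row(schema, {"source": "kb2.json", "seq_num": 2})  # same keys: accepted

try:
    check_row(schema, {"source": "manual.pdf", "page": 0})  # different keys
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```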

But if you specify the metadata_field parameter, it defines a single JSON field that stores all the metadata in JSON format:
https://github.com/langchain-ai/langchain/blob/9992beaff9205825993ca65f589e9661bcadd939/libs/community/langchain_community/vectorstores/milvus.py#L131

With the JSON metadata field, you can pass different metadata each time you call from_documents():
from_documents(documents=xxxx, embedding=xxxx, metadata_field="META_FIELD")
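As a rough illustration of why a single JSON field sidesteps the mismatch, the sketch below packs each document's metadata under one key so that every row has the same columns. The build_row helper and the row layout are assumptions made for this example, not what langchain does internally.

```python
import json


def build_row(text: str, metadata: dict, metadata_field: str = "META_FIELD") -> dict:
    """Pack all metadata under one key so every row has the same columns."""
    return {"text": text, metadata_field: json.dumps(metadata, sort_keys=True)}


rows = [
    build_row("How do I reset my password?", {"source": "kb.json", "seq_num": 1}),
    build_row("Chapter 1: Getting started", {"source": "manual.pdf", "page": 0}),
]

# Both rows expose identical columns even though their metadata differs.
columns = {tuple(sorted(row)) for row in rows}
print(columns)
```

The trade-off is that individual metadata keys are no longer separate scalar fields of the collection.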

3 replies

wilsoncastiblanco May 9, 2024
Author

Hey @yhmo thanks for answering. Let me try that. I'll let you know how it goes.

wilsoncastiblanco May 9, 2024
Author

@yhmo I tested it out last night and it worked!! 🎉 However, I'm also using partition_key_field, and with metadata_field set I got another error saying that the partition key field must be VARCHAR or INT64. That happens because, internally, when metadata_field is provided the metadata is stored as a single JSON field:

if self._metadata_field is not None:
    fields.append(FieldSchema(self._metadata_field, DataType.JSON))
else:
    ...  # otherwise, one field is created per metadata key

So, based on your suggestion, I kept only a few important fields so that the metadata matches. It worked because the metadata now has the same shape for all documents:

doc.metadata = {
    "source": doc.metadata["source"],
    "file_path": doc.metadata["file_path"],
    "page": doc.metadata["page"],
    "namespace": id,
}

With the metadata standardized, I removed metadata_field. That was the original problem: the documents' metadata did not match, since some had more fields and some had fewer.
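A small helper along these lines could apply the same standardization to every document before insertion. It is only a sketch: the normalize_metadata name, the default values for missing keys, and the type coercion are assumptions added here (the snippet above would instead raise a KeyError when a key is missing).

```python
def normalize_metadata(metadata: dict, namespace: str) -> dict:
    """Reduce metadata to a fixed key set with consistent value types."""
    return {
        # Missing keys fall back to assumed defaults so every dict has the same shape.
        "source": str(metadata.get("source", "")),
        "file_path": str(metadata.get("file_path", "")),
        "page": int(metadata.get("page", 0)),
        "namespace": namespace,
    }


# Hypothetical loader output: the PDF chunk has extra keys, the JSON record lacks some.
pdf_meta = normalize_metadata(
    {"source": "manual.pdf", "file_path": "/docs/manual.pdf", "page": 3, "producer": "x"},
    namespace="tenant-a",
)
json_meta = normalize_metadata({"source": "kb.json", "seq_num": 1}, namespace="tenant-a")

print(list(pdf_meta) == list(json_meta))
```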

Thank you so much for your help!

yhmo May 10, 2024
Collaborator

Correct. If you are using partition_key_field, the metadata must be standardized. The schema is inferred from the type of the first value of each metadata key.
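The sketch below imitates that inference with a simplified, assumed mapping from Python types to Milvus-style type names; it is not the pymilvus implementation, and the helper names are hypothetical. It also shows why a key stored inside a JSON field cannot serve as the partition key, given the VARCHAR or INT64 restriction from the error message above.

```python
# Simplified, assumed mapping from Python value types to Milvus-style type names.
TYPE_NAMES = {bool: "BOOL", int: "INT64", float: "DOUBLE", str: "VARCHAR", dict: "JSON"}
PARTITION_KEY_TYPES = {"INT64", "VARCHAR"}


def infer_schema(first_metadata: dict) -> dict:
    """Map each metadata key to a type name based on its first value."""
    return {key: TYPE_NAMES[type(value)] for key, value in first_metadata.items()}


def can_be_partition_key(schema: dict, field_name: str) -> bool:
    """A partition key must be a top-level field of an allowed scalar type."""
    return schema.get(field_name) in PARTITION_KEY_TYPES


# Standardized metadata: every key becomes its own typed field.
schema = infer_schema({"source": "manual.pdf", "page": 0, "namespace": "tenant-a"})
print(schema)
print(can_be_partition_key(schema, "namespace"))

# With metadata_field, everything lives in one JSON column instead.
json_schema = {"META_FIELD": "JSON"}
print(can_be_partition_key(json_schema, "namespace"))
```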

yhmo · 2024-05-09T02:46:03Z

yhmo
May 9, 2024
Collaborator

For a RAG system, there is no need to create lots of collections. Typically, one collection is enough.
You input documents batch by batch, and the VectorStore calls the embeddings callback function to generate vectors from the page_content, then stores the metadata in the metadata_field. When you call the similarity_search() interface, it calls the search() interface of Milvus and returns the top-k results along with their metadata. You don't even need to care what format the metadata is stored in inside Milvus.
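The flow described above can be sketched with an in-memory toy store. Everything here is hypothetical and only mirrors the shape of the process: ToyVectorStore stands in for the langchain VectorStore plus Milvus, and toy_embedding is a deliberately crude vowel-count embedding, not a real model.

```python
import math
from typing import Callable, List, Tuple


def toy_embedding(text: str) -> List[float]:
    """Hypothetical embedding: vowel counts, L2-normalized."""
    counts = [float(text.lower().count(ch)) for ch in "aeiou"]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


class ToyVectorStore:
    """One collection holding vectors, texts, and arbitrary metadata."""

    def __init__(self, embedding: Callable[[str], List[float]]):
        self._embedding = embedding
        self._rows: List[Tuple[List[float], str, dict]] = []

    def add_documents(self, docs: List[Tuple[str, dict]]) -> None:
        # The store calls the embedding callback; callers never handle vectors.
        for text, metadata in docs:
            self._rows.append((self._embedding(text), text, metadata))

    def similarity_search(self, query: str, k: int = 1) -> List[Tuple[str, dict]]:
        q = self._embedding(query)
        # Rank rows by dot product with the query vector, highest first.
        ranked = sorted(
            self._rows,
            key=lambda row: -sum(a * b for a, b in zip(q, row[0])),
        )
        return [(text, metadata) for _, text, metadata in ranked[:k]]


store = ToyVectorStore(toy_embedding)
# Two batches with different metadata shapes go into the same collection.
store.add_documents([("reset password", {"source": "kb.json", "seq_num": 1})])
store.add_documents([("install guide", {"source": "manual.pdf", "page": 0})])

hits = store.similarity_search("reset password", k=1)
print(hits)
```

Each hit comes back with whatever metadata was stored alongside it, so the caller does not need a second collection to tell JSON records from PDF chunks.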

1 reply

wilsoncastiblanco May 9, 2024
Author

Now that I understand how it works from your explanation, I only have one collection and it works.
