[Question]: Is there any way to initialize my index data from the Elastic Search database? #13593

mglslg opened this issue May 20, 2024 · 3 comments
Labels
question Further information is requested

Comments


mglslg commented May 20, 2024

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

How can I get an index initialized with a docstore through VectorStoreIndex.from_vector_store? When I create an index with the VectorStoreIndex.from_vector_store method, I find that its docstore is empty, which means my index is not initialized with the existing vectors and has no nodes. Is this a bug?

Or, is there any other way to initialize my index data from the Elasticsearch database?


dosubot bot commented May 20, 2024

The method VectorStoreIndex.from_vector_store is designed to initialize a VectorStoreIndex instance without pre-populating it with data. It sets up the index structure to be ready for future operations but does not load existing data from the vector store. This is why you're finding the docstore empty after using this method. It's intended behavior, not a bug.

To initialize your index with data from an Elasticsearch database, you'll need to load that data into the index explicitly. This could involve using a method like ElasticsearchReader.load_data to query the existing data from Elasticsearch and then adding the resulting nodes to the index.
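A minimal sketch of that reader-based approach (assuming the llama-index Elasticsearch reader package is installed; the import path and the `endpoint`/`index`/`field` parameters follow recent llama_index releases and may differ in yours — `load_es_documents` is a hypothetical helper name):

```python
def load_es_documents(endpoint: str, index_name: str, text_field: str):
    """Pull existing documents out of Elasticsearch so they can be
    passed to an index (or converted into nodes)."""
    # Imported lazily so this sketch can be read without the package installed.
    from llama_index.readers.elasticsearch import ElasticsearchReader

    reader = ElasticsearchReader(endpoint=endpoint, index=index_name)
    # `field` selects which document field becomes the node text; an optional
    # `query` dict can restrict which documents are loaded.
    return reader.load_data(field=text_field)
```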

If you're trying to work with data already in a vector store and facing issues with an empty docstore, consider directly interacting with the vector store API or using the index's retrieval methods to fetch and add nodes. For example, you can retrieve all nodes stored in the vector store and add them to your index like this:

# Use a very large top-k so the retriever returns (effectively) every node,
# then unwrap the retrieved NodeWithScore objects into plain nodes.
retriever = index.as_retriever(similarity_top_k=10000)
source_nodes = retriever.retrieve("fake")  # placeholder query string
nodes = [x.node for x in source_nodes]

This approach lets you populate your index with the existing data in your vector store, such as an Elasticsearch index.
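Wrapped up, the retrieval trick looks roughly like this (a sketch, not a tested recipe: it assumes llama_index is installed, `extract_nodes` and `rebuild_docstore` are illustrative helper names, and `docstore.add_documents` is the docstore write method in recent llama_index releases):

```python
def extract_nodes(source_nodes):
    """Pure helper: unwrap NodeWithScore-like objects into their nodes."""
    return [item.node for item in source_nodes]


def rebuild_docstore(index, query_text="fake", top_k=10_000):
    """Fetch (effectively) all nodes from the vector store via a high
    top-k retrieval, then write them into the index's docstore."""
    retriever = index.as_retriever(similarity_top_k=top_k)
    nodes = extract_nodes(retriever.retrieve(query_text))
    index.docstore.add_documents(nodes)
    return nodes
```

Note the trade-off: this issues one (large) similarity query rather than a true scan, so with more than `top_k` nodes in the store some will be missed.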


@logan-markewich
Collaborator

@mglslg With all vector db integrations, the docstore is always empty, because all of the nodes are stored in the vector db. This is done to simplify storage.

If you need a docstore, you can either manually maintain one outside of the index using the docstore class itself, or set store_nodes_override=True when creating your index. Both options mean you need to save and load the docstore somewhere yourself.
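A hedged sketch of the second option (assumes llama_index is installed; import paths follow recent llama_index.core releases, and `build_index_with_docstore` is an illustrative helper name, not llama_index API):

```python
def build_index_with_docstore(documents, vector_store, persist_dir="./storage"):
    """Build a VectorStoreIndex that keeps nodes in the docstore as well
    as the vector store, then persist the docstore to disk."""
    # Imported lazily so this sketch can be read without llama_index installed.
    from llama_index.core import StorageContext, VectorStoreIndex

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        store_nodes_override=True,  # also write nodes to the docstore
    )
    # The docstore must be saved (and later reloaded) yourself:
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```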

@mglslg
Author

mglslg commented May 21, 2024

@logan-markewich Thank you for your answer!

I am a little confused about this docstore object.

It seems like the docstore is something like a cache: when I call refresh_ref_docs, I found code like self.docstore.set_document_hash(document.get_doc_id(), document.hash), and later, when deciding whether to update, it compares against the hash value stored in the docstore.

Could you please help explain its design concept? I did not find any explanation of it in the official documentation. I initially thought the refresh_ref_docs method would automatically read data from Elasticsearch and match the hash values, but later discovered that it only matches the cached hash value in the docstore. In the end, I had to check the hash values in Elasticsearch manually. My code finally looks like this:

import json
from typing import List

from llama_index.core import Document  # adjust the import path to your version


def get_changed_docs(es_index_name: str, doc_list: List[Document]) -> List[Document]:
    """Return the documents whose hash is missing from, or differs in, Elasticsearch."""
    es_client = get_es_client()
    changed_doc_list = []
    for doc in doc_list:
        # Find the nodes whose metadata records this document's id
        query = {
            "query": {
                "match": {
                    "metadata.doc_id": f"{doc.get_doc_id()}"
                }
            }
        }
        result = es_client.search(index=es_index_name, body=query)
        hits = result['hits']['hits']

        if not hits:
            # Not indexed yet, so it needs a refresh
            changed_doc_list.append(doc)
            continue

        for hit in hits:
            node_content = hit['_source']['metadata']['_node_content']
            node_obj = json.loads(node_content)
            # Relationship "1" points at the node's source document
            if node_obj['relationships']['1']['hash'] != doc.hash:
                changed_doc_list.append(doc)
                break  # avoid appending the same document once per node

    return changed_doc_list


need_refresh_docs = get_changed_docs(es_index_name, mongo_documents)

index.refresh_ref_docs(need_refresh_docs)
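For what it's worth, the hash lookup inside the loop above can be factored into a small stdlib-only helper (the `_node_content` layout and the `"1"` relationship key match what the snippet above already relies on; `ref_doc_hash` is an illustrative name):

```python
import json
from typing import Optional


def ref_doc_hash(hit: dict) -> Optional[str]:
    """Extract the source-document hash from an Elasticsearch hit whose
    metadata carries llama_index's serialized _node_content payload.
    Returns None when the expected fields are missing or malformed."""
    try:
        node_obj = json.loads(hit["_source"]["metadata"]["_node_content"])
        # Relationship key "1" is the node's SOURCE relationship.
        return node_obj["relationships"]["1"]["hash"]
    except (KeyError, TypeError, json.JSONDecodeError):
        return None
```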

Is there a better way to implement this within the llama_index framework?
