refactor: count number of documents using hnswlib #1759

jupyterjazz · 2023-08-23T09:31:25Z

Data storage in HnswDocumentIndex works in the following way:

Vectors are stored on disk using hnswlib.
All other types of data are saved in an SQLITE database.

One of the operations we frequently perform is determining the total number of documents (num_docs()). However, the only way to get the number of documents from SQLITE is to scan the entire table. Even though we've made efforts to reduce the number of times we use this functionality (#1729), it's still a time-consuming process.

For better performance, let's do the following: instead of scanning the SQLITE table, we can use hnswlib's get_current_count function to quickly get the number of documents in the index.
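To make the two counting paths concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory table named `docs` (a made-up schema, not the one HnswDocumentIndex actually uses) and a small stand-in class, `FakeHnswIndex`, so the snippet runs without hnswlib installed. With the real library, `get_current_count()` on an `hnswlib.Index` would take the place of the stand-in's method.

```python
import sqlite3


class FakeHnswIndex:
    """Hypothetical stand-in for hnswlib.Index; it only mimics get_current_count()."""

    def __init__(self) -> None:
        self._count = 0

    def add_items(self, vectors) -> None:
        # The real index stores the vectors; here we only track how many were added.
        self._count += len(vectors)

    def get_current_count(self) -> int:
        # Returns a stored counter, so no scan is involved.
        return self._count


# Assumed schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, data BLOB)")
conn.executemany(
    "INSERT INTO docs (doc_id, data) VALUES (?, ?)",
    [(i, b"payload") for i in range(3)],
)

index = FakeHnswIndex()
index.add_items([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])

# Path 1: count rows through SQL.
sqlite_count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]

# Path 2: ask the vector index for its element count.
hnsw_count = index.get_current_count()
```

Both paths agree here only because every document in the sketch has a vector.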

But there's a potential issue with this approach. What if documents don't have associated vectors? get_current_count would return 0.

We have two potential solutions:

Notify/Warn users about this behavior and return 0.
Fall back to the older method of counting using the SQL table if vector-less documents are detected.
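The second option could look roughly like the sketch below. How vector-less documents would be detected is left open in this issue, so the `has_vectorless_docs` flag and the `count_in_sqlite` callback are hypothetical names used only for illustration.

```python
from typing import Callable


def count_docs(
    hnsw_count: int,
    has_vectorless_docs: bool,
    count_in_sqlite: Callable[[], int],
) -> int:
    """Prefer the count reported by hnswlib; fall back to SQL when it cannot be trusted.

    hnsw_count: value reported by the vector index (e.g. get_current_count()).
    has_vectorless_docs: hypothetical flag, True if some documents have no vector.
    count_in_sqlite: hypothetical callback running the slower SQL-based count.
    """
    if has_vectorless_docs:
        # The vector index would undercount, so pay for the table scan.
        return count_in_sqlite()
    return hnsw_count


# Fast path: the SQL callback is never invoked.
fast = count_docs(5, False, lambda: 7)

# Fallback path: vector-less documents exist, so the SQL count wins.
slow = count_docs(0, True, lambda: 7)
```

Passing the SQL count as a callback keeps the expensive query from running unless the fallback is actually needed.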

shobhit9957 · 2023-08-24T11:07:12Z

/attempt #1759

I would like to solve this issue. Can I get some more help?

JoanFM · 2023-08-24T11:11:36Z

Hey @shobhit9957 ,

Sure, the main refactoring to be done is to change the code here:

    def num_docs(self) -> int:
        """
        Get the number of documents.
        """
        if self._num_docs == 0:
            self._num_docs = self._get_num_docs_sqlite()
        return self._num_docs

to something that does not rely on _get_num_docs_sqlite but instead calls a new private method, _get_num_docs_hnsw.
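A rough sketch of what that refactor might look like follows. The `_hnsw_indices` mapping (column name to index object) and the `FakeHnswIndex` stand-in are assumptions made so the snippet runs on its own; they are not confirmed by this thread. Taking the maximum across indices is likewise just one possible choice.

```python
class FakeHnswIndex:
    """Hypothetical stand-in for hnswlib.Index exposing get_current_count()."""

    def __init__(self, count: int) -> None:
        self._count = count

    def get_current_count(self) -> int:
        return self._count


class IndexSketch:
    """Illustrative class; only the counting logic is modeled."""

    def __init__(self, hnsw_indices: dict) -> None:
        # Assumption: one vector index per embedding column.
        self._hnsw_indices = hnsw_indices
        self._num_docs = 0

    def _get_num_docs_hnsw(self) -> int:
        # Assumption: each index holds at most one vector per document,
        # so the largest count is used; an index-less store yields 0.
        return max(
            (idx.get_current_count() for idx in self._hnsw_indices.values()),
            default=0,
        )

    def num_docs(self) -> int:
        """Get the number of documents, caching the result as the original does."""
        if self._num_docs == 0:
            self._num_docs = self._get_num_docs_hnsw()
        return self._num_docs


sketch = IndexSketch({"embedding": FakeHnswIndex(4)})
total = sketch.num_docs()
```

Note that this inherits the caveat raised in the issue description: documents without vectors are not counted.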

shobhit9957 · 2023-08-24T14:33:32Z

So I should replace _get_num_docs_sqlite with _get_num_docs_hnsw, is that correct?
I'm just a beginner in this open-source community, just trying my hand at it. Please help me out.

JoanFM · 2023-08-24T14:35:56Z

Yes, this would be the right approach

shobhit9957 · 2023-08-24T14:36:40Z

Thanks! Will do the PR soon.

shobhit9957 · 2023-08-24T15:00:08Z

Hey Joan. I've submitted the PR. Please check whether there are any other issues or mistakes in it; I would be happy to fix them and submit the PR again. Thanks!

shobhit9957 · 2023-08-24T15:11:59Z

    def num_docs(self) -> int:
        """
        Get the number of documents.
        """
        if self._num_docs == 0:
            # Replace the old method with the new one
            self._num_docs = self._get_num_docs_hnsw()
        return self._num_docs

    def _get_num_docs_hnsw(self) -> int:
        """
        Get the number of documents using the HNSW method.
        This method should return the count of documents using the new technique.
        """
        # Implement your logic here to get the count using the HNSW method
        # For example, you might access some data or perform calculations
        # that help you quickly determine the number of documents.
        # Then return the count.
        return calculated_num_docs

Is this correct?

munish0838 · 2023-10-22T17:16:43Z

Hi, is this issue fixed, or does it still need contributions?
I would like to contribute.

JoanFM · 2023-10-22T17:23:00Z

Hey @munish0838, there is still work to be done

munish0838 · 2023-10-24T10:35:55Z

I would like to start working on this issue

JoanFM · 2023-10-24T11:00:33Z

Be my guest. I believe there are parts of the plan that have not been applied yet.

jupyterjazz added the good-first-issue and area/document-index labels on Aug 23, 2023