refactor: count number of documents using hnswlib #1759
Comments
/attempt #1759 I would like to solve this issue. Can I get some more help?
Hey @shobhit9957, sure. The main refactoring to be done is to change the code here:

    def num_docs(self) -> int:
        """
        Get the number of documents.
        """
        if self._num_docs == 0:
            self._num_docs = self._get_num_docs_sqlite()
        return self._num_docs

to something that does not rely on the
So I should replace _get_num_docs_sqlite with _get_num_docs_hnsw, is that correct?
Yes, this would be the right approach.
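The replacement agreed on above could look roughly like the following. This is a minimal sketch, not the project's actual code: IndexSketch is a hypothetical stand-in for HnswDocumentIndex, the attribute name _hnsw_index is an assumption, and it assumes a single hnswlib index holds one entry per document. The only library call relied on is get_current_count(), which hnswlib's Index provides.

```python
from typing import Protocol


class SupportsCurrentCount(Protocol):
    """Any object exposing get_current_count(), as hnswlib's Index does."""

    def get_current_count(self) -> int: ...


class IndexSketch:
    """Hypothetical stand-in for HnswDocumentIndex; only counting is shown."""

    def __init__(self, hnsw_index: SupportsCurrentCount) -> None:
        self._hnsw_index = hnsw_index  # assumed attribute name
        self._num_docs = 0  # cached count, mirroring the snippet above

    def _get_num_docs_hnsw(self) -> int:
        # Ask the hnswlib index for its element count
        # instead of scanning the SQLite table.
        return self._hnsw_index.get_current_count()

    def num_docs(self) -> int:
        """Get the number of documents."""
        if self._num_docs == 0:
            self._num_docs = self._get_num_docs_hnsw()
        return self._num_docs
```

Note that this count only reflects entries that were actually added to the hnswlib index, so documents stored without vectors would not be included.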
Thanks! Will do the PR soon.
Hey Joan, I submitted the PR. Please check whether there are any other issues or mistakes in it; I would be happy to fix them and submit the PR again. Thanks!
def num_docs(self) -> int:
def _get_num_docs_hnsw(self) -> int:
Hi, is this issue fixed, or does it still need contributions?
Hey @munish0838, there is still work to be done.
I would like to start working on this issue.
Be my guest. I believe there are parts of the plan that were not applied yet.
Data storage in HnswDocumentIndex works in the following way: hnswlib.

One of the operations we frequently perform is determining the total number of documents (num_docs()). However, the only way to get the number of documents from SQLite is by scanning the entire table. Even though we've made efforts to reduce the number of times we use this functionality (#1729), it's still a time-consuming process.

For better performance, let's do the following: instead of scanning the SQLite table, we can use hnswlib's get_current_count function to quickly get the number of documents in the index.

But there's a potential issue with this approach. What if documents don't have associated vectors? get_current_count would return 0.

We have two potential solutions: