Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fewer than expected results were retrieved when querying for len(index) items #53

Open
cvillela opened this issue Feb 28, 2024 · 2 comments

Comments

@cvillela
Copy link

This is a noted issue already, but I am opening another card for visibility as the previous one has been open for over 4 months!

My objective would be to find the furthest neighbor in a index from a specific vector.

Calls for querying for N neighbors in an index of length N results in RuntimeError: Fewer than expected results were retrieved
There are no NaN's in the set.

cluster_index = Index(
            index.space,
            index.num_dimensions,
            index.M,
            index.ef_construction,
            storage_data_type=index.storage_data_type
        )
        
cluster_index.add_items(
       vectors=index.get_vectors(list(cluster_dict[largest_cluster_key])),
       ids=list(cluster_dict[largest_cluster_key])
)
if np.any(np.isnan(cluster_index.get_vectors(list(cluster_index.ids)))):
            print("Nan found!")

print(f"Len Index {len(cluster_index)}")
neighbors, _ = cluster_index.query(
            vectors=any_vector,
            k=len(cluster_index)
        )

outputs

Len Index 828
RuntimeError: Fewer than expected results were retrieved; only found 825 of 828 requested neighbors.

Is this a parameter tuning problem? Such as any of the "ef" parameters in construction or querying?

Please note that this index also does not contain any mark_deleted() elements

@dylanrb123
Copy link
Contributor

Hi @cvillela, that error occurs when the HNSW lookup is unable to retrieve the requested number of neighbors. This can happen if the underlying graph is for some reason disconnected. You could try increasing the M parameter at build time in order to increased the connectedness of the graph, or increase ef_construction in order to increase the search depth into the graph at build time.

Having said that, what are you trying to accomplish here by retrieving the entire dataset from the index? Might it be better served by just brute force calculating the distance between the target vector and every item in your dataset? You don't really get any advantage from an approximate nearest neighbors search if you end up searching the entire graph, since the library will calculate whatever distance measurement against every resolved candidate and the target vector anyway.

@cvillela
Copy link
Author

cvillela commented Apr 9, 2024

Hey @dylanrb123 , thank you very much for the response and sorry for the delay.

I am building an implementation of K-Means clustering with the Voyager library. An heuristic approximation for the construction of the clusters suffices for my application, and that is why I chose Voyager for the task.

On a specific step of the iteration, I am aiming to find the furthest neighbor inside a predefined cluster in order to set new centroids. This is the call crashing the program.

But I understand your explanation and, indeed, maybe it would make more sense to brute-force this step of the algorithm. Nevertheless I do believe documentation is a bit abstract in explaining the functionality of the construction parameters.

I will be closing this issue! Thanks again for the reply

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants