Fewer than expected results were retrieved when querying for len(index) items #53

cvillela · 2024-02-28T16:41:14Z

This is a noted issue already, but I am opening another card for visibility as the previous one has been open for over 4 months!

My objective would be to find the furthest neighbor in a index from a specific vector.

Calls for querying for N neighbors in an index of length N results in RuntimeError: Fewer than expected results were retrieved
There are no NaN's in the set.

cluster_index = Index(
            index.space,
            index.num_dimensions,
            index.M,
            index.ef_construction,
            storage_data_type=index.storage_data_type
        )
        
cluster_index.add_items(
       vectors=index.get_vectors(list(cluster_dict[largest_cluster_key])),
       ids=list(cluster_dict[largest_cluster_key])
)

if np.any(np.isnan(cluster_index.get_vectors(list(cluster_index.ids)))):
            print("Nan found!")

print(f"Len Index {len(cluster_index)}")
neighbors, _ = cluster_index.query(
            vectors=any_vector,
            k=len(cluster_index)
        )

outputs

Len Index 828
RuntimeError: Fewer than expected results were retrieved; only found 825 of 828 requested neighbors.

Is this a parameter tuning problem? Such as any of the "ef" parameters in construction or querying?

Please note that this index also does not contain any mark_deleted() elements

The text was updated successfully, but these errors were encountered:

dylanrb123 · 2024-03-29T18:36:44Z

Hi @cvillela, that error occurs when the HNSW lookup is unable to retrieve the requested number of neighbors. This can happen if the underlying graph is for some reason disconnected. You could try increasing the M parameter at build time in order to increased the connectedness of the graph, or increase ef_construction in order to increase the search depth into the graph at build time.

Having said that, what are you trying to accomplish here by retrieving the entire dataset from the index? Might it be better served by just brute force calculating the distance between the target vector and every item in your dataset? You don't really get any advantage from an approximate nearest neighbors search if you end up searching the entire graph, since the library will calculate whatever distance measurement against every resolved candidate and the target vector anyway.

cvillela · 2024-04-09T18:43:16Z

Hey @dylanrb123 , thank you very much for the response and sorry for the delay.

I am building an implementation of K-Means clustering with the Voyager library. An heuristic approximation for the construction of the clusters suffices for my application, and that is why I chose Voyager for the task.

On a specific step of the iteration, I am aiming to find the furthest neighbor inside a predefined cluster in order to set new centroids. This is the call crashing the program.

But I understand your explanation and, indeed, maybe it would make more sense to brute-force this step of the algorithm. Nevertheless I do believe documentation is a bit abstract in explaining the functionality of the construction parameters.

I will be closing this issue! Thanks again for the reply

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fewer than expected results were retrieved when querying for len(index) items #53

Fewer than expected results were retrieved when querying for len(index) items #53

cvillela commented Feb 28, 2024

dylanrb123 commented Mar 29, 2024

cvillela commented Apr 9, 2024

Fewer than expected results were retrieved when querying for len(index) items #53

Fewer than expected results were retrieved when querying for len(index) items #53

Comments

cvillela commented Feb 28, 2024

dylanrb123 commented Mar 29, 2024

cvillela commented Apr 9, 2024