Skip to content
This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

Forcemerge and deleted graph nodes #353

Open
neo-anderson opened this issue Jul 7, 2021 · 3 comments
Open

Forcemerge and deleted graph nodes #353

neo-anderson opened this issue Jul 7, 2021 · 3 comments
Assignees

Comments

@neo-anderson
Copy link

neo-anderson commented Jul 7, 2021

This plugin rocks! Thank you!

I have a question regarding graph nodes marked for deletion. From reading HNSW's Github issues, I understand that nodes are discoverable even if their neighbors are all deleted. However, I assume that the kNN recall would drop as the deleted nodes pile up. I'd want to periodically rebuild the graphs to maintain high recall.

The recommendation in OpenDistro documentation is to forcemerge to 1 segment. However, I don't want to do this since

  • the index should be read-only / not actively written to (according to the warning in ES documentation)
  • if the resulting segment is greater than 5 GB and nodes in the corresponding graphs later get marked for deletion, they will not be actually deleted as house keeping ignores segments > 5GB

Assuming the above points are correct,

  1. Is a full merge required or will setting only_expunge_deletes to True achieve the same result? i.e, delete the marked nodes and "fix" links. I'm assuming no, since graphs are rebuilt only when segments are merged.
  2. Is there a way to run forcemerge without the risk of ending up with a big segment that will be ignored during future merges?
  3. If I dont force merge, will the graphs get rebuilt when ES "periodically" merges the segments? Unfortunately, I can't find much info about what "periodically" means. If I dont write to the index, but there are some deleted nodes, will the cleanup eventually happen?
  4. Is it possible to estimate how much the recall will be affected as the number of marked nodes increase?
@vamshin vamshin self-assigned this Jul 7, 2021
@vamshin
Copy link
Member

vamshin commented Jul 9, 2021

Hi @neo-anderson

Is a full merge required or will setting only_expunge_deletes to True achieve the same result? i.e, delete the marked nodes and "fix" links. I'm assuming no, since graphs are rebuilt only when segments are merged.

Your understanding is right. Graphs will be rebuilt when segments are merged or number of segments in a shard reach a particular threshold. In case of deletes, I think setting only_expunge_deletes seems like a good idea. You can define threshold beyond which expunge should be considered. I find this thread useful

Is there a way to run forcemerge without the risk of ending up with a big segment that will be ignored during future merges?

Possible. You can scale horizontally by adding more shards and this way you wont end up with huge shards and inturn huge segments. Also you do not have to forcemerge to single segment, having around 5 segments per shard should be decent enough definitely at the cost of some latency.

If I dont force merge, will the graphs get rebuilt when ES "periodically" merges the segments? Unfortunately, I can't find much info about what "periodically" means. If I dont write to the index, but there are some deleted nodes, will the cleanup eventually happen?

Correct. Graphs get rebuilt when ES merges the segment. To understand about merges, I find this blog useful

Is it possible to estimate how much the recall will be affected as the number of marked nodes increase?

Hard to determine as there are multiple variables here in terms of algo params. But if you are worried about delete docs, set the expunge threshold to a very low value, so that lucene would trigger merges. You may want to run some experiments to get an estimate.

@neo-anderson
Copy link
Author

If graphs are only rebuilt when segments are merged, what would happen when forcemerge is run with only_expunge_deletes ? Will it delete the marked nodes in each graph AND rebuild the graphs?

Also, do you think my concern of nodes marked for deletion impacting ANN recall is not valid? According to this post, search wouldn't be affected even if a lot of nodes are marked for deletion. If so, will recall of say top 10 neighbors also not be affected?

@vamshin
Copy link
Member

vamshin commented Jul 17, 2021

If graphs are only rebuilt when segments are merged, what would happen when forcemerge is run with only_expunge_deletes ? Will it delete the marked nodes in each graph AND rebuild the graphs?

only_expunge_deletes also triggers forcemerge on the segments that contain deleted documents over a certain threshold. New graphs would be constructed only for these segments with out deleted nodes/documents.

Also, do you think my concern of nodes marked for deletion impacting ANN recall is not valid? According to this post, search wouldn't be affected even if a lot of nodes are marked for deletion. If so, will recall of say top 10 neighbors also not be affected?

Recall could be impacted based on the percentage of deleted docs/nodes in the graph. Its hard to exactly determine the impact on recall. If the deleted nodes happen to be the nearest neighbors to the query, then recall could be impacted.

One solution for this problem is to consider large k, and have size attribute determine final neighbors. This way each segment would consider more neighbors and as a result you would still possibly end up with nodes that are behind the deleted nodes and closest to the query

For Example: If I would like to get 5 neighbors to my query, I would chose size=5 and k=a*5 (a>=1). magnitude of 'a' really depends on possibility of number of deleted docs. If your use case has higher number of deleted docs, then choose large a. Note, choosing large a should not have great impact on search latency. You may want to experiment at your side.

Query to retrieve 5 nearest neighbors would look something like

GET my-knn-index-1/_search
{
  "size": 5,
  "query": {
    "knn": {
      "my_vector2": {
        "vector": [2, 3, 5, 6],
        "k": 50
      }
    }
  }
}

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants