Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support proper updates in StringIndex #3

Open
dylanrb123 opened this issue Jun 2, 2023 · 7 comments
Open

Support proper updates in StringIndex #3

dylanrb123 opened this issue Jun 2, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@dylanrb123
Copy link
Contributor

dylanrb123 commented Jun 2, 2023

When you call index.add with an existing ID but a new vector, Voyager will correctly locate the correct spot in the graph to place the new vector and update all of the existing connections. However, the StringIndex abstraction does not support this behavior. As written, the string index will add a new vector and new ID if index.add is called with a name that is already present in the index, instead of updating the existing one.

Instead of adding a new item and keeping the old one around, Voyager should check the existing items list and update the underlying index accordingly with the new vector value so we don't end up with duplicate items in the index.

@karchx
Copy link

karchx commented Jun 20, 2023

Hi @dylanrb123
Just as a doubt, does that imply doing a search algorithm on the HNSW algorithm to find the elements that already exist?

@markkohdev markkohdev added the enhancement New feature or request label Sep 20, 2023
@dylanrb123
Copy link
Contributor Author

Hi @karchx, I updated the description of the issue to make it more clear. I'm not sure I fully understand your question but let me know if anything is still unclear.

@samek
Copy link
Contributor

samek commented Oct 21, 2023

@dylanrb123
you mean like checking If we already have this name in it and if use index of that name ?
eg.

public void addItem(String name, float[] vector) {
    int updateIndex = names.indexOf(name);
    if (updateIndex==-1) {
      int nextIndex = names.size();
      index.addItem(vector, nextIndex);
      names.add(name);
    } else {
      index.addItem(vector, updateIndex);
    }
}

@markkohdev
Copy link
Contributor

Moving discussion about this from #64 to here. Relevant comments:

From @markkohdev

In regards to upserting, after chatting with @dylanrb123 about this change, we don't think that this approach is how we would like to see this implemented. The implementation you have here results in traversal of the entire names list upon every addition in the worst case.

In order to support upserting functionality I think we would need a slightly more involved solution. I'll be implementing StringIndex in the C++ core soon enough and will implement this functionality there (and deprecate this class).

I would recommend removing the upsert functionality for the moment and reducing the scope of this PR. However, if you are very motivated to enable this functionality before StringIndex support in voyager core comes along we can discuss how we might support this using a naive implementation in Java (effectively utilizing parallel HashMaps instead of an ArrayList for managing the strings)

From @samek

Ok, I've removed the functionality.
But I'm keen on making this until we have whole string index in C++ core.
What did you have in mind should we use ConcurrentHashMap ?

@markkohdev
Copy link
Contributor

markkohdev commented Apr 9, 2024

As mentioned above, how we would probably want to see this implemented is a bit complicated as there are performance/memory implications for this to be enabled.

Let's start with how this would work in its basic form:

  • Instead of maintaining names as an List, we would manage two parallel HashMaps: one mapping str -> int ("strToInt") and one mapping int -> str (intToStr)
  • Manage an AtomicInteger maxElementId which contains the current maximum ID (this will be used later)
  • When adding an item with a string key, first check strToInt for presence of that string key.
    • If it already exists, call to the core addItem(s) using the existing key.
    • If it doesn't exist, add generate a new int ID via maxElementId.getAndIncrement() and add the appropriate string and ints to the maintained maps.
  • When querying with a vector, take the IDs from the response and resolve them to their string keys using the intToStr map

Some details for consideration

  • This solution would increase the memory utilization substantially -- 200% of current string storage, and an additional 2 copies of the integer key set. In order to combat this, we should consider providing a flag or perhaps even a derived class which would switch from list to the hashmap solution
  • As you mentioned in your comment, ConcurrentHashMap would likely be ideal here in case of JVM parallelism on top of Voyager's core parallelism

^ cc @dylanrb123

@dylanrb123
Copy link
Contributor Author

I have a WIP branch with this partially implemented using two ConcurrentHashMaps https://github.com/spotify/voyager/tree/upsert-string-index. As @markkohdev mentioned, that approach will substantially increase memory usage to support a feature that most people probably don't care about which is why I didn't finish the implementation. IMO it would be better to just wait until we pull this into the C++ code or make it configurable as Mark suggested.

@samek
Copy link
Contributor

samek commented Apr 10, 2024

@dylanrb123 for me it's important to be able to update existing record. Now question is will you finish this the way @markkohdev suggested or should I give it a go?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants