
Algo updates #335

Open
10 of 15 tasks
erikbern opened this issue Mar 8, 2023 · 15 comments

Comments

@erikbern
Owner

erikbern commented Mar 8, 2023

I'm planning to re-run all benchmarks shortly. Going through the algos with some todos (checking boxes as I fix them).

Currently failing in CI:

  • diskann – something with clang
  • kgraph: something with boost during compilation – hasn't been updated since 2019, planning to drop it
  • mrpt – just needs scikit-learn not sklearn
  • nearpy: something with cython. But it's deactivated in algos.yml anyway – let's disable in CI
  • pynndescent: failing trying to compile numpy from source. should be easy to fix
  • rpforest: something with the compilation step. but it failed on most benchmarks last time and was extremely slow on others – let's deactivate in algos.yml and disable in CI (see the config sketch after these lists)
  • sptag: failing on some cuda thing – i can look into it
  • vespa: failing with some python dep – let's fix
  • vearch: also something with numpy compilation
  • qdrant: something with numpy/hdf5 – it's also pinned to an old version (0.11.3) so let's upgrade to the 1.0 series
  • qsgngt: something with the wheel, i'll take a look

Some algos we should probably disable because their performance is bad anyway:

  • nmslib balltree
  • elastiknn (although maybe it's worth keeping – a lot of people will start with this one)
  • puffinn (not terrible, but consistently among the worst)
  • ckdtree
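
For reference, a minimal sketch of what "deactivate in algos.yml" amounts to, assuming the nested point-type → metric → algorithm layout with a `disabled` flag (the exact schema may differ between versions):

```python
# Sketch only: walk algos.yml and list entries marked as disabled.
# Assumes a nested layout of point type -> metric -> algorithm -> settings
# with an optional "disabled" flag; the real schema may differ.
import yaml

with open("algos.yml") as f:
    config = yaml.safe_load(f)

for point_type, metrics in config.items():
    for metric, algos in metrics.items():
        for name, settings in algos.items():
            if isinstance(settings, dict) and settings.get("disabled", False):
                print(f"{name} is disabled for {point_type}/{metric}")
```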
@maumueller
Collaborator

I honestly feel that disabling these implementations is a bad idea. I would rather disable the 5+ implementations of the same idea (HNSW) instead of disabling unique approaches that might be performing a bit worse.

@erikbern
Owner Author

erikbern commented Mar 8, 2023

Yeah, I'm open to keeping some of those – are there any particular ones you think we should not disable?

Mostly just trying to reduce the clutter in the graphs and the total runtime of the benchmarks. But like you said, maybe a better approach would be to reduce some of the duplication. Is there a particular set of hnsw algos you think we could disable?

@maumueller
Collaborator

maumueller commented Mar 9, 2023

I think BallTree(nmslib) (a standard tree data structure) and puffinn (LSH) are worth keeping. As you already mentioned, elastiknn is going to be a default choice for many users, so it would be great if it were to show up in the benchmark.

With respect to decluttering, I would say (though I'm happy for the authors of these implementations to disagree):

  • SW-Graph < HNSW (disable SW-Graph)
  • faiss-ivf < faiss-ivfpqfs (disable faiss-ivf)
  • vamana(diskann) < vamana-pq(diskann) (disable vamana(diskann))
  • hnswlib, hnsw(faiss), hnsw(nmslib), hnsw(vespa), n2 are all HNSW. Pick one? (hnswlib is the reference implementation)

However, many users will use FAISS, so it is probably worth having their hnsw implementation in the benchmark as well.

I plan to run the benchmark locally on a server and record the total time spent for each implementation to get a bit more insight into where we actually spend most of the time.

@erikbern
Owner Author

Excellent points! I think my only counterpoint would be to maybe keep the hnsw(vespa) since I think that's the only vespa one?

Also agree about making sure no algo blows up the total benchmark time. I could potentially add some output for that. Open to subsampling any one that takes too much total time.

@WPJiang

WPJiang commented Mar 16, 2023

Hi @erikbern, it's good news that you are planning to re-run all benchmarks. Just yesterday we updated our algorithm qsgngt to a new version, which fixes the problem above ("qsgngt: something with the wheel, i'll take a look"). You may need to rebuild the Docker image for qsgngt. Please let me know if there are any issues with rerunning qsgngt, thanks a lot.

@erikbern
Owner Author

sounds great. i'll check in the next few days!

@WPJiang

WPJiang commented Mar 20, 2023

sounds great. i'll check in the next few days!

Hi @erikbern, I wonder if there has been any problem rerunning the qsgngt algorithm. We have run ann-benchmarks ourselves and found that it performs very well. We hope this algorithm can make it into your re-run. Thank you!

@jobergum

Excellent points! I think my only counterpoint would be to maybe keep the hnsw(vespa) since I think that's the only vespa one?

Thanks, I would be sad if Vespa was deleted from the runs.

@benwtrent
Contributor

hey y'all,

I have been thinking about this recently myself. I am concerned about ANY algorithm that uses network serialization in its run: Elasticsearch, OpenSearch, Weaviate, Qdrant. These aren't really "algorithms" in the same sense as scann or faiss.

Vespa addressed this issue by adding a special "benchmark" python client that directly wraps the native HNSW implementation. But this just isn't feasible for many typically HTTP driven solutions.

What do we think of just removing all "remote network call" algorithms? Or labeling them as such and comparing them that way? Once you start getting into very efficient algorithms, the network and serialization overhead becomes the dominating factor.

@benwtrent
Contributor

@erikbern ^ I will happily create a different issue for this discussion. Just seemed related to an overall update to the algos :)

@erikbern
Owner Author

I think it would make sense to break out the client-server based ones into a separate chart. That being said, I don't think the network/serialization overhead should be substantial, at least not in high-recall regimes.

@benwtrent
Contributor

That being said I don't think the network/serialization overhead should be substantial, at least not for high-recall regimes

For very high recalls, the constant overhead matters less. However, serializing over the network (whether RESTful, protobufs, etc.) will always carry a significant overhead compared to native memory access. The benchmark then becomes "who can best reduce the network serialization overhead" vs. "who has the better vector search algorithm".

Benchmarking things fairly, correctly, and realistically is a difficult thing. Thank you for all your work on this project @erikbern! It really does help the vector search community.

@maumueller
Collaborator

@benwtrent Good point. We implemented the "prepared query" mode to make it possible to hide this cost, but looking at the current implementations (e.g., https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/algorithms/luceneknn/module.py#L101-L123), it seems it's not exactly used in this way. The idea was that at query time you only send a small signal to carry out the actual query, while all serialization/transfer is done in prepare_query and get_prepared_....
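
A rough sketch of the intended split for a client-server backend (the client methods are hypothetical, and the hook names other than prepare_query may not match the interface exactly):

```python
# Illustrative only: the intended "prepared query" split for a remote backend.
# client.store_query / client.execute are hypothetical service calls, and the
# method names other than prepare_query may not match the real interface.
class RemoteIndex:
    def __init__(self, client):
        self.client = client        # HTTP/gRPC client for the remote service
        self._handle = None
        self._results = None

    def prepare_query(self, q, n):
        # All serialization and network transfer happens here, outside the
        # timed section: ship the vector to the service, keep a handle.
        self._handle = self.client.store_query(vector=list(q), k=n)

    def run_prepared_query(self):
        # Inside the timed section only a small "execute" signal is sent.
        self._results = self.client.execute(self._handle)

    def get_prepared_query_results(self):
        return self._results
```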

I think it would be great to tag each implementation with information on whether it is an algorithm or a library, plus some info on the basic algorithm it implements (like our huge collection of HNSW implementations...). Can we just decorate the constructor(s) in each module?
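
One possible shape for that, purely as a sketch (the decorator and attribute names are invented):

```python
# Hypothetical decorator that attaches metadata to an implementation's
# constructor; the field names are invented for illustration.
def tag(kind, base_algorithm):
    def wrap(cls):
        cls.kind = kind                      # e.g. "library" vs. "service"
        cls.base_algorithm = base_algorithm  # e.g. "hnsw", "lsh", "ivf"
        return cls
    return wrap


@tag(kind="library", base_algorithm="hnsw")
class HnswLib:
    """Stand-in for a module's constructor class."""
```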

@benwtrent
Contributor

@maumueller

We implemented the "prepared query" mode ... The idea was that at query time you only send a small signal to carry out the actual query, but all serialization/transfer is done in the prepare_query and get_prepared_....

This assumes that the service implements a "store query & execute later" API. I wouldn't expect many vector search systems to implement this as it's not a common case (users of these systems just query with a vector and expect a result). It would be strange to me to make a special API in a production system just for benchmarking.

I think these things could be decorations on the constructor. Or, I was thinking of a new field in the algo definition file tagging the algorithm kind or something. That way it's easily filterable in runs, and result files can be annotated with those tags.
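
A small sketch of that alternative; the `kind` field and the example entries are hypothetical:

```python
# Hypothetical: parsed algos.yml entries extended with a "kind" field, used to
# pick which implementations to run and to annotate result files.
definitions = {
    "hnswlib": {"docker-tag": "ann-benchmarks-hnswlib", "kind": "library"},
    "qdrant":  {"docker-tag": "ann-benchmarks-qdrant",  "kind": "service"},
}

# Keep client-server systems out of this run (or route them to their own chart).
to_run = {name: d for name, d in definitions.items() if d.get("kind") != "service"}
print(sorted(to_run))  # -> ['hnswlib']
```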

@maumueller
Collaborator

Right, the config is probably a better place at this point in time.
