
Add: USearch engine #451

Open · wants to merge 4 commits into main

Conversation

ashvardanian

Thanks for open-sourcing this benchmark, @erikbern!
We have had this fork for a while, and it feels like a good time to contribute.

A couple of things to note:

  1. When you choose f8, recall drops to zero on most of the datasets included in the benchmark. These datasets differ from the embeddings produced by most modern encoder Transformers, and quantization quality depends on the properties of the inserted vectors.
  2. We are planning a 1.0 release soon, which may require changing a couple of characters in the following line:
return self._index.search(np.expand_dims(v, axis=0), k=n)[0][0]

To address the first issue, would it make sense to extend the list of datasets with other embeddings?
We have a few such datasets for ANN benchmarks on our HuggingFace profile.

@erikbern
Owner

erikbern commented Aug 1, 2023

Thanks for adding this! Yeah, I'm definitely open to other datasets – I believe there are a few issues/PRs with similar questions (which reminds me that I really need to go through those – sorry for being so behind).

@kimihailv

As far as I know, ann-benchmarks works with a specific dataset format (i.e., HDF5), so conversion is required.
By the way, the multimodal CC3M dataset (BigG CLIP image and text embeddings) and the arXiv title-abstract dataset (E5 embeddings) could be good options.

@ashvardanian
Author

I also find those to be exciting datasets. I have used both of them when comparing the quality of cross-modal "semantic joins" with uni-modal embeddings in one of my recent posts. Converting to HDF5 shouldn't be an issue at all.

@erikbern, can you please check the conflict with algos.yaml? I don't have permission to resolve the conflict in this UI; I can only try rebasing from the upstream.

@maumueller
Collaborator

@ashvardanian Please rebase your PR against master yourself. You are using an outdated directory structure.

Your implementation is supposed to go into ann_benchmarks/algorithms/usearch/.

@maumueller
Collaborator

With regard to new datasets, it would be great if you could run the benchmark on them and share plots. More diversity is appreciated, also with regard to the observed performance/quality trade-off. Thanks!

@ashvardanian
Author

@maumueller sounds good! Let's split the work into two parts: first merge USearch, then add new datasets. As for USearch, I am considering a few options for the config.yml and am unsure of the best way to structure it.

float:
  any:
  - base_args: ['@metric']
    constructor: USearch
    disabled: false
    docker_tag: ann-benchmarks-usearch
    module: ann_benchmarks.algorithms.usearch
    name: usearch
    run_groups:
      M-12:
        arg_groups: [{M: 12, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-16:
        arg_groups: [{M: 16, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-24:
        arg_groups: [{M: 24, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-36:
        arg_groups: [{M: 36, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-4:
        arg_groups: [{M: 4, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-48:
        arg_groups: [{M: 48, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-64:
        arg_groups: [{M: 64, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-8:
        arg_groups: [{M: 8, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-96:
        arg_groups: [{M: 96, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]

The current variant is identical to the hnswlib configuration. However, we also support f16 and i8 internal downcasting. What would be the best way to expose such functionality?
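For reference, here is a rough skeleton of how ann-benchmarks would drive a wrapper matching the config above, assuming it mirrors the hnswlib wrapper: the '@metric' from base_args arrives as the first constructor argument, each arg_group as a single dict, and every value in query_args is applied through set_query_arguments() before a query batch. Method bodies that would call into USearch itself are left as placeholders, and all names are illustrative rather than taken from this PR.

import numpy as np

class USearch:
    def __init__(self, metric: str, method_param: dict):
        self._metric = metric                                   # 'angular' or 'euclidean', from base_args
        self._m = method_param["M"]                             # graph connectivity, from arg_groups
        self._ef_construction = method_param["efConstruction"]  # construction-time expansion factor
        self._ef = None

    def fit(self, X: np.ndarray) -> None:
        ...  # build the index over the train vectors, once per run group

    def set_query_arguments(self, ef: int) -> None:
        self._ef = ef  # search-time expansion factor, one value per entry in query_args

    def query(self, v: np.ndarray, n: int) -> np.ndarray:
        ...  # return the indices of the n approximate nearest neighbors of v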

@maumueller
Collaborator

If you want to run them through the benchmark, they should probably be exposed through a parameter in args. (E.g., [['f32', 'f16', 'i8']].) Please make sure to include the information in the __str__ method of your wrapper class to expose it in the evaluation.

However, note that we enforce a rather strict time limit of 2 hours for building + querying. Afterwards, the container will simply be killed, and the runs that would have given you the best performance might never be carried out.
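A sketch of that suggestion, building on the skeleton above: config.yml would gain args: [['f32', 'f16', 'i8']], which arrives as an extra positional constructor argument, and __str__ reports it so each variant gets its own label in the evaluation. Parameter names here are illustrative, not taken from this PR.

class USearch:
    def __init__(self, metric: str, method_param: dict, dtype: str = "f32"):
        self._metric = metric
        self._m = method_param["M"]
        self._ef_construction = method_param["efConstruction"]
        self._dtype = dtype  # internal representation: 'f32', 'f16', or 'i8'
        self._ef = None

    def set_query_arguments(self, ef: int) -> None:
        self._ef = ef

    def __str__(self) -> str:
        # Encode every parameter that affects the results, so runs stay distinguishable in the plots.
        return (f"USearch(M={self._m}, efC={self._ef_construction}, "
                f"ef={self._ef}, dtype={self._dtype})")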

@maumueller
Collaborator

@ashvardanian What is the status on this issue?

@erikbern
Owner

Happy to merge this if you want to rebase!

@ashvardanian
Author

Hey, @erikbern and @maumueller!
Thank you for following up!

As of right now, I'm in the active development phase of USearch v3 and UCall v1, hoping to finish the first by the end of the month. Having USearch in ann-benchmarks would be great, but I'm unsure if I'll have time to rebase the current implementation before the v3 release.

[Plots attached: search speed, recall, construction speed, construction memory]

That said, even the current HNSW implementation looks highly competitive, especially beyond 50 million vectors, but those results seem to make more sense for the big-ann-benchmarks.

@maumueller
Collaborator

Hi @ashvardanian! This looks interesting, but it does indeed seem more suitable for the big-ann-benchmarks project. Of course, it would also be interesting at the million scale that we focus on here :-)

@ashvardanian
Author

@maumueller, I was also thinking about the big-ann-benchmarks, but this year they've shrunk the dataset size from 1 billion down to tens of millions... We have a new result coming up for 20 billion vectors, and I have no clue where to publish it 😅

As for datasets, the arXiv titles and abstracts encoded with E5, suggested by @kimihailv, are indeed a great option! I've recently used them to build a scientific RAG system, so they are pretty representative of the embeddings actually used in production. I guess we can use glove-25-angular.hdf5 as a reference for the format, right?

>>> f = h5py.File('glove-25-angular.hdf5', 'r')
>>> f.keys()
<KeysViewHDF5 ['distances', 'neighbors', 'test', 'train']>
>>> f["train"].shape
(1183514, 25)
>>> f["distances"].shape
(10000, 100)
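To package a new dataset (e.g. the arXiv/E5 embeddings) in the same layout, a minimal h5py sketch could look like the following. The key names are taken from the inspection above; the 'distance' file attribute is an assumption about what the loader expects, and the ground-truth arrays (neighbors, distances) still have to be computed separately, e.g. by brute force.

import h5py
import numpy as np

def write_ann_benchmarks_hdf5(path, train, test, neighbors, distances, metric="angular"):
    # train: (n_train, dim) and test: (n_test, dim) float32 embeddings
    # neighbors: (n_test, k) indices into train; distances: (n_test, k) true distances
    with h5py.File(path, "w") as f:
        f.attrs["distance"] = metric  # assumption: the metric name is stored as a file attribute
        f.create_dataset("train", data=np.asarray(train, dtype=np.float32))
        f.create_dataset("test", data=np.asarray(test, dtype=np.float32))
        f.create_dataset("neighbors", data=np.asarray(neighbors, dtype=np.int64))
        f.create_dataset("distances", data=np.asarray(distances, dtype=np.float32))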

@patelprateek

@ashvardanian: the results look good. IIUC, the underlying implementation is HNSW; could you please compare the results with the nmslib/hnswlib and faiss HNSW implementations?
