
Add: USearch engine #451

Open · wants to merge 4 commits into main

Conversation

ashvardanian

Thanks for open-sourcing this benchmark, @erikbern!
We have had this fork for a while, and it feels like a good time to contribute.

A couple of things to note:

  1. When you choose f8, recall drops to zero on most of the datasets included in the benchmark. These datasets differ from the embeddings produced by most modern encoder Transformers, and quantization quality depends on the properties of the inserted vectors.
  2. We are planning a 1.0 release soon, which may require changing a couple of characters in the following line:
return self._index.search(np.expand_dims(v, axis=0), k=n)[0][0]

To address the first issue, would it make sense to extend the list of datasets with other embeddings?
We have a few such datasets for ANN benchmarks on our HuggingFace profile.

@erikbern
Owner

erikbern commented Aug 1, 2023

Thanks for adding this! Yeah, I'm definitely open to other datasets – I believe there are a few issues/PRs with similar questions (which reminds me that I really need to go through those – sorry for being so behind).

@kimihailv

As far as I know, ann-benchmarks works with a specific dataset format (i.e., HDF5), so conversion is required.
By the way, the multimodal CC3M dataset (BigG CLIP image and text embeddings) and the arXiv title-abstract dataset (E5 embeddings) could be good options.

@ashvardanian
Author

I also find those to be exciting datasets. I have used both of them when comparing the quality of cross-modal "semantic joins" with uni-modal embeddings in one of my recent posts. Converting to HDF5 shouldn't be an issue at all.

@erikbern, can you please check the conflict with algos.yaml? I don't have permission to resolve the conflict in this UI; I can only try rebasing from the upstream.

@maumueller
Collaborator

@ashvardanian Please rebase your PR against master yourself. You are using an outdated directory structure.

Your implementation is supposed to go into ann_benchmarks/algorithms/usearch/.

@maumueller
Collaborator

With regard to new datasets, it would be great if you could run the benchmark on them and share plots. More diversity is appreciated, also with regard to the observed performance/quality trade-off. Thanks!

@ashvardanian
Author

@maumueller sounds good! Let's split the work into two parts: first merge USearch, then add new datasets. As for USearch, I am considering a few options for the config.yml and am unsure of the best way to structure it.

float:
  any:
  - base_args: ['@metric']
    constructor: USearch
    disabled: false
    docker_tag: ann-benchmarks-usearch
    module: ann_benchmarks.algorithms.usearch
    name: usearch
    run_groups:
      M-12:
        arg_groups: [{M: 12, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-16:
        arg_groups: [{M: 16, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-24:
        arg_groups: [{M: 24, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-36:
        arg_groups: [{M: 36, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-4:
        arg_groups: [{M: 4, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-48:
        arg_groups: [{M: 48, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-64:
        arg_groups: [{M: 64, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-8:
        arg_groups: [{M: 8, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]
      M-96:
        arg_groups: [{M: 96, efConstruction: 500}]
        args: {}
        query_args: [[10, 20, 40, 80, 120, 200, 400, 600, 800]]

The current variant is identical to the hnswlib configuration. However, we also support f16 and i8 internal downcasting. What would be the best way to expose such functionality?
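For reference, here is a rough skeleton of how ann-benchmarks would drive a wrapper matching the config above, assuming it mirrors the hnswlib wrapper: the '@metric' from base_args arrives as the first constructor argument, each arg_group as a single dict, and every value in query_args is applied through set_query_arguments() before a query batch. Method bodies that would call into USearch itself are left as placeholders, and all names are illustrative rather than taken from this PR.

import numpy as np

class USearch:
    def __init__(self, metric: str, method_param: dict):
        self._metric = metric                                   # 'angular' or 'euclidean', from base_args
        self._m = method_param["M"]                             # graph connectivity, from arg_groups
        self._ef_construction = method_param["efConstruction"]  # construction-time expansion factor
        self._ef = None

    def fit(self, X: np.ndarray) -> None:
        ...  # build the index over the train vectors, once per run group

    def set_query_arguments(self, ef: int) -> None:
        self._ef = ef  # search-time expansion factor, one value per entry in query_args

    def query(self, v: np.ndarray, n: int) -> np.ndarray:
        ...  # return the indices of the n approximate nearest neighbors of v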

@maumueller
Collaborator

If you want to run them through the benchmark, they should probably be exposed through a parameter in args. (E.g., [['f32', 'f16', 'i8']].) Please make sure to include the information in the __str__ method of your wrapper class to expose it in the evaluation.

However, note that we enforce a rather strict time limit of 2 hours for building + querying. Afterwards, the container will simply be killed, and the runs that would have given you the best performance might never be carried out.
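A sketch of that suggestion, building on the skeleton above: config.yml would gain args: [['f32', 'f16', 'i8']], which arrives as an extra positional constructor argument, and __str__ reports it so each variant gets its own label in the evaluation. Parameter names here are illustrative, not taken from this PR.

class USearch:
    def __init__(self, metric: str, method_param: dict, dtype: str = "f32"):
        self._metric = metric
        self._m = method_param["M"]
        self._ef_construction = method_param["efConstruction"]
        self._dtype = dtype  # internal representation: 'f32', 'f16', or 'i8'
        self._ef = None

    def set_query_arguments(self, ef: int) -> None:
        self._ef = ef

    def __str__(self) -> str:
        # Encode every parameter that affects the results, so runs stay distinguishable in the plots.
        return (f"USearch(M={self._m}, efC={self._ef_construction}, "
                f"ef={self._ef}, dtype={self._dtype})")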

@maumueller
Collaborator

@ashvardanian What is the status on this issue?

@erikbern
Owner

Happy to merge this if you want to rebase!

@ashvardanian
Author

Hey, @erikbern and @maumueller!
Thank you for following up!

As of right now, I'm in the active development phase of USearch v3 and UCall v1, hoping to finish the first by the end of the month. Having USearch in ann-benchmarks would be great, but I'm unsure if I'll have time to rebase the current implementation before the v3 release.

[Plots attached: search speed, recall, construction speed, construction memory]

That said, even the current HNSW implementation looks highly competitive, especially beyond 50 million vectors, but those results seem to make more sense for the big-ann-benchmarks.

@maumueller
Collaborator

Hi @ashvardanian! This looks interesting, but it does indeed seem more suitable for the big-ann-benchmarks project. Of course, it would also be interesting at the million scale that we focus on here :-)

@ashvardanian
Author

@maumueller, I was also thinking about the big-ann-benchmarks, but this year they've shrunk the dataset size from 1 billion down to tens of millions... We have a new result coming up for 20 billion vectors, and I have no clue where to publish it 😅

As for datasets, the arXiv titles and abstracts encoded with E5, suggested by @kimihailv, are indeed a great option! I've recently used them to build a scientific RAG system, so they are pretty representative of the embeddings actually used in production. I guess we can use glove-25-angular.hdf5 as a reference for the format, right?

>>> f = h5py.File('glove-25-angular.hdf5', 'r')
>>> f.keys()
<KeysViewHDF5 ['distances', 'neighbors', 'test', 'train']>
>>> f["train"].shape
(1183514, 25)
>>> f["distances"].shape
(10000, 100)
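To package a new dataset (e.g. the arXiv/E5 embeddings) in the same layout, a minimal h5py sketch could look like the following. The key names are taken from the inspection above; the 'distance' file attribute is an assumption about what the loader expects, and the ground-truth arrays (neighbors, distances) still have to be computed separately, e.g. by brute force.

import h5py
import numpy as np

def write_ann_benchmarks_hdf5(path, train, test, neighbors, distances, metric="angular"):
    # train: (n_train, dim) and test: (n_test, dim) float32 embeddings
    # neighbors: (n_test, k) indices into train; distances: (n_test, k) true distances
    with h5py.File(path, "w") as f:
        f.attrs["distance"] = metric  # assumption: the metric name is stored as a file attribute
        f.create_dataset("train", data=np.asarray(train, dtype=np.float32))
        f.create_dataset("test", data=np.asarray(test, dtype=np.float32))
        f.create_dataset("neighbors", data=np.asarray(neighbors, dtype=np.int64))
        f.create_dataset("distances", data=np.asarray(distances, dtype=np.float32))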

@patelprateek

@ashvardanian: the results look good. IIUC, the underlying implementation is HNSW; could you please compare the results with the nmslib/hnswlib and faiss HNSW implementations?
