Support vectorsets at shard level #2129

jotare · 2024-05-07T15:05:21Z

Description

Describe the proposed changes made in this PR.

How was this PR tested?

Describe how you tested this PR.

github-actions

Benchmark

Benchmark suite	Current: `e185a99`	Previous: `6c53c37`	Ratio
`nucliadb/search/tests/unit/search/test_fetch.py::test_highligh_error`	`13200.175746941139` iter/sec (`stddev: 3.1607053980994163e-7`)	`13198.084460244272` iter/sec (`stddev: 3.4157717375989134e-7`)	`1.00`

This comment was automatically generated by workflow using github-action-benchmark.

javitonino · 2024-05-09T14:17:11Z

nucliadb_node/src/grpc/grpc_writer.rs

+        let task = move || {
+            run_with_telemetry(info_span!(parent: &span, "Add a vectorset"), move || {
+                let shard = obtain_shard(shards, shard_id.clone())?;
+                shard.create_vectors_index(NewVectorsIndex {


We talked about including the dimension in the create vectorset request. Are we not doing that for any particular reason?

After today's talk we would also need to eventually add more info like the datatype, etc.

No, I probably missed the conversation. Anyway, changing this would imply changing the new shard request too. Do we want to mix it in this PR?

javitonino · 2024-05-09T14:29:08Z

nucliadb_node/src/shards/shard_writer.rs

-            result
-        };
+        let mut vector_tasks = vec![];
+        for (_, vector_writer) in indexes.vectors_indexes.iter_mut() {


Won't this write the same vector to all vectorsets?

I guess it's still a placeholder until we change the SetResource message?

Yes. Maybe that's a good opportunity to clean up the protobufs and pass a custom struct to nucliadb_vectors instead of the whole Resource.
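To make the concern in this thread concrete, here is a minimal sketch of what routing per-vectorset data could look like. Everything in it is hypothetical: `VectorsetUpdate`, `VectorWriter`, and `set_resource_all` are illustrative names, not the actual nucliadb_vectors API or the eventual SetResource message. The idea is only that each writer receives the vectors addressed to its own vectorset, and that a per-index dimension (as raised in the earlier thread) would let a writer reject vectors that do not belong to it.

```rust
use std::collections::HashMap;

/// Hypothetical per-vectorset payload (an assumption, not the real proto):
/// the vectors of one resource destined for a single vectorset.
struct VectorsetUpdate {
    resource_id: String,
    /// Sentence key -> embedding.
    sentences: HashMap<String, Vec<f32>>,
}

/// Hypothetical stand-in for one vectors index writer.
struct VectorWriter {
    dimension: usize,
    stored: HashMap<String, Vec<f32>>,
}

impl VectorWriter {
    /// Stores the update, rejecting it if any vector has the wrong dimension.
    fn set_resource(&mut self, update: &VectorsetUpdate) -> Result<usize, String> {
        for vector in update.sentences.values() {
            if vector.len() != self.dimension {
                return Err(format!(
                    "expected dimension {}, got {}",
                    self.dimension,
                    vector.len()
                ));
            }
        }
        for (key, vector) in &update.sentences {
            let id = format!("{}/{}", update.resource_id, key);
            self.stored.insert(id, vector.clone());
        }
        Ok(update.sentences.len())
    }
}

/// Hands each writer only the update addressed to its own vectorset,
/// instead of passing the same resource to every writer.
fn set_resource_all(
    writers: &mut HashMap<String, VectorWriter>,
    updates: &HashMap<String, VectorsetUpdate>,
) -> Result<usize, String> {
    let mut written = 0;
    for (vectorset, writer) in writers.iter_mut() {
        if let Some(update) = updates.get(vectorset) {
            written += writer.set_resource(update)?;
        }
    }
    Ok(written)
}
```

With two writers of dimensions 2 and 3 and one update per vectorset, each writer ends up storing only vectors of its own dimension. A writer with no matching update is simply skipped.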

javitonino · 2024-05-09T14:31:31Z

nucliadb_node/src/shards/shard_writer.rs

-                merged: 0,
-                left: 0,
-            });
+        // TODO: return metrics by vectorset, not only the default one


Unless I missed something, this only runs a merge on the default index. I'd change this TODO to say that it not only returns metrics for the default index, but also merges only the default index. Or even better, actually merge all indexes.

Yes, I know. I wasn't sure whether I wanted to change the merge protos too.

codecov · 2024-05-09T15:30:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.92%. Comparing base (169a3e9) to head (e6107a1).
Report is 3 commits behind head on main.

❗ Current head e6107a1 differs from pull request most recent head e185a99. Consider uploading reports for the commit e185a99 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2129      +/-   ##
==========================================
- Coverage   75.02%   74.92%   -0.11%     
==========================================
  Files          80       80              
  Lines        5866     5894      +28     
==========================================
+ Hits         4401     4416      +15     
- Misses       1465     1478      +13

Flag	Coverage Δ
ingest	`70.30% <ø> (-0.14%)`	⬇️
utils	`81.53% <ø> (ø)`

javitonino · 2024-05-14T10:11:43Z

nucliadb_node/src/shards/shard_writer.rs

+            for (name, vectors_index) in indexes.vectors_indexes.iter() {
+                let runner = vectors_index.prepare_merge(context.parameters);
+                if let Ok(Some(mut runner)) = runner {
+                    let result = runner.run();


runner.run() must be outside of any locks because it's the slow part and we don't want to block other operations in the index meanwhile.

So this needs to be 3 blocks:

    {
        indexes = read_rw_lock()
        for each index { prepare_merge() }
    }
    for each index { runner.run() }
    {
        indexes = write_rw_lock()
        for each index { record_merge() }
    }
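The three blocks described above can be sketched in Rust with a plain `std::sync::RwLock`. This is a sketch under stated assumptions, not the code that landed in the PR: `VectorsIndex`, `MergeRunner`, `MergeResult`, and `merge_all` are hypothetical stand-ins, and only the method names `prepare_merge`, `run`, and `record_merge` are taken from the review comment. The point it illustrates is that both lock guards are dropped before and after the slow `run()` call.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical snapshot of the work to do, taken under the read lock.
struct MergeRunner {
    segments: Vec<u64>,
}

/// Hypothetical outcome of a finished merge.
struct MergeResult {
    merged_segments: Vec<u64>,
}

impl MergeRunner {
    /// The slow part. It owns its data, so it needs no lock at all.
    fn run(self) -> MergeResult {
        MergeResult { merged_segments: self.segments }
    }
}

/// Hypothetical stand-in for one vectors index.
struct VectorsIndex {
    pending: Vec<u64>,
    merged_total: usize,
}

impl VectorsIndex {
    /// Cheap: returns None when there is nothing to merge.
    fn prepare_merge(&self) -> Option<MergeRunner> {
        if self.pending.is_empty() {
            None
        } else {
            Some(MergeRunner { segments: self.pending.clone() })
        }
    }

    /// Cheap: drops only the segments that were actually merged, so
    /// segments added while the merge was running are preserved.
    fn record_merge(&mut self, result: &MergeResult) {
        self.pending.retain(|s| !result.merged_segments.contains(s));
        self.merged_total += result.merged_segments.len();
    }
}

/// Merges every index and returns the number of merged segments per index name.
fn merge_all(indexes: &RwLock<HashMap<String, VectorsIndex>>) -> HashMap<String, usize> {
    // Block 1: read lock, prepare only. The guard is dropped at the end of the block.
    let runners: Vec<(String, MergeRunner)> = {
        let guard = indexes.read().unwrap();
        guard
            .iter()
            .filter_map(|(name, index)| index.prepare_merge().map(|r| (name.clone(), r)))
            .collect()
    };

    // Block 2: no lock held while the slow merges run.
    let results: Vec<(String, MergeResult)> = runners
        .into_iter()
        .map(|(name, runner)| (name, runner.run()))
        .collect();

    // Block 3: write lock, record only.
    let mut merged = HashMap::new();
    let mut guard = indexes.write().unwrap();
    for (name, result) in &results {
        // An index may have been removed while its merge ran; skip it in that case.
        if let Some(index) = guard.get_mut(name) {
            index.record_merge(result);
            merged.insert(name.clone(), result.merged_segments.len());
        }
    }
    merged
}
```

Keying the results by index name also gives a natural place to report metrics per vectorset rather than only for the default one.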

jotare requested a review from a team May 7, 2024 15:05

jotare marked this pull request as draft May 7, 2024 15:05

github-actions bot reviewed May 8, 2024

jotare force-pushed the joanantoniriera4168/sc-10087/support-vectorsets-at-shard-level branch from b2f0d50 to 9b86004 Compare May 8, 2024 16:47

jotare marked this pull request as ready for review May 9, 2024 10:22

jotare force-pushed the joanantoniriera4168/sc-10087/support-vectorsets-at-shard-level branch from cdce481 to 1796ac9 Compare May 9, 2024 11:07

javitonino reviewed May 9, 2024

javitonino approved these changes May 10, 2024

jotare requested a review from javitonino May 14, 2024 09:09

javitonino reviewed May 14, 2024

jotare and others added 19 commits May 14, 2024 12:31

Basic vectorset support in shard writer

409facb

New, open, set and remove resource, GC and reload

Skip implementation of vectorsets merge and replication (for now)

e24b2c2

Add create vectors index function and some renames

ce94df1

Add normalize_vector parameter to new vectorset call

10fc972

Add normalize_vector parameter to new vectorset call

6db4fcd

Implement add_vector_set gRPC call

d6e1aef

Add basic test creating 2 vectors indexes and setting a resource

794dc3c

Implement remove vectorset gRPC call

391bb4a

Implement list vectorsets gRPC call

d77893d

Add more operations on the vectorset test

4cb0f2a

Fix rebase

cffb989

Fix python lints after protos changes

a4173db

Better way to pass parameters to open_vectors_writer

49ebc2c

Co-authored-by: Javier Torres <javier@javiertorres.eu>

Fix

f66e5b2

No need to store ShardIndexes in the reader

cdbff40

Remove print from test

e1fd109

Start vectorsets support in shard reader

4fe83e1

Use proto vectorset on search

d1e3507

More

1765c6b

jotare added 4 commits May 14, 2024 12:33

Merge all vectors indexes and return any error

7e25d86

Add vectorset sentences to index paragraph proto

3e8a700

Use it on the tests

031924e

Fix merge so it doesn't block indexes lock

e185a99

jotare force-pushed the joanantoniriera4168/sc-10087/support-vectorsets-at-shard-level branch from 220d663 to e185a99 Compare May 14, 2024 10:33

javitonino approved these changes May 14, 2024

jotare merged commit 55fd11d into main May 14, 2024
107 checks passed

jotare deleted the joanantoniriera4168/sc-10087/support-vectorsets-at-shard-level branch May 14, 2024 11:08
