
VectorDB hosted solution takes a lot of time to push vectors #51

Open
TheSeriousProgrammer opened this issue Jul 24, 2023 · 13 comments · May be fixed by #52

Comments

TheSeriousProgrammer commented Jul 24, 2023

I tried to use vectordb's hosted provision from Jina AI, following the commands mentioned in the docs:

from docarray import BaseDoc
from docarray.typing import NdArray
from vectordb import HNSWVectorDB

class LogoDoc(BaseDoc):
    embedding: NdArray[768]
    id: str

db = HNSWVectorDB[LogoDoc](
    workspace="hnsw_vectordb",
    space="ip",
    max_elements=2700000,
    ef_construction=256,
    M=16,
    num_threads=8,
)

if __name__ == "__main__":
    with db.serve() as service:
        service.block()

and then tried to push my vectors using the client interface.

I have a collection of 2.5M 768-dimensional vectors to store in the db, so I decided to make batched calls to the db.index method with 64k vectors per call. The code did not respond, so I reduced the batch size to 2; it then indexed at roughly 5 s/it, with an estimated total time of 27 hours. (I assume this happens because tree construction runs during each index call.)

It would be nice if we could speed up the process by letting the user push all the documents first and then trigger tree construction with a separate API call:

db.push_documents([doc1, doc2, doc3, ...])
db.build_tree()

which could replace

db.index()

and during the build process we could simply block CRUD operations with an is_building_tree flag and raise a TreeCurrentlyBuildingError when CRUD operations are attempted.
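
A minimal sketch of the guard this implies (push_documents, build_tree, TreeCurrentlyBuildingError, and the is_building_tree flag are all hypothetical names from this proposal, not existing vectordb API):

class TreeCurrentlyBuildingError(Exception):
    """Raised when CRUD operations arrive while the index is being built."""

class TwoPhaseHNSW:
    def __init__(self):
        self._pending = []                # documents pushed but not yet indexed
        self._is_building_tree = False

    def push_documents(self, docs):
        self._guard()
        self._pending.extend(docs)        # cheap append, no graph insertion yet

    def build_tree(self):
        self._guard()
        self._is_building_tree = True
        try:
            # the single bulk HNSW construction over self._pending goes here
            self._pending.clear()
        finally:
            self._is_building_tree = False

    def search(self, query):
        self._guard()                     # every CRUD call checks the flag first

    def _guard(self):
        if self._is_building_tree:
            raise TreeCurrentlyBuildingError()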


JoanFM commented Jul 24, 2023

Hey @TheSeriousProgrammer ,

Thanks for raising the issue.

Let's see if this solution could work for you.

Only for HNSW:

  1. Add an endpoint and API called push (this creates a list of Documents in memory for HNSW and behaves the same as index for the ExactNNInMemory).
  2. When index is called, make sure that all the cached (pushed) Documents are indexed before the input Documents.
  3. Add a potential build_index that has no effect for InMemory, but does for HNSWVectorDB. This will make sure that all pushed docs are indexed properly and then clear the cache.

Do you think this could work for your use case?
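
In client terms, the flow would look roughly like this (push and build_index follow the naming in the list above and are proposed, not released; client and docs are as in your snippet):

from more_itertools import chunked

# push is cheap: the server only caches the documents
for batch in chunked(docs, 64000):
    client.push(batch)

# a single call then pays the full HNSW construction cost
client.build_index()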


JoanFM commented Jul 24, 2023

Btw, you do not need to add id to your BaseDoc; BaseDoc already has an id field that you can set yourself or that is randomized for you.
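
For example (a minimal sketch; docarray fills in doc.id with a random ID when you do not pass one):

from docarray import BaseDoc
from docarray.typing import NdArray
import numpy as np

class LogoDoc(BaseDoc):
    embedding: NdArray[768]

doc = LogoDoc(embedding=np.zeros(768))
print(doc.id)  # random ID generated by docarray
doc2 = LogoDoc(id="logo-123", embedding=np.zeros(768))  # or set your own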

JoanFM linked pull request #52 on Jul 24, 2023 that will close this issue

JoanFM commented Jul 24, 2023

Could you give PR #52 a try?

It will not work in the cloud because it is not released, but locally with serve it should work and give a hint at whether it can solve your issue.

@TheSeriousProgrammer (Author)

Sure, will give it a shot


JoanFM commented Jul 25, 2023

Hello @TheSeriousProgrammer ,

Before jumping into this solution, I would like to explore another option that would keep things simpler:

  1. Use version 0.0.17.
  2. Use its more fine-grained control over the size of the batches the client passes to the vectordb for indexing.

So, you are telling me that you are passing 64k documents in each call, so you must be doing something like this:

from more_itertools import chunked

for batch in chunked(docs, 64000):
    client.index(batch)

This would indeed pass 64,000 documents to the client, but the client internally batches them into requests of size 1000 (meaning the vectordb will try to build the index 64 times), and the client will not return until all the calls are successful.

What you can do is pass the request_size parameter to the index method and tune it for the best performance. There may be a limit on the size of a single request that can be passed to the vectordb, as gRPC only accepts messages up to 2 GB.
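
As a rough estimate (assuming float32 embeddings), a single 64k request carries about 64,000 × 768 × 4 bytes ≈ 197 MB of raw vector data, so it should sit comfortably under the 2 GB gRPC cap, though protobuf framing adds some overhead on top.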

So you can try:

from more_itertools import chunked

for batch in chunked(docs, 64000):
    client.index(batch, request_size=64000) # edit this number to the largest value that does not fail

Could you give this approach a try and see if it satisfies your needs? It would let us avoid adding more methods and complicating the API/interface.

If you find it successful, I will add it to the README as documentation.

Thanks a lot for the help and patience.

@TheSeriousProgrammer (Author)

I tried the request_size workaround; the request timings still took a lot of time, now around 35 hours, don't know why. Will try the PR.


JoanFM commented Jul 25, 2023

> I tried the request_size workaround; the request timings still took a lot of time, now around 35 hours, don't know why. Will try the PR.

I will try to look into it.


JoanFM commented Jul 25, 2023

May I ask how you generate the embeddings? Or are they already computed?

@TheSeriousProgrammer (Author)

It's precomputed.


JoanFM commented Jul 25, 2023

What are your jina and docarray versions?


JoanFM commented Jul 26, 2023

Hey @TheSeriousProgrammer ,

Are you sure you are using vectordb 0.0.17?

I believe this very poor performance was due to a bug solved in that version, where the configuration passed to the db was not properly applied to the deployed server.

This ended up problematic because I believe most of the time is spent resizing the index (the max_elements parameter was not properly passed).

Can you please check the version?

Thanks
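
For reference, a quick way to check the installed version from Python (standard library only, no assumptions about vectordb's own API):

import importlib.metadata

print(importlib.metadata.version("vectordb"))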


JoanFM commented Aug 3, 2023

I believe that with the latest vectordb release and the latest docarray release, you should see better performance.


TheSeriousProgrammer commented Sep 12, 2023

Recently I was experimenting with other hosted db solutions, and one of the db providers suggested uploading the vectors from a VM in the same infra provider (AWS, GCP, Azure) and the same region as where the db is hosted. I initially thought this would not impact indexing performance much, since each push batch in my experiments was barely 100 vectors of 768 dimensions. But I was wrong: following the suggestion gave an indexing speedup of up to 9x.

I know this might be a no-brainer for highly experienced cloud devs, but that may not be the case for growing AI devs like me. Adding this to the README might help fellow AI devs a lot.

Possible explanation:
A batch of 100 vectors with dimension 768 is 100 × 768 × 32 bits ≈ 0.3 MB of raw data. But if these vectors are JSON-encoded for upload, the transmitted data easily blows up to ~1.2 MB (4x or possibly more), which can be a bottleneck for cross-region bulk uploads:

8 bits per character × 16 characters per float (a rough guess, including the extra characters required by the JSON format) × 768 × 100 ≈ 1.2 MB
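
A minimal sketch to reproduce the estimate locally (numpy assumed; the actual blow-up depends on how the client serializes requests, which may not be JSON at all):

import json
import numpy as np

vecs = np.random.rand(100, 768).astype(np.float32)

raw_size = vecs.nbytes                      # 100 * 768 * 4 bytes = 307,200 (~0.3 MB)
json_size = len(json.dumps(vecs.tolist()))  # full float repr plus delimiters

print(f"raw: {raw_size / 1e6:.2f} MB, json: {json_size / 1e6:.2f} MB")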
