[BUG] Read-Only Vectorstore with GCS persistence goes stale #2612

Open
rjrebel10 opened this issue Sep 21, 2023 · 7 comments
Labels
bug Something isn't working

Comments


rjrebel10 commented Sep 21, 2023

Severity

P0 - Critical breaking issue or missing functionality

Current Behavior

When running the Deeplake Vectorstore with a GCS path, any changes and commits made by a separate Deeplake instance on the same GCS path do not get picked up by the already running Deeplake Vectorstore instance.

Steps to Reproduce

  1. Run a Deeplake Vectorstore with a Google Cloud Storage path in read-only mode
  2. Run a separate Deeplake Vectorstore with the same GCS path and push some new data to the Vectorstore
  3. Perform a search with the first (read-only) Deeplake Vectorstore instance and check whether the new data is reflected. It typically is not (see the sketch after this list).
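
A minimal sketch of these steps, assuming the VectorStore API from the Deep Lake 3.x releases; the GCS path and embeddings below are placeholders, not taken from the original report.

```python
from deeplake.core.vectorstore import VectorStore

GCS_PATH = "gcs://my-bucket/my-vectorstore"  # hypothetical bucket path

# Step 1: long-running, read-only instance (e.g. serving search traffic).
reader = VectorStore(path=GCS_PATH, read_only=True)

# Step 2: a separate writer instance pushes new data to the same GCS path.
writer = VectorStore(path=GCS_PATH)
writer.add(
    text=["a brand new document"],
    embedding=[[0.1] * 1536],            # placeholder embedding vector
    metadata=[{"source": "new-batch"}],
)

# Step 3: search from the read-only instance. The new row is typically
# missing, because `reader` is still pinned to the version it loaded.
results = reader.search(embedding=[0.1] * 1536, k=5)
print(results["text"])
```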

Expected/Desired Behavior

A Deeplake Vectorstore with cloud persistence should periodically pick up and pull any changes made to the persisted data by another Vectorstore instance.

Alternatively, provide a refresh method to trigger any Deeplake Vectorstore to refresh its data from cloud persistence.

Python Version

No response

OS

No response

IDE

No response

Packages

No response

Additional Context

No response

Possible Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR (Thank you!)
rjrebel10 added the bug (Something isn't working) label on Sep 21, 2023
mikayelh (Collaborator) commented

hi @rjrebel10, apologies for the late follow-up on this - sorry you've run into an issue with Deep Lake. Worry not, I'm looping in @tatevikh who will advise further.

nvoxland (Contributor) commented

I'm wondering if it's related to the caching layer storing previously saved versions of files.

Does running your_vector_store.dataset.clear_cache() on the read-only instance make it start reading current data?
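
For readers following along, roughly what that suggestion looks like, reusing the hypothetical reader instance from the repro sketch above:

```python
# Drop the local cache on the read-only instance so subsequent reads go
# back to GCS, then repeat the search. (Per the follow-up below, this did
# not resolve the staleness in the reporter's case.)
reader.dataset.clear_cache()
results = reader.search(embedding=[0.1] * 1536, k=5)
```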

rjrebel10 (Author) commented

@nvoxland I tried the clear_cache() method and it did not work. It still only shows the stale data and does not see the new commit to the dataset.

irpepper commented Nov 2, 2023

+1

mikayelh (Collaborator) commented Nov 2, 2023

Hi @irpepper, do you mind sharing more information on your end that would help us troubleshoot? Also looping in @istranic for visibility.

kevroy314 commented

Seeing the behavior @rjrebel10 describes. Gotta work around it by basically re-downloading everything manually, which makes the connector not all that useful.

nvoxland (Contributor) commented Nov 8, 2023

What you are seeing is the currently expected behavior. When you load a dataset, you connect to it at that point in time, and the loaded instance remains consistent with that snapshot.

We are working on longer-term changes that will allow the data you get back from the dataset to remain fixed when you need it to be fixed and up to date when you need it to be up to date.

In the meantime, we're looking at adding a way to refresh a currently loaded dataset beyond simply calling ds = deeplake.load(...) again.
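
A sketch of that reload workaround, assuming the Deep Lake 3.x API; a fresh load or a re-instantiated Vectorstore picks up the latest committed version on GCS (path and embedding are placeholders):

```python
import deeplake
from deeplake.core.vectorstore import VectorStore

GCS_PATH = "gcs://my-bucket/my-vectorstore"  # hypothetical bucket path

# Dataset-level: discard the stale handle and load again.
ds = deeplake.load(GCS_PATH, read_only=True)

# Vectorstore-level: re-instantiate the read-only store and search again.
reader = VectorStore(path=GCS_PATH, read_only=True)
results = reader.search(embedding=[0.1] * 1536, k=5)
```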
