Add a reliability mechanism #45

Open
dgarnitz opened this issue Sep 8, 2023 · 1 comment

Comments

@dgarnitz
Owner

dgarnitz commented Sep 8, 2023

The Hugging Face, vector database (VDB) upload, and OpenAI embeddings workers all need a retry mechanism.

The queue system could be leveraged for this, with either a general retry queue at each stage or a separate one for each individual worker.

There should be logic to prevent retries when critical system components are down (like OpenAI's API or a vector DB's host).
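
One way to sketch the "don't retry while a dependency is down" idea is a small guard per external component that counts consecutive failures and pauses retries for a cooldown period. This is only an illustration, not VectorFlow code: the class name `DependencyGuard`, the threshold, and the cooldown are all assumptions.

```python
import time


class DependencyGuard:
    """Hypothetical helper: pauses retries while one dependency looks down."""

    def __init__(self, failure_threshold=3, cooldown_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, not from the issue
        self.cooldown_seconds = cooldown_seconds    # assumed value, not from the issue
        self.clock = clock                          # injectable so the guard can be tested
        self.consecutive_failures = 0
        self.opened_at = None                       # time the dependency was judged down

    def record_success(self):
        # Any success means the dependency is reachable again.
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Start (or restart) the cooldown window.
            self.opened_at = self.clock()

    def retries_allowed(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let an attempt through to probe the dependency.
        return self.clock() - self.opened_at >= self.cooldown_seconds
```

A worker would call `retries_allowed()` before putting a batch back on a queue, and `record_failure()` / `record_success()` after each call to the OpenAI API or the vector DB host.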

@dgarnitz
Owner Author

dgarnitz commented Nov 9, 2023

I recently added a basic retry mechanism to worker.py in this PR here. It's a naive implementation of retry, where the system retries a batch up to 3 times by putting it back on the embeddings queue.
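
For readers who have not looked at the PR, the naive approach described above can be sketched roughly as follows. This is not the code from worker.py: the function name `process_with_retry`, the `retries` field on the batch, and the use of Python's in-process `queue.Queue` in place of the real queue are assumptions made for illustration.

```python
import queue

MAX_RETRIES = 3  # from the comment above: a batch is retried up to 3 times


def process_with_retry(batch, embeddings_queue, embed):
    """Run embed(batch); on failure, put the batch back on the same queue.

    Returns "done", "requeued", or "failed" (hypothetical labels).
    """
    try:
        embed(batch)
        return "done"
    except Exception:
        # Naive: every exception is treated as retryable.
        retries = batch.get("retries", 0)
        if retries < MAX_RETRIES:
            # Requeue a copy with the retry counter incremented.
            embeddings_queue.put({**batch, "retries": retries + 1})
            return "requeued"
        return "failed"
```

The weakness is visible in the `except Exception` line: a missing API key is retried just like a timeout, and retried batches compete with fresh work on the same queue.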

What we need to do

  1. Create a retry queue for each existing queue.
  2. Create a dead letter queue (DLQ) that holds messages that have already been retried 3 times.
  3. Create a cron job or scheduled task that
    a) moves things from the retry queue back to the main queue
    b) queries batches that are more than 24 hours old and marks them as FAILED.
    I think this can run once per hour to start.
  4. Add logic to hugging_face/app.py and worker/vdb_upload_worker.py that puts failed batches onto the retry queue. Be selective about where and when you do this. If something fails because a key is missing or a connection URL is wrong, it shouldn't be retried. Retries probably only make sense for very specific types of exceptions.
  5. Alter the logic in worker.py to use the retry queue. If something has been retried the maximum number of times, put it onto the DLQ.
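
The steps above could be sketched along these lines. Everything here is illustrative: `route_failure`, `sweep`, the `RETRYABLE` tuple, the `retries` field, and the status strings other than FAILED are assumptions, and `queue.Queue` stands in for whatever queue the project actually uses.

```python
import queue
from datetime import datetime, timedelta, timezone

MAX_RETRIES = 3                    # step 2: the DLQ holds messages already retried 3 times
STALE_AFTER = timedelta(hours=24)  # step 3b: older batches are marked FAILED

# Assumption: which exception types count as transient is a project decision.
RETRYABLE = (TimeoutError, ConnectionResetError)
# Assumption: hypothetical statuses that the sweep should leave alone.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def route_failure(message, error, retry_queue, dlq):
    """Steps 4 and 5: choose between the retry queue, the DLQ, or no retry."""
    if not isinstance(error, RETRYABLE):
        # e.g. a missing key or a wrong connection URL: retrying cannot help.
        return "no_retry"
    retries = message.get("retries", 0)
    if retries >= MAX_RETRIES:
        dlq.put(message)
        return "dlq"
    retry_queue.put({**message, "retries": retries + 1})
    return "retry"


def sweep(retry_queue, main_queue, batches, now):
    """Step 3: the body of the hourly scheduled task."""
    moved = 0
    while True:
        try:
            message = retry_queue.get_nowait()
        except queue.Empty:
            break
        main_queue.put(message)  # 3a: back onto the main queue
        moved += 1
    for batch in batches:
        stale = now - batch["created_at"] > STALE_AFTER
        if stale and batch["status"] not in TERMINAL_STATUSES:
            batch["status"] = "FAILED"  # 3b
    return moved
```

Keeping the retry decision in one function means each worker only has to catch the exception and hand it over, so the list of retryable exception types lives in a single place.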

Other System Notes

VectorFlow currently has 4 queues:

  1. extraction - holds a pointer to the file that will be turned into batches
  2. embeddings - holds batches, which are turned into chunks and either embedded with OpenAI embeddings or passed to the Hugging Face model queue for embedding
  3. Hugging Face model queue - holds chunks for embedding with a Hugging Face sentence transformer model. Here the name of the queue is the name of the model
  4. vector database upload - holds chunks and vector embeddings that will be uploaded to a vector store
