/api/embeddings responds with 500 before Ollama is initialized - handle max queued requests failure better #4124
OS: Windows
@maxggl did you by any chance get a response body with the 500? I'm trying to repro but haven't quite found the combination of factors leading to this failure mode. Can you share any more insight into the sequence of API calls and bodies that results in the failure? My attempts against /api/embeddings always block until the model is loaded.
@dhiltgen Sure. Request body:
Crucially, the responses of the failed requests with status code 500 have this body: I can also confirm that the failures stop if fewer asynchronous API calls are made to the endpoint. So does this mean that version 0.1.33 introduced a limit on how many pending requests Ollama can handle? Thank you for the effort!
Taking a look at the changes from 0.1.32 to 0.1.33, I discovered this line at line 48 of the newly added sched.go. This probably has something to do with it, right?
Yes, this looks like the explanation. I'll get a PR up to refine this to both return a better HTTP status code, and also make the queue depth adjustable. |
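The fix direction described above can be sketched as follows (hypothetical names, not the actual sched.go logic): when the pending queue is full, reject with 503 Service Unavailable rather than 500, and expose the queue depth as a parameter.

```python
import queue

# Hypothetical server-side request queue; a sketch of the described fix,
# not Ollama's actual implementation.
class RequestQueue:
    def __init__(self, max_queued=512):  # queue depth made adjustable
        self._q = queue.Queue(maxsize=max_queued)

    def submit(self, request):
        try:
            self._q.put_nowait(request)
            return 200, "queued"
        except queue.Full:
            # 503 tells clients "try again later" more accurately than 500,
            # which implies an internal server fault
            return 503, "server busy: max queued requests exceeded"
```

Clients can then distinguish a transient "queue full" condition from a genuine server error and retry accordingly.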
What is the issue?
Hello,
please forgive the ambiguity of this report.
The issue I am encountering is the following:
Before updating to 0.1.33, I was running version 0.1.32.
I was running the server with embedding models for generating embeddings, using the langchain OllamaEmbeddings class.
I wrote a custom wrapper for asynchronous embeddings to speed up the time it takes to embed documents:
https://github.com/maxggl/rag-experiment/blob/main/get_embedding_function.py
With v0.1.32 everything was working fine, and all requests returned 200 after the model loaded:
But after updating to 0.1.33, the requests seem to return 500 because the model is not yet loaded, yet the server responds anyway; at least that is how it appears in the log:
Thank you for your help!
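Until the server-side behavior changes, one workaround for the transient 500s while the model loads (a sketch under assumed behavior; `call` is a hypothetical zero-argument function returning `(status, body)`): retry with exponential backoff.

```python
import time

# Hypothetical retry helper for transient 500s while the model loads.
def with_retries(call, max_attempts=5, base_delay=0.5):
    for attempt in range(max_attempts):
        status, body = call()
        if status != 500:  # assume anything but 500 is final
            return status, body
        # exponential backoff: 0.5s, 1s, 2s, ... before retrying
        time.sleep(base_delay * (2 ** attempt))
    return status, body
```

This trades latency for reliability: early requests wait out the model load instead of surfacing a 500 to the caller.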
OS
No response
GPU
No response
CPU
No response
Ollama version
0.1.33