Chat: refactor Ollama chat client #3881
base: main
Conversation
Could you tell me more about the specific reason PromptString was regressing ollama prompts?
I believe there are still problems with ollama response decoding and this PR is on the right track. It just needs to go further in handling the incremental responses.
Specific feedback inline.
try {
    const { done, value } = await reader.read()
    if (typeof value === 'string') {
        const parsedData = JSON.parse(value) as OllamaGenerateResponse
This could be fragile. We shouldn't assume that the network response is chunked on JSON boundaries, either. We should split on newlines, and then decode. The protocol is a bit like Server-Sent Events: "packets" are terminated by newlines.
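For illustration, a minimal sketch of that newline-split decoding, assuming the reader yields decoded strings as in the snippet above (`onResponse` is a hypothetical callback; `OllamaGenerateResponse` is the type from the diff):

async function readPackets(
    reader: ReadableStreamDefaultReader<string>,
    onResponse: (response: OllamaGenerateResponse) => void
): Promise<void> {
    while (true) {
        const { done, value } = await reader.read()
        if (done) {
            break
        }
        // A single chunk may carry zero, one, or several newline-terminated packets.
        for (const packet of value.split('\n').filter(Boolean)) {
            onResponse(JSON.parse(packet) as OllamaGenerateResponse)
        }
    }
}

Note that this still assumes each chunk ends on a newline boundary; the buffering suggestion later in the thread closes that remaining gap.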
@dominiccooney hey Dom, sorry I haven't had a chance to work on this since I'm still busy with my sprint work, but I will try to pick this back up tomorrow. Will mark it as a draft until it's ready again. Appreciate your feedback!
@dominiccooney I've applied your feedback to the PR:
// Splits the decoded chunk on newlines and filters out empty strings.
if (value) {
    for (const chunk of value.split(RESPONSE_SEPARATOR).filter(Boolean)) {
        const line = JSON.parse(chunk) as OllamaGenerateResponse
Similar to the completions client, we should buffer the chunks in case they are not cut on JSON boundaries.
What do you think about reusing the response streaming/reading part from the completions client to keep this logic in one place?
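To illustrate the buffering idea, here is a hedged sketch (not the completions client's actual code) that carries a trailing partial packet across reads:

// Assumed separator; the PR uses the RESPONSE_SEPARATOR constant.
const RESPONSE_SEPARATOR = '\n'

let buffer = ''

function handleChunk(chunk: string, onPacket: (packet: OllamaGenerateResponse) => void): void {
    buffer += chunk
    const parts = buffer.split(RESPONSE_SEPARATOR)
    // The last element is '' if the chunk ended exactly on a packet
    // boundary, or an incomplete packet otherwise; either way, keep it
    // buffered until the next read completes it.
    buffer = parts.pop() ?? ''
    for (const packet of parts.filter(Boolean)) {
        onPacket(JSON.parse(packet) as OllamaGenerateResponse)
    }
}

Extracting this into a shared helper, as suggested, would let the chat and completions clients share a single implementation.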
@valerybugakov yeah, that'd be great! I will update this PR to reuse the one you created!
This PR fixes the issue of using PromptString as the prompt text in the messages we send to the LLM, which was causing regressions in chat responses from some Ollama models.
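As a rough sketch of the shape of the fix (names like `speaker`, `text`, and the `toString()` unwrap are assumptions about the surrounding types, not the PR's exact code):

// Unwrap PromptString values into plain strings when building the
// request, so the model receives the raw prompt text rather than a
// stringified wrapper object.
const requestMessages = messages.map(message => ({
    role: message.speaker === 'human' ? 'user' : 'assistant',
    content: message.text ? message.text.toString() : '',
}))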
Also updated the chat clients for Ollama and Groq with better error handling:
Log usage on completion for easier debugging (a rough sketch of this is shown below)
Ollama:
Groq:
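For the usage logging, a hedged sketch building on the decoding snippet above (Ollama's final packet reports `prompt_eval_count` and `eval_count`; `logDebug` is a stand-in for whatever logger the client uses):

if (line.done) {
    // The final packet carries token counts; surface them for debugging.
    logDebug('OllamaChatClient', 'usage', {
        promptTokens: line.prompt_eval_count,
        completionTokens: line.eval_count,
    })
}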
Test plan
Before
After