[Feature Request]: StreamingAgentChatResponse Raw TGI Response Access #13524

Open
sshearing opened this issue May 15, 2024 · 2 comments
Labels: enhancement (New feature or request), triage (Issue needs to be triaged/prioritized)


@sshearing

Feature Description

Hi,

I'd like to request a feature that exposes the raw response generated by the LLM through the StreamingAgentChatResponse returned by any of llama_index's chat engines when they are called in streaming mode. Currently, we can implement a custom LLM whose stream_complete method returns CompletionResponse objects with both the text delta and the raw response set as keyword arguments. For example, the method might be written as follows:

    # Inside a custom LLM subclass; imports assumed:
    # from llama_index.core.llms import CompletionResponse, CompletionResponseGen
    # from llama_index.core.llms.callbacks import llm_completion_callback
    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        text = ""
        for response in self.model_endpoint.generate_stream(prompt, <other_keyword_arguments>):
            text += response.token.text
            yield CompletionResponse(text=text, delta=response.token.text, raw=response.json())

I can't remember whether the documentation expects raw to be a JSON string or a dict, but it's easy to convert between the two with json.dumps or json.loads as necessary.
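
For example, converting back and forth is a one-liner either way (using the response object from the loop above):

    import json

    raw_str = response.json()             # JSON string from the endpoint client
    raw_dict = json.loads(raw_str)        # string -> dict
    raw_str_again = json.dumps(raw_dict)  # dict -> string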

Unfortunately, even if we pass the raw response into the CompletionResponse, the information is lost inside the StreamingAgentChatResponse class. Specifically, when awrite_response_to_history is called, it loops over the achat_stream generator to pull information from each CompletionResponse into the agent. This consumes (and effectively locks) achat_stream, so we cannot read it anywhere else in our code. The usual solution is to move the information into a queue, which is exactly what happens for the delta text, making it accessible to downstream applications. The full raw response, however (for example the generated_text, top_tokens, token details, or any special information the LLM returns), is never put into a queue, so none of the non-token information can be retrieved.
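
For context, here is a minimal sketch of how the streaming response is consumed today; chat_engine stands in for any llama_index chat engine created with the custom LLM above. Only the delta text comes out, because only deltas are put on the queue:

    # Only token deltas are reachable here; the raw payload the custom LLM
    # attached to each CompletionResponse is dropped inside the class.
    response = await chat_engine.astream_chat("What's new in the docs?")
    async for delta in response.async_response_gen():
        print(delta, end="")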

I am hoping this class can gain functionality that allows the raw response to be accessed, assuming the LLM returned that information in the CompletionResponse. This would entail (see the sketch after this list):

Adding a new async queue (and matching sync queue) that stores the raw dictionaries, plus the corresponding helper methods and conditional handling.
Updating awrite_response_to_history (and presumably write_response_to_history as well) to push the raw info onto the new raw queue(s).
Adding a new "async_raw_response_gen" method that works exactly like the existing generators, except that it yields raw payloads rather than deltas.
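
A rough sketch of what these additions might look like is below. The raw_aqueue field and the class skeleton are illustrative only, not the current llama_index implementation; async_raw_response_gen is just the proposed name from the list above.

    import asyncio
    from dataclasses import dataclass, field
    from typing import Any, AsyncGenerator, Optional

    @dataclass
    class StreamingAgentChatResponse:
        # The real class has many more fields (delta queues, events, sources,
        # chat history); only what matters for this proposal is shown.
        achat_stream: Optional[Any] = None  # async generator of ChatResponse objects
        raw_aqueue: asyncio.Queue = field(default_factory=asyncio.Queue)  # proposed

        async def awrite_response_to_history(self, memory: Any) -> None:
            async for chat in self.achat_stream:
                # existing behaviour puts chat.delta on the delta queue;
                # the proposed addition also keeps the raw payload, if any
                if chat.raw is not None:
                    self.raw_aqueue.put_nowait(chat.raw)
                # ... existing delta-queue and chat-history handling ...
            self.raw_aqueue.put_nowait(None)  # sentinel: stream finished

        async def async_raw_response_gen(self) -> AsyncGenerator[Any, None]:
            # mirrors async_response_gen(), but yields raw payloads
            while True:
                raw = await self.raw_aqueue.get()
                if raw is None:
                    break
                yield raw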

Reason

Currently, this cannot be done in async mode. The awrite_response_to_history method consumes (and locks) the achat_stream generator, which is where the raw response lives, so we cannot obtain it manually. Inside awrite_response_to_history that information is ignored and never stored anywhere, so it is lost even though we were able to pass it all the way up to this point.

I have tried subclassing StreamingAgentChatResponse, but unfortunately this triggers a cascade of further extensions: every chat engine class that should return the extended object has to be subclassed as well. When I attempted this, each class I updated led me to three more that also needed updating, until I gave up and decided to ask for a feature instead.

Value of Feature

The primary purpose of this feature would be to enable "pass-through" of the TGI endpoint. Currently, llama_index lets us wrap a TGI endpoint to give it access to a RAG system for information it hadn't seen in training. But because the raw information is lost, we cannot make the RAG system itself look like a TGI endpoint. This is a problem if we want to insert the RAG system as a 'man-in-the-middle' in front of a client that expects a TGI endpoint, such as Hugging Face's Chat UI. Technically, we can make this work by wrapping our token text in a set of dummy values that mimic the TGI interface, but if we ever actually needed the real values we'd be back at square one.

More generally, being able to bubble up the raw response from the LLM would let us build a server that cleanly mimics the documented TGI interface, which would allow any RAG system we build to be connected to any downstream application that expects a TGI endpoint.
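
For illustration, a rough sketch of the kind of pass-through this would enable, assuming the proposed async_raw_response_gen from above exists and that raw is a dict; the SSE chunk format here is approximate, so check the TGI docs for the exact schema of your server version:

    import json

    async def stream_tgi_passthrough(chat_engine, prompt: str):
        # forward the original payload (token ids, logprobs, special tokens,
        # generated_text, ...) instead of rebuilding it from the delta text
        # with dummy values
        response = await chat_engine.astream_chat(prompt)
        async for raw in response.async_raw_response_gen():
            yield f"data: {json.dumps(raw)}\n\n"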

sshearing added the enhancement and triage labels on May 15, 2024
@sshearing (Author)

FWIW, our current workaround is to change our LLM's stream_complete method to pass the serialized raw response as the delta text. This is pretty hacky, not how those fields are meant to be used, and may break other functionality we aren't using yet but might want in the future.
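
For reference, a minimal sketch of that workaround, reusing the custom stream_complete from the issue description (same imports as above; **kwargs stands in for the endpoint's keyword arguments):

    # Hacky workaround: smuggle the serialized raw response through the
    # delta field so it survives the delta queue. Downstream code must then
    # json.loads() each "delta" instead of treating it as token text, which
    # breaks anything else that expects real token deltas.
    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        text = ""
        for response in self.model_endpoint.generate_stream(prompt, **kwargs):
            text += response.token.text
            raw_str = response.json()  # serialized raw chunk from the endpoint
            yield CompletionResponse(text=text, delta=raw_str, raw=raw_str)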

@logan-markewich (Collaborator)

This feels pretty complicated to implement (and is admittedly lower priority). If you have an idea for an implementation, I welcome a PR.
