Feature Description

Hi,

I'd like to request a feature that exposes the raw response generated by the LLM through the StreamingAgentChatResponse returned by any of llama_index's chat engines when called in streaming mode. Specifically, we can currently implement a custom LLM that returns a CompletionResponse with both the text delta and the raw response set as keyword arguments. For example, such a function might be written as:
```python
@llm_completion_callback()
def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
    text = ''
    for response in self.model_endpoint.generate_stream(prompt, <other_keyword_arguments>):
        text += response.token.text
        yield CompletionResponse(text=text, delta=response.token.text, raw=response.json())
```
I can't remember whether the documentation says this raw field needs to be a JSON string or a dict, but it's easy to convert between the two with json.dumps or json.loads as necessary.
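Either way, a small pair of helpers (hypothetical, not part of llama_index) can normalize whichever form arrives:

```python
import json

def raw_as_dict(raw):
    """Return the raw payload as a dict, whether it arrived as a JSON string or a dict."""
    return json.loads(raw) if isinstance(raw, str) else raw

def raw_as_json(raw):
    """Return the raw payload as a JSON string, whichever form it arrived in."""
    return json.dumps(raw) if isinstance(raw, dict) else raw
```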
Unfortunately, even if we pass the raw response into the CompletionResponse, that information is lost in the StreamingAgentChatResponse class. Specifically, when awrite_response_to_history is called, it loops through the achat_stream generator to pull information from each CompletionResponse into the agent. This places a lock on achat_stream, so we cannot access it anywhere else in the code. The usual solution is to move the information into a queue, as is already done for the delta, which makes the token text accessible to downstream applications. The full raw response, however (for example the generated_text, top_tokens, token metadata, or any special information the LLM returns), is never put into a queue, so none of the non-token information can be retrieved.
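To make the problem concrete, here is a heavily simplified sketch of that consuming loop (not llama_index's actual code; the CompletionResponse stand-in below is a plain dataclass): the delta is forwarded to a queue, while the raw payload is read but never stored:

```python
import queue
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CompletionResponse:
    """Stand-in for llama_index's CompletionResponse, reduced to the relevant fields."""
    text: str
    delta: str = ""
    raw: Optional[Any] = None

def write_response_to_history(chat_stream, delta_queue: queue.Queue) -> str:
    """Simplified sketch of the consuming loop: only the delta survives."""
    final_text = ""
    for chunk in chat_stream:
        final_text = chunk.text
        delta_queue.put(chunk.delta)  # the delta is forwarded to downstream consumers...
        # ...but chunk.raw is visible here and never stored anywhere, so once
        # the generator is exhausted the raw payloads are gone for good.
    return final_text
```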
I'm hoping this class can gain functionality that allows the raw response to be accessed, assuming the LLM returned that information in the CompletionResponse. This would entail:
- Adding a new queue (e.g. an asyncio.Queue) that stores the raw dictionaries, plus the corresponding helper methods and if-clauses.
- Updating awrite_response_to_history (and presumably write_response_to_history) to put the raw info onto the new raw queue(s).
- Adding a new async_raw_response_gen method that works exactly like the existing generators, just over raw payloads instead of deltas.
Reason
Currently, this cannot be supported in async mode. The awrite_response_to_history method locks the achat_stream generator, which is the only place the raw response lives, so we cannot obtain it manually. Within awrite_response_to_history that information is ignored and never stored anywhere, so we lose it despite having successfully passed it all the way up to this point.
I have tried extending the StreamingAgentChatResponse object, but unfortunately that triggers a cascade of further extensions: every chat engine class that should use the extended object must itself be extended. Each class I updated led me to three more, until I gave up and decided to ask for a feature update instead.
Value of Feature
The primary purpose of this feature is to enable "pass-through" of a TGI endpoint. Currently, llama_index lets us wrap a TGI endpoint to give it access to a RAG system for information it hadn't seen in training. But because the raw information is lost, we cannot make the RAG system itself look like a TGI endpoint. That matters if we want to insert the RAG system as a 'man in the middle' behind a front end that expects a TGI endpoint, such as Hugging Face's Chat UI. Technically, we can make this hookup work by wrapping our text token response in a bunch of dummy values that mimic the TGI endpoint interface, but if we ever did need the real values we'd be back at square one.
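The "dummy values" wrapping might look something like this sketch, with field names following TGI's documented stream response shape; everything other than token.text is a made-up placeholder, which is exactly the problem:

```python
import json
from typing import Optional

def fake_tgi_chunk(delta: str, generated_text: Optional[str] = None) -> str:
    """Wrap a bare text delta in a TGI-shaped SSE stream payload.

    Field names follow TGI's documented stream response; every value other
    than token.text is a dummy, since the real raw response was lost upstream.
    """
    payload = {
        "token": {"id": 0, "text": delta, "logprob": 0.0, "special": False},
        "generated_text": generated_text,  # only set on the final chunk
        "details": None,
    }
    return f"data:{json.dumps(payload)}\n\n"
```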
More generally, being able to bubble up the raw response from the LLM would let us build a server that cleanly matches the documented TGI API, which would allow us to connect a RAG system we build to any downstream application that expects a TGI endpoint.
FWIW, our current workaround for this issue is to change our LLM's stream_complete method to pass the stringified raw response as the delta text. That is pretty hacky, isn't how those variables are meant to be used, and may be breaking other functionality we aren't using yet but might want in the future.
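In sketch form (using plain dicts in place of the real CompletionResponse and model stream, purely for illustration), the workaround smuggles the whole raw chunk through the delta field:

```python
import json

def hacky_stream_complete(model_stream):
    """Sketch of the workaround: pass the raw payload through the delta field
    as a JSON string. Downstream code must json.loads each delta and dig the
    token text out itself -- which is exactly why this is fragile."""
    text = ""
    for response in model_stream:
        text += response["token"]["text"]
        # delta no longer holds the token text; it holds the whole raw chunk
        yield {"text": text, "delta": json.dumps(response), "raw": None}
```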