
Feature/streaming #314

Merged
merged 33 commits into main from feature/streaming on May 22, 2024

Conversation

gecBurton (Collaborator) commented May 7, 2024

Context

This is a POC to investigate:

  • how streaming works in langchain
  • if/how websockets integrate with fastapi
  • how chat-requests and responses might work with websockets

I have not:

  • tested this code
  • given any thought to security
  • considered how to limit the document retrieval to just those owned by the requestor
  • integrated this with the existing rag prompts/code

Changes proposed in this pull request

Guidance to review

I have included a small HTML page to demonstrate this working. Note that the chat is sent as text and the documents as binary, although this is just the first thing I thought of and not necessarily the best way to do it.

docker compose up core-api elasticsearch

http://0.0.0.0:5002/chat/chit-chat

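For anyone who can't run the stack locally, here is a rough, untested sketch of the shape of the endpoint (the `/chat/chit-chat` route matches the demo URL above, but the `message_history` payload, the OpenAI chat model, and the placeholder documents are illustrative assumptions, not the actual code in this PR):

import json

from fastapi import FastAPI, WebSocket
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

app = FastAPI()

# Stand-in chain: just enough structure to make the sketch runnable; the PR
# wires up its own prompt and model.
chat_chain = ChatPromptTemplate.from_messages(
    [MessagesPlaceholder(variable_name="chat_history"), ("user", "{input}")]
) | ChatOpenAI()


@app.websocket("/chat/chit-chat")
async def chit_chat(websocket: WebSocket) -> None:
    await websocket.accept()
    request = json.loads(await websocket.receive_text())
    *history, latest = request["message_history"]
    chat_history = [
        HumanMessage(content=m["text"]) if m["role"] == "user" else AIMessage(content=m["text"])
        for m in history
    ]

    # Retrieval runs first, so the sources can go back up front in a single
    # binary frame; a placeholder list stands in for the Elasticsearch lookup.
    documents = [{"url": "s3://example-bucket/some-document.pdf"}]
    await websocket.send_bytes(json.dumps(documents).encode("utf-8"))

    # The answer is then streamed token by token as text frames.
    async for chunk in chat_chain.astream({"chat_history": chat_history, "input": latest["text"]}):
        await websocket.send_text(chunk.content)

    await websocket.close()

The demo page then just opens a WebSocket to this route and appends each text frame to the page as it arrives.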

Relevant links

Things to check

  • I have added any new ENV vars in all deployed environments
  • I have tested any code added or changed
  • I have run integration tests

KevinEtchells (Contributor)

Thanks for your work on this @gecBurton! I'm not able to test locally (general issue running core-api, not to do with this work), but it looks promising.

Thinking in terms of how this looks to the user client-side, I think it'd make sense for the text to be streamed, and the documents to be sent all together as one chunk. I think Django should handle this though, so core-api just streams away. Not sure if using binary for the documents affects this. @brunns do you have any thoughts on this?

gecBurton (Collaborator, Author) commented May 7, 2024

> Thanks for your work on this @gecBurton! I'm not able to test locally (general issue running core-api, not to do with this work), but it looks promising.
>
> Thinking in terms of how this looks to the user client-side, I think it'd make sense for the text to be streamed, and the documents to be sent all together as one chunk. I think Django should handle this though, so core-api just streams away. Not sure if using binary for the documents affects this. @brunns do you have any thoughts on this?

@KevinEtchells It appears that the documents are the first object to be sent back, and all in one go (which I guess makes sense, as the R needs to be done before the AG?). Binary streams in the same way as text. But it might make more sense to wrap them in an identifier and stream both as text, i.e.:

{"resource_type": "documents", "data": [...]}
{"resource_type": "text", "data": "As a large language model I cannot..."}

this way the client can deal with the data as it chooses?
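For example, building on the endpoint sketch in the PR description, the send calls might become something like this (purely illustrative, reusing the hypothetical names from that sketch):

# Both the documents and each answer token go out as text frames, wrapped in
# a resource_type envelope (field names as in the example above).
await websocket.send_text(json.dumps({"resource_type": "documents", "data": documents}))

async for chunk in chat_chain.astream({"chat_history": chat_history, "input": latest["text"]}):
    await websocket.send_text(json.dumps({"resource_type": "text", "data": chunk.content}))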

KevinEtchells (Contributor) commented May 7, 2024

> > Thanks for your work on this @gecBurton! I'm not able to test locally (general issue running core-api, not to do with this work), but it looks promising.
> > Thinking in terms of how this looks to the user client-side, I think it'd make sense for the text to be streamed, and the documents to be sent all together as one chunk. I think Django should handle this though, so core-api just streams away. Not sure if using binary for the documents affects this. @brunns do you have any thoughts on this?
>
> @KevinEtchells It appears that the documents are the first object to be sent back, and all in one go (which I guess makes sense, as the R needs to be done before the AG?). Binary streams in the same way as text. But it might make more sense to wrap them in an identifier and stream both as text, i.e.:
>
> {"resource_type": "documents", "data": [...]}
> {"resource_type": "text", "data": "As a large language model I cannot..."}
>
> this way the client can deal with the data as it chooses?

Great that the documents all come in one go, and it makes sense in terms of what you have available first. We'll likely be displaying the resources last and therefore want to prioritise getting the text in as soon as possible, but whether sending documents first actually has any noticeable effect on performance would need to be tested. So probably best to stick to this default behaviour for now.
The wrappers sound like a good idea!

brunns (Collaborator) commented May 7, 2024

Just checking - we are going to be using a json structure here, even though it'll be coming over in chunks?

gecBurton (Collaborator, Author) commented May 7, 2024

> Just checking - we are going to be using a json structure here, even though it'll be coming over in chunks?

@brunns everything is open to change.

Currently the text is streamed as text and the documents are sent in one binary chunk.

We are considering whether the documents could/should also be sent as text, but then the question is "as a client, how do I know if the chunk I have just received is a document or a piece of text?"

brunns (Collaborator) commented May 7, 2024

I see the issue. Trouble is, partial JSON is a real pain to deal with too. How about: first, documents are sent chunk by chunk, followed by a known delimiter, followed by AI messages.
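For illustration, client-side handling of that delimiter scheme might look roughly like this (the sentinel value and the helper are assumptions for the sketch, not anything in this PR):

from typing import AsyncIterable

END_OF_DOCUMENTS = "<<END_OF_DOCUMENTS>>"  # illustrative sentinel


async def read_stream(frames: AsyncIterable[str]) -> tuple[list[str], str]:
    # Collect document frames until the sentinel arrives, then treat every
    # later frame as a piece of the streamed answer.
    document_frames: list[str] = []
    answer_parts: list[str] = []
    reading_documents = True
    async for frame in frames:
        if reading_documents and frame == END_OF_DOCUMENTS:
            reading_documents = False
        elif reading_documents:
            document_frames.append(frame)
        else:
            answer_parts.append(frame)
    return document_frames, "".join(answer_parts)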

KevinEtchells (Contributor)

> I see the issue. Trouble is, partial JSON is a real pain to deal with too. How about: first, documents are sent chunk by chunk, followed by a known delimiter, followed by AI messages.

That's a good point. We want to avoid JSON for the main text. Documents are less critical (especially if coming in one chunk), but I think @brunns' suggestion is a good one.

gecBurton (Collaborator, Author) commented May 7, 2024

> > I see the issue. Trouble is, partial JSON is a real pain to deal with too. How about: first, documents are sent chunk by chunk, followed by a known delimiter, followed by AI messages.
>
> That's a good point. We want to avoid JSON for the main text. Documents are less critical (especially if coming in one chunk), but I think @brunns' suggestion is a good one.

I was thinking that the chunks themselves could be JSON-encoded, i.e.:

chunk 1

{"resource_type": "text", "data": "As a large language model"}

chunk 2

{"resource_type": "text", "data": " I cannot comment"}

chunk 3

{"resource_type": "documents", "data": [{"url": "s3://amazon.sts/redboxdata/myDocyment.pdf"}]}

chunk 4

{"resource_type": "text", "data": " on whatever it is you just asked me."}

this way:

  • nothing is partial
  • it is always clear whether the chunk is:
    • The Documents
    • A piece of text

Or we could stick to the binary=documents, text=chat distinction?
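For illustration, a client consuming that per-chunk JSON format could dispatch on the wrapper like this (a rough sketch only; the request payload shape, the question text from the example above, and the websockets dependency are assumptions):

import asyncio
import json

import websockets


async def consume(url: str) -> None:
    async with websockets.connect(url) as ws:
        # Hypothetical request payload; the real ChatRequest shape may differ.
        await ws.send(json.dumps({"message_history": [{"role": "user", "text": "Does Rishi Sunak have a cat?"}]}))
        async for frame in ws:
            message = json.loads(frame)
            if message["resource_type"] == "documents":
                print("\nsources:", message["data"])
            elif message["resource_type"] == "text":
                print(message["data"], end="", flush=True)


asyncio.run(consume("ws://0.0.0.0:5002/chat/chit-chat"))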

brunns (Collaborator) commented May 7, 2024

I'd be very happy with that. The text might come in multiple chunks, right?

gecBurton (Collaborator, Author)

> I'd be very happy with that. The text might come in multiple chunks, right?

yep, almost certainly

MessagesPlaceholder(variable_name="chat_history"),
("user", "{input}"),
(
    "user",
Contributor:

Is this a system prompt?

Contributor:

(This question might be moot when integrating with existing code from redbox/llm/prompts/chat.py.)

while True:
    chat_request = ChatRequest.parse_raw(await websocket.receive_text())
    chat_history = [
        HumanMessage(content=x.text) if x.role == "user" else AIMessage(content=x.text)
Contributor:

Can there be SystemMessages or will they have been contracted by the chains?

rachaelcodes (Contributor)

An interesting UX question - do we want users to be able to see the generated question with their responses? (e.g. 'Does Rishi Sunak have a cat?' in George's example) Perhaps it could be something that could be revealed behind a click?

Or would it be better to show just the questions they wrote, the answers and the sources?

// Listen for messages
ws.addEventListener("message", (event) => {
  if (event.data instanceof ArrayBuffer) {
    document.getElementById('documents').innerHTML = JSON.stringify(JSON.parse(decoder.decode(event.data)), null, 2);
Contributor:

Do we want to be keeping a record of the sources throughout the chat rather than just the most recent? If so, would this be embedded with the relevant chat answer or all in one block at the end?

core_api/src/routes/chat.py (outdated review thread, resolved)
@brunns brunns marked this pull request as ready for review May 22, 2024 11:47
@brunns brunns requested a review from rachaelcodes May 22, 2024 11:47
communicator.scope["user"] = carol
connected, subprotocol = await communicator.connect()
assert connected
with patch("redbox_app.redbox_core.consumers.connect", new=mocked_connect):
Collaborator (Author):

👍

@brunns brunns merged commit ce662c7 into main May 22, 2024
9 checks passed
@brunns brunns deleted the feature/streaming branch May 22, 2024 13:32