Your suggestions sound reasonable. We'll start with an option to slice an inference session (`reuse_inference(old[start:end])`) - I hope to add it in one of the upcoming releases.
This is based on a discussion on Discord between me and @borzunov (in the webui discussion); I'm writing it up here so it doesn't get lost.
So consider a simple program using sessions:
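Something along these lines; this is a minimal sketch, assuming the usual Petals client API with `inference_session()` and `generate(..., session=...)`. The model name and the `[Input N]` strings are just placeholders.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

MODEL_NAME = "bigscience/bloom-560m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoDistributedModelForCausalLM.from_pretrained(MODEL_NAME)

with model.inference_session(max_length=512) as sess:
    # First command: the session is empty, so this processes [Input 1]
    # and generates [Output 1].
    input1 = tokenizer("[Input 1]", return_tensors="pt")["input_ids"]
    output1 = model.generate(input1, max_new_tokens=16, session=sess)

    # Second command: the session still holds [Input 1][Output 1] in its
    # attention caches, so generation effectively continues from
    # [Input 1][Output 1][Input 2].
    input2 = tokenizer("[Input 2]", return_tensors="pt")["input_ids"]
    output2 = model.generate(input2, max_new_tokens=16, session=sess)
```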
Currently the way the session API works is that it keeps history, so the first command generates [Output 1] and the second command then generates starting from [Input 1][Output 1][Input 2]. Compared to the usual transformers API this is quite restrictive; it is really only useful for chat-like applications where you can never go back and edit anything.

It would be a more powerful API if the `.generate()` calls instead acted as though they were unrelated/independent, and the session managed the reuse logic internally. For example, if you wanted the old behavior you would call `model.generate("[Input 1][Output 1][Input 2]")` in the second call, but if you didn't, you could still do `model.generate("[Input 2]")`. It is fairly cheap to process a buffer of tokens in Python and analyze it for potential reuse patterns.

As far as the reuse logic goes, I have developed the outline of a little algorithm that I think will work in most cases. It supports three main use cases:
- `a` then `ab` will reuse the blocks
- `ab` then `ac` will reuse `a`
- `abc` then `bcd` will drop `a`, then reuse `bc`, which can happen as you get long prompts that exceed the context length

Per Alexander B., this would actually be fairly easy to implement in petals, but currently it is not yet implemented.
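To make the reuse analysis concrete, here is a rough sketch of what the client-side matching could look like. This is only an illustration consistent with the three use cases above, not the exact algorithm from the Discord discussion: `plan_reuse` and its return convention are made up, and it says nothing about how the kept or dropped cache entries would actually be communicated to the servers.

```python
from typing import Sequence, Tuple

def plan_reuse(cached: Sequence, new: Sequence) -> Tuple[int, int]:
    """Decide how much of an existing session cache can serve a new, independent prompt.

    Returns (drop, reuse):
      drop  - tokens to discard from the *front* of the cache
      reuse - cached tokens (after dropping) that match a prefix of `new`
              and can be kept as-is
    Everything in the cache past drop + reuse is discarded, and only
    new[reuse:] has to be processed remotely.
    """
    best_drop, best_reuse = 0, 0
    for drop in range(len(cached) + 1):
        # Length of the longest common prefix of cached[drop:] and new.
        reuse = 0
        while (drop + reuse < len(cached)
               and reuse < len(new)
               and cached[drop + reuse] == new[reuse]):
            reuse += 1
        # Prefer reusing more tokens; ties go to dropping less.
        if reuse > best_reuse:
            best_drop, best_reuse = drop, reuse
    return best_drop, best_reuse

# The three use cases above, with single characters standing in for tokens:
assert plan_reuse(list("a"), list("ab")) == (0, 1)     # reuse `a`, compute `b`
assert plan_reuse(list("ab"), list("ac")) == (0, 1)    # reuse `a`, recompute from `c`
assert plan_reuse(list("abc"), list("bcd")) == (1, 2)  # drop `a`, reuse `bc`, compute `d`
```

The scan is quadratic in the worst case, but even for prompts of a few thousand tokens it is negligible next to the cost of running the remote blocks, which matches the point above that analyzing the token buffer in Python is cheap.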