
Fix/re-implement the memory condenser #1771

Closed
wants to merge 32 commits

Conversation

enyst
Collaborator

@enyst enyst commented May 13, 2024

This PR proposes a re-implementation of the short-term memory condensing process, currently used by the monologue agent. It should address @li-boxuan's comments on the previous PR.

CC @xingyaoww: I did not check CodeAct; this PR basically continues rewriting/refactoring the monologue code a little more generically.

Goals:

  • slightly better condensing, e.g.
    - multiple attempts to condense, to avoid ContextWindowLimitError
    - condense only recent events, not the default actions always included in the prompt
    - condense earlier memories more heavily
  • refactor (monologue) prompts to split:
    - default events ('thoughts' that the monologue agent learns in the initial prompt and that are supposed to be in every prompt from then on)
    - recent events (from the current session). Only recent events should be summarized; monologue was summarizing all of them.
  • an interface for agents to use the memory condenser
    • a simple interface: condense(), needs_condense(), get_token_limit() (see the sketch after this list)
    • not happy with some of it though; for example, init needs Callables which assume knowledge of tasks and background commands, and tasks aren't even used at all. It should more likely be some generate_prompt() which gives only what is needed, or just the size of the relevant prompts.
  • refactor monologue to move out all of the code for condensing, e.g. parsing summaries, checking tokens
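
A rough sketch of what that interface could look like (not the PR's actual code; the LLM helper methods and the prompt-building Callable are assumptions for illustration):

from typing import Callable, List

class MemoryCondenser:
    def __init__(self, llm, summarize_prompt_fn: Callable[[List], str]):
        self.llm = llm
        self.summarize_prompt_fn = summarize_prompt_fn

    def get_token_limit(self) -> int:
        # assumed helper on the LLM wrapper, returning the model's context size
        return self.llm.get_token_limit()

    def needs_condense(self, recent_events: List) -> bool:
        # compare a token estimate of the would-be summarize prompt to the limit
        prompt = self.summarize_prompt_fn(recent_events)
        return self.llm.get_token_count(prompt) > self.get_token_limit()

    def condense(self, default_events: List, recent_events: List) -> List:
        # summarize only the recent events; default events stay untouched
        ...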

@enyst enyst marked this pull request as draft May 13, 2024 23:35
@enyst enyst changed the title from "[WIP] Fix/re-implement the memory condenser" to "Fix/re-implement the memory condenser" May 14, 2024
@enyst enyst marked this pull request as ready for review May 14, 2024 22:18
@li-boxuan
Collaborator

Should address @li-boxuan comments on the previous PR.

Could you please remind me what the previous PR was 🤣 I have a very short memory and cannot wait to fix my brain with your condenser

@enyst
Collaborator Author

enyst commented May 14, 2024

Oh just this bit.
#1614 (comment)

@enyst
Collaborator Author

enyst commented May 15, 2024

All clean after restore, and they don't need anything, it seems. Monologue had changes only to the summarize prompt, which is not used.

The variables were in toml this time:

[core]
workspace_base="/Users/enyst/repos/workplace"
workspace_mount_path="/Users/enyst/repos/workplace"
workspace_mount_path_in_sandbox="/workplace"

It looks to me like just good old PEBKAC. I should have had a clean environment, I believe you documented that somewhere. 😊

@li-boxuan
Collaborator

li-boxuan commented May 15, 2024

I should have had a clean environment

That would be great, but it shouldn't be mandatory. I'll check whether the script misses masking any environment variable here. Thanks for the information!

Aha, yeah, we didn't mask the WORKSPACE_MOUNT_PATH_IN_SANDBOX variable in the script. I'll fix it!

UPDATE: already fixed!

Collaborator

@li-boxuan li-boxuan left a comment


This looks awesome to me in general! I notice you retry 3 times when attempting to condense. I wonder where the empirical number 3 is from.

Is there any chance you can show a demo of the new memory condenser? Not sure how hard it would be to add a new integration test, but screenshots alone would also be nice.

Review threads (resolved): agenthub/monologue_agent/agent.py, opendevin/memory/condenser.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
@enyst
Collaborator Author

enyst commented May 17, 2024

@li-boxuan Re: your question about 3 times: good question! My thinking is like this: we have a set of events, they're too large, and the last few matter most. Constraint: the summarize prompt might (and does, in practice!) run into ContextWindowExceeded itself if it's sent with all recent events.

The PR is attempting two naive approaches to make it work better:

  • condense half, repeat if still necessary (see the sketch after this list). I tested with 5 attempts and reduced it. If the need to condense arose when there were 30 events (which can happen, though not with GPT-4-1106 or the like), then 3 is almost too many times, because in the worst case (each half results in 1-2 summary events, but condensing is still needed), we're left with very few fully detailed events.
  • just tell the LLM to prioritize later events. I'm not sure that works well with the PR prompt. 😢 I think it stopped being worse than the current prompt. IMHO it has the potential to work better than some naive in-code approach, and I've considered dropping the first approach and just keeping and iterating on this one.
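
A rough illustration of the first approach (a sketch, not the PR's actual code; summarize() and fits_in_context() are placeholders for turning a slice of events into summary events and for the token check):

MAX_ATTEMPTS = 3  # the empirical retry count discussed here

def condense_with_retries(events, summarize, fits_in_context):
    # Repeatedly summarize the older half of the events until the result
    # fits the context window or we run out of attempts.
    for _ in range(MAX_ATTEMPTS):
        if fits_in_context(events):
            return events
        half = len(events) // 2
        # replace the older half with its summary; keep the recent half intact
        events = summarize(events[:half]) + events[half:]
    return events  # may still be too large; the caller handles the error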

Still, I think we need something to deal with the possibility that summarizing all events at the same time errors out, even if it's only one round with a subset. This PR doesn't attempt to produce a good summarization, just to make it work more often than it currently does (and take some more stuff out of monologue).

I don't know; perhaps @rbren or @xingyaoww can tell whether this PR makes it better or worse, or whether we should just cut monologue out until the condensing process is ready for CodeAct and then see, or some other alternative.

Side note:
What this PR also doesn't do, since it doesn't touch the last events, is deal with a funny case: the last observation being significantly larger than the rest. Like when GPT-4 escaped onto the internet and came back with more than 80k tokens - good times! It doesn't deal with it because that case should be handled now: we have code elsewhere that trims the output to a low number of characters (hard-coded 5000 for monologue, I believe), and we also have code that removes "heavy stuff" like screenshots that weren't supposed to be sent anyway. That should work, give or take that I've seen some weird things I can't replicate, like output from docker spilling over... somehow. Perhaps that was before all these other measures were put in place.
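
For reference, the trimming safeguard mentioned above amounts to something like this (a sketch with a made-up helper name; 5000 is the hard-coded cap mentioned for monologue):

MAX_OBSERVATION_CHARS = 5000

def truncate_observation(content: str, max_chars: int = MAX_OBSERVATION_CHARS) -> str:
    # trim an overly long observation so a single huge command output
    # cannot blow up the prompt on its own
    if len(content) <= max_chars:
        return content
    return content[:max_chars] + '\n[... output truncated ...]'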

@li-boxuan
Collaborator

li-boxuan commented May 18, 2024 via email

@rbren
Collaborator

rbren commented May 18, 2024

This LGTM overall!

We'll probably need to iterate on this a bunch, but the "condense the first half" strategy seems like a good idea.

@enyst enyst requested a review from xingyaoww May 18, 2024 18:23
Collaborator

@xingyaoww xingyaoww left a comment


Sorry for getting back so late - a lot of backlog to catch up on in the last few days!

I really like the general idea, and it pretty much aligns with what I think about this problem! Thanks a ton for implementing this!!!

Here are some of my thoughts on how the whole condenser could become more general and directly applicable to CodeAct (I mentioned some of them in the detailed review). It would be nice if we can do this generally and plug it into CodeAct:

Memory condenser can be lazier

The memory condenser can default to acting in a lazy manner (e.g., only work when there's a context window limit error thrown by the API). Because each OSS model might use a different tokenizer under the hood, it is probably hard to accurately estimate the number of tokens ahead of time (e.g., hard to know which model the user is using, since they may use an OpenAI-compatible API they host themselves, with a non-descriptive model name) - so it is better to be lazy and catch all the context window errors.

In addition to "condense on context window error", the user can optionally specify a max context limit (e.g., 32k) as a config.toml variable to save costs. We don't need to estimate the number of tokens as accurately (as in the previous case) to do this -- we can just use one tokenizer (e.g., Llama's or GPT's tiktoken) to get a rough estimate to control the cost.
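
A minimal sketch of the lazy behavior, assuming the LLM call raises litellm's ContextWindowExceededError and a condenser exposing a condense() method (the wrapper and helper names are illustrative, not the actual implementation):

from litellm.exceptions import ContextWindowExceededError

def completion_with_lazy_condense(llm, condenser, messages):
    # Only involve the condenser when the provider actually rejects the
    # request because the prompt exceeds the model's context window.
    try:
        return llm.completion(messages=messages)
    except ContextWindowExceededError:
        condensed = condenser.condense(messages, llm=llm)
        return llm.completion(messages=condensed)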

Potential ways to simplify the interface

I think there's some chance we might be able to simplify this condenser interface. The input to the condenser would be an LLM plus a list of "events". Each element of "events" can have an attribute called should_condense, set to False for every important message (e.g., the SYSTEM message and the initial instruction; intermediate user messages should be kept as much as possible but can be condensed - so maybe a notion of priority is needed). The output of the condenser would also be a list of "events", shorter than the input.
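
A sketch of that simplified shape (the Event dataclass below is a stand-in for illustration, not the project's actual event classes):

from dataclasses import dataclass

@dataclass
class Event:
    source: str                    # 'user', 'agent', or 'env'
    content: str
    should_condense: bool = True   # False for the system message, the initial instruction, ...

def condense(llm, events: list) -> list:
    # returns a shorter list of events; entries with should_condense=False
    # pass through unchanged, the rest may be summarized by the llm
    ...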

Potential ways to make the condense process more general

Here's how I think about the condensation process.

The example input can be something like:

USER: Plot a figure y=x using matplotlib for me.

AGENT (CmdRunAction): Sure, let me PIP install
<execute_bash>
pip install matplotlib
</execute_bash>

ENV (CmdRunObservation): [a bunch of logs]

AGENT (IPythonRunAction): Let me restart the jupyter kernel
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [success]

AGENT (IPythonRunAction): Let me plot the figure
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [figure]

AGENT (AgentMessage): Done! What's next

IMO the heuristic for condensation could be:

  1. It is important to keep messages from the beginning (system message & initial instruction should always be kept)
  2. More recent history is probably more important (e.g., what you attempted recently) than actions the agent attempted a few turns ago.
  3. You can try to lazily summarize things between the initial instruction AND the most recent message -- you should start with turns as early as possible (i.e., first turn with should_condense=True) for condensation.
  4. We can create a new Observation called SummarizedObservation, which is similar to a Message that contains a textual summary of what the agent did and what the agent observed.
  5. We should split the list of events into multiple chunks. Each chunk MUST start with an Action with source='agent' (the agent does something) - this is essentially the delimiter. This means we can condense each chunk into one sentence: "The agent did XXX, and got YYY as a response" (see the sketch after this list).
  6. The minimal unit of summarization can be such a chunk: the agent did something, then observed something. That is, you can summarize a list of events (a chunk) into one SummarizedObservation. In this way, as long as an agent can convert a SummarizedObservation into user and/or assistant messages for the chat_completion API, it can naturally support the memory condenser and get all the benefits!
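
As mentioned in point 5, the chunking step could look something like this (a sketch that assumes events expose a source attribute and an is_action flag; both names are assumptions, not the project's actual API):

def split_into_chunks(events):
    # Each chunk starts with an action whose source is 'agent'; the
    # observations that follow belong to that chunk. Leading user/system
    # events end up in a first chunk that the condenser should leave alone.
    chunks = []
    current = []
    for event in events:
        if getattr(event, 'is_action', False) and event.source == 'agent':
            if current:
                chunks.append(current)
            current = [event]
        else:
            current.append(event)
    if current:
        chunks.append(current)
    return chunks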

Based on the previous trajectory, here's an example of running condense ONCE:

The result can look something like:

USER: Plot a figure y=x using matplotlib for me.

---CONDENSED CONTENT---
AGENT: "[... Performed some actions ...]"
USER (SummarizedObservation): 
You have successfully used `pip` to install `matplotlib`. The output log looks pretty successful.
[[TODO: We can also save all these intermediate events before "condensation" to a directory inside the sandbox,
so that we can also include a message here like "the output log is in `/opendevin/logs/turn_2.log`"
so the agent can actually look back at the details when needed (long-term memory!)]]
---

AGENT (IPythonRunAction): Let me restart the jupyter kernel
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [success]

AGENT (IPythonRunAction): Let me plot the figure
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [figure]

AGENT (AgentMessage): Done! What's next

If after ONE condense run it still throws a context limit error, we can do it again (by appending the summarization from the next "chunk"):

The result then looks something like:

USER: Plot a figure y=x using matplotlib for me.

---CONDENSED CONTENT---
AGENT: "[... Performed some actions ...]"
USER (SummarizedObservation): 
1. You have successfully used `pip` to install `matplotlib`. The output log looks pretty successful.
2. You have successfully restarted the jupyter kernel in order to make `matplotlib` importable in the Jupyter notebook. The return value suggests success.
---

AGENT (IPythonRunAction): Let me plot the figure
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [figure]

AGENT (AgentMessage): Done! What's next

This is a very long comment ;) - Always happy to discuss more!

# summarize the short term history (events) if necessary
if self.memory_condenser.needs_condense(...):
    summary_response = self.memory_condenser.condense(
        summarize_prompt=prompt, llm=self.llm
    )
Collaborator


I'm actually wondering if we can go with a lazy approach: only involve the condenser if the chat_completion API throws a context window error (e.g., the error message contains "context window" - we then cast the error into a ContextWindowException and invoke the condenser?).

The issue with the litellm tokenizer is that it probably doesn't properly support OSS LLMs, which might have different tokenizers; we only know we exceeded the context window once we've issued the chat_completion request and received an exception.

Collaborator Author


OK! Litellm has a ContextWindowExceededError that it maps for quite a few providers: https://litellm.vercel.app/docs/exception_mapping#custom-mapping-list . It includes Hugging Face and OpenAI-compatible providers, but it doesn't include ollama. What they mean by OpenAI-compatible providers appears to be a fairly large list of local LLMs, served via LM Studio or similar software.

litellm uses the hugging_face tokenizer or the openai tokenizer. And wow, now that I'm looking at it again, just two weeks ago they added support for a custom hugging face tokenizer that the client code can pass in: BerriAI/litellm@2fd2e81

Totally agree anyway; I'm making this happen on error, or not using the tokenizer check by default.


Parameters:
- llm: LLM to be used for summarization.
- default_events: List of default events that should remain unchanged.
Collaborator


Maybe these default_events would be the "prefix" -- i.e., the thing that should never change would be the system message + the initial instruction? Or do you think it is better if we just pass in a list of events and add an additional attribute like can_summarize, set to False for these important messages? That way we can combine default_events and recent_events into one events list and simplify the interface a bit.

Collaborator Author


I see. I'm looking at CodeAct, and these default_events are kind of what CodeAct has as EXAMPLES, and maybe similar to GITHUB_MESSAGE and COMMAND_DOCS. For CodeAct these are included in the system prompt and that's it.

There is one difference though: for Monologue, they are Events:

  • it has two lists of "first lessons"
  • one is parsed as actual events, default_events ("initial thoughts": hey, it's waking up in this docker world 😄)
  • the other is documentation of the actions it knows
  • both are supposed to be in a "prefix" of sorts, that is, in the action prompt at all times.

Maybe this particular Monologue list is not even needed in the summarize prompt, only in the action prompt... I thought it makes sense to give the LLM context with examples here too, but I'm not too sure.

On the other hand, I was also thinking it's possible that we add others, not hard-coded in the prompts but obtained at runtime from another place (an "action library" or "skills" library). That may look like the GITHUB_MESSAGE, with or without an example. I thought that for monologue, this means it could also add some actions/observations to its "real-like" default_events list.

Review threads: opendevin/memory/condenser.py, agenthub/monologue_agent/utils/prompts.py
@xingyaoww
Collaborator

Btw, if this ^^ is completed, I think this PR has the potential to solve #1748

@SmartManoj SmartManoj mentioned this pull request May 19, 2024
@neubig
Contributor

neubig commented May 23, 2024

Hey @enyst , could you ping me when the conflicts are resolved and I'll take a look?

@enyst
Collaborator Author

enyst commented May 23, 2024

@neubig Thank you. I have a WIP to implement the behavior suggested by @xingyaoww and apply it to CodeAct. I think it's best to close this one and open a new PR, from a branch that was restarted almost clean, with a patch applied to main. (I'll also move it to a branch in the main repo.)

This was initially made with monologue in mind, but I don't know; applying the older version won't really work as such, because monologue is also cut off from the condenser at the moment, since we've updated it to not store its own event history in #1863. We can re-attach it when that's done and summarized events/messages are restored in the event stream. (I think?)

This got so interesting when I attempted to apply these insights to how it could work. I feel it deeply changes both the process and even whether the summary is an Action or an Observation 😅. The best part is, CodeAct itself is working on it!

@neubig
Contributor

neubig commented May 23, 2024

Cool, we can close this, but that's exciting!

@neubig neubig closed this May 23, 2024
@enyst enyst mentioned this pull request May 23, 2024