
Fix/re-implement the memory condenser #1771

Closed
wants to merge 32 commits

Conversation

enyst
Collaborator

@enyst enyst commented May 13, 2024

This PR proposes a re-implementation of the short-term memory condensing process, currently used by the monologue agent. It should address @li-boxuan's comments on the previous PR.

CC @xingyaoww: I did not check CodeAct; this PR basically continues rewriting/refactoring the monologue code a little more generically.

Goals:

  • slightly better condensing, e.g.
    - multiple attempts to condense, to avoid ContextWindowLimitError
    - condense only recent events, not the default actions always included in the prompt
    - condense earlier memories more heavily
  • refactor (monologue) prompts to split:
    - default events ('thoughts' that the monologue agent learns in the initial prompt and that are supposed to be in every prompt from then on)
    - recent events (from the current session). Only recent events should be summarized; monologue was summarizing all of them.
  • an interface for agents to use the memory condenser
    • a simple interface: condense(), needs_condense(), get_token_limit() (see the sketch after this list)
    • not happy with some of it though; for example, init needs Callables which assume knowledge of tasks and background commands, and tasks aren't even used at all. It should more likely be some generate_prompt() which gives only what is needed, or just the size of the relevant prompts.
  • refactor monologue to move out all of the code for condensing, e.g. parsing summaries, checking tokens
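
A rough sketch of what that interface could look like (not the PR's actual code; the LLM helper methods and the prompt-building Callable are assumptions for illustration):

from typing import Callable, List

class MemoryCondenser:
    def __init__(self, llm, summarize_prompt_fn: Callable[[List], str]):
        self.llm = llm
        self.summarize_prompt_fn = summarize_prompt_fn

    def get_token_limit(self) -> int:
        # assumed helper on the LLM wrapper, returning the model's context size
        return self.llm.get_token_limit()

    def needs_condense(self, recent_events: List) -> bool:
        # compare a token estimate of the would-be summarize prompt to the limit
        prompt = self.summarize_prompt_fn(recent_events)
        return self.llm.get_token_count(prompt) > self.get_token_limit()

    def condense(self, default_events: List, recent_events: List) -> List:
        # summarize only the recent events; default events stay untouched
        ...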

@enyst enyst marked this pull request as draft May 13, 2024 23:35
@enyst enyst changed the title from "[WIP] Fix/re-implement the memory condenser" to "Fix/re-implement the memory condenser" May 14, 2024
@enyst enyst marked this pull request as ready for review May 14, 2024 22:18
@li-boxuan
Collaborator

Should address @li-boxuan comments on the previous PR.

Could you please remind me what the previous PR was 🤣 I have a very short memory and cannot wait to fix my brain with your condenser

@enyst
Collaborator Author

enyst commented May 14, 2024

Oh just this bit.
#1614 (comment)

@enyst
Collaborator Author

enyst commented May 15, 2024

All clean after restore, and they don't need anything, it seems. Monologue had changes only to the summarize prompt, which is not used.

The variables were in toml this time:

[core]
workspace_base="/Users/enyst/repos/workplace"
workspace_mount_path="/Users/enyst/repos/workplace"
workspace_mount_path_in_sandbox="/workplace"

It looks to me like just good old PEBKAC. I should have had a clean environment, I believe you documented that somewhere. 😊

@li-boxuan
Collaborator

li-boxuan commented May 15, 2024

I should have had a clean environment

That would be great, but it shouldn't be mandatory. I'll check whether the script misses masking any environment variable here. Thanks for the information!

Aha, yeah, we didn't mask the WORKSPACE_MOUNT_PATH_IN_SANDBOX variable in the script. I'll fix it!

UPDATE: already fixed!

Collaborator

@li-boxuan li-boxuan left a comment


This looks awesome to me in general! I notice you retry 3 times when attempting to condense. I wonder where the empirical number 3 is from.

Is there any chance you can show a demo of the new memory condenser? Not sure how hard it would be to add a new integration test, but screenshots alone would also be nice.

Review threads (resolved): agenthub/monologue_agent/agent.py, opendevin/memory/condenser.py
Co-authored-by: Boxuan Li <liboxuan@connect.hku.hk>
@enyst
Collaborator Author

enyst commented May 17, 2024

@li-boxuan Re: your question about 3 times: good question! My thinking is like this: we have a set of events, they're too large, and the last few matter most. Constraint: the summarize prompt might (and does, in practice!) run into ContextWindowExceeded itself if it's sent with all recent events.

The PR is attempting two naive approaches to make it work better:

  • condense half, repeat if still necessary (see the sketch after this list). I tested with 5 attempts and reduced it. If the need to condense arose when there were 30 events (which can happen, though not with GPT-4-1106 or the like), then 3 is almost too many times, because in the worst case (each half results in 1-2 summary events, but condensing is still needed), we're left with very few fully detailed events.
  • just tell the LLM to prioritize later events. I'm not sure that works well with the PR prompt. 😢 I think it stopped being worse than the current prompt. IMHO it has the potential to work better than some naive in-code approach, and I've considered dropping the first approach and just keeping and iterating on this one.
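
A rough illustration of the first approach (a sketch, not the PR's actual code; summarize() and fits_in_context() are placeholders for turning a slice of events into summary events and for the token check):

MAX_ATTEMPTS = 3  # the empirical retry count discussed here

def condense_with_retries(events, summarize, fits_in_context):
    # Repeatedly summarize the older half of the events until the result
    # fits the context window or we run out of attempts.
    for _ in range(MAX_ATTEMPTS):
        if fits_in_context(events):
            return events
        half = len(events) // 2
        # replace the older half with its summary; keep the recent half intact
        events = summarize(events[:half]) + events[half:]
    return events  # may still be too large; the caller handles the error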

Still, I think we need something to deal with the possibility that summarizing all events at the same time errors out, even if it's only one round with a subset. This PR doesn't attempt to produce a good summarization, just to make it work more often than it currently does (and take some more stuff out of monologue).

I don't know; perhaps @rbren or @xingyaoww can tell whether this PR makes it better or worse, or whether we should just cut monologue out until the condensing process is ready for CodeAct and then see, or some other alternative.

Side note:
What this PR also doesn't do, since it doesn't touch the last events, is deal with a funny case: the last observation being significantly larger than the rest. Like when GPT-4 escaped onto the internet and came back with more than 80k tokens - good times! It doesn't deal with it because that case should be handled now: we have code elsewhere that trims the output to a low number of characters (hard-coded 5000 for monologue, I believe), and we also have code that removes "heavy stuff" like screenshots that weren't supposed to be sent anyway. That should work, give or take that I've seen some weird things I can't replicate, like output from docker spilling over... somehow. Perhaps that was before all these other measures were put in place.
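
For reference, the trimming safeguard mentioned above amounts to something like this (a sketch with a made-up helper name; 5000 is the hard-coded cap mentioned for monologue):

MAX_OBSERVATION_CHARS = 5000

def truncate_observation(content: str, max_chars: int = MAX_OBSERVATION_CHARS) -> str:
    # trim an overly long observation so a single huge command output
    # cannot blow up the prompt on its own
    if len(content) <= max_chars:
        return content
    return content[:max_chars] + '\n[... output truncated ...]'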

@li-boxuan
Collaborator

li-boxuan commented May 18, 2024 via email

@rbren
Collaborator

rbren commented May 18, 2024

This LGTM overall!

We'll probably need to iterate on this a bunch, but the "condense the first half" strategy seems like a good idea.

@enyst enyst requested a review from xingyaoww May 18, 2024 18:23
Collaborator

@xingyaoww xingyaoww left a comment


Sorry for getting back so late - a lot of backlog to catch up on in the last few days!

I really like the general idea, and it pretty much aligns with what I think about this problem! Thanks a ton for implementing this!!!

Here are some of my thoughts on how the whole condenser could become more general and directly applicable to CodeAct (I mentioned some of them in the detailed review). It would be nice if we can do this generally and plug it into CodeAct:

Memory condenser can be lazier

The memory condenser can default to acting in a lazy manner (e.g., only work when there's a context window limit error thrown by the API). Because each OSS model might use a different tokenizer under the hood, it is probably hard to accurately estimate the number of tokens ahead of time (e.g., hard to know which model the user is using, since they may use an OpenAI-compatible API they host themselves, with a non-descriptive model name) - so it is better to be lazy and catch all the context window errors.

In addition to "condense on context window error", the user can optionally specify a max context limit (e.g., 32k) as a config.toml variable to save costs. We don't need to estimate the number of tokens as accurately (as in the previous case) to do this -- we can just use one tokenizer (e.g., Llama's or GPT's tiktoken) to get a rough estimate to control the cost.
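
A minimal sketch of the lazy behavior, assuming the LLM call raises litellm's ContextWindowExceededError and a condenser exposing a condense() method (the wrapper and helper names are illustrative, not the actual implementation):

from litellm.exceptions import ContextWindowExceededError

def completion_with_lazy_condense(llm, condenser, messages):
    # Only involve the condenser when the provider actually rejects the
    # request because the prompt exceeds the model's context window.
    try:
        return llm.completion(messages=messages)
    except ContextWindowExceededError:
        condensed = condenser.condense(messages, llm=llm)
        return llm.completion(messages=condensed)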

Potential ways to simplify the interface

I think there's some chance we might be able to simplify this condenser interface. The input to the condenser would be an LLM plus a list of "events". Each element of "events" can have an attribute called should_condense, set to False for every important message (e.g., the SYSTEM message and the initial instruction; intermediate user messages should be kept as much as possible but can be condensed - so maybe a notion of priority is needed). The output of the condenser would also be a list of "events", shorter than the input.
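
A sketch of that simplified shape (the Event dataclass below is a stand-in for illustration, not the project's actual event classes):

from dataclasses import dataclass

@dataclass
class Event:
    source: str                    # 'user', 'agent', or 'env'
    content: str
    should_condense: bool = True   # False for the system message, the initial instruction, ...

def condense(llm, events: list) -> list:
    # returns a shorter list of events; entries with should_condense=False
    # pass through unchanged, the rest may be summarized by the llm
    ...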

Potential ways to make the condense process more general

Here's how I think about the condensation process.

The example input can be something like:

USER: Plot a figure y=x using matplotlib for me.

AGENT (CmdRunAction): Sure, let me PIP install
<execute_bash>
pip install matplotlib
</execute_bash>

ENV (CmdRunObservation): [a bunch of logs]

AGENT (IPythonRunAction): Let me restart the jupyter kernel
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [success]

AGENT (IPythonRunAction): Let me plot the figure
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [figure]

AGENT (AgentMessage): Done! What's next

IMO the heuristic for condensation could be:

  1. It is important to keep messages from the beginning (system message & initial instruction should always be kept)
  2. More recent history is probably more important (e.g., what you attempted recently) than actions the agent attempted a few turns ago.
  3. You can try to lazily summarize things between the initial instruction AND the most recent message -- you should start with turns as early as possible (i.e., first turn with should_condense=True) for condensation.
  4. We can create a new Observation called SummarizedObservation, which is similar to a Message that contains a textual summary of what the agent did and what the agent observed.
  5. We should split the list of events into multiple chunks. Each chunk MUST start with an Action with source='agent' (the agent does something) - this is essentially the delimiter. This means we can condense each chunk into one sentence: "The agent did XXX, and got YYY as a response" (see the sketch after this list).
  6. The minimal unit of summarization can be such a chunk: the agent did something, then observed something. That is, you can summarize a list of events (a chunk) into one SummarizedObservation. In this way, as long as an agent can convert a SummarizedObservation into user and/or assistant messages for the chat_completion API, it can naturally support the memory condenser and get all the benefits!
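
As mentioned in point 5, the chunking step could look something like this (a sketch that assumes events expose a source attribute and an is_action flag; both names are assumptions, not the project's actual API):

def split_into_chunks(events):
    # Each chunk starts with an action whose source is 'agent'; the
    # observations that follow belong to that chunk. Leading user/system
    # events end up in a first chunk that the condenser should leave alone.
    chunks = []
    current = []
    for event in events:
        if getattr(event, 'is_action', False) and event.source == 'agent':
            if current:
                chunks.append(current)
            current = [event]
        else:
            current.append(event)
    if current:
        chunks.append(current)
    return chunks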

Based on the previous trajectory, here's an example of running condense ONCE:

The result can look something like:

USER: Plot a figure y=x using matplotlib for me.

---CONDENSED CONTENT---
AGENT: "[... Performed some actions ...]"
USER (SummarizedObservation): 
You have successfully used `pip` to install `matplotlib`. The output log looks pretty successful.
[[TODO: We can also save all these intermediate events before "condensation" to a directory inside the sandbox,
so that we can also include a message here like "the output log is in `/opendevin/logs/turn_2.log`"
so the agent can actually look back at the details when needed (long-term memory!)]]
---

AGENT (IPythonRunAction): Let me restart the jupyter kernel
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [success]

AGENT (IPythonRunAction): Let me plot the figure
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [figure]

AGENT (AgentMessage): Done! What's next

If after ONE condense run it still throws a context limit error, we can do it again (by appending the summarization from the next "chunk"):

The result then looks something like:

USER: Plot a figure y=x using matplotlib for me.

---CONDENSED CONTENT---
AGENT: "[... Performed some actions ...]"
USER (SummarizedObservation): 
1. You have successfully used `pip` to install `matplotlib`. The output log looks pretty successful.
2. You have successfully restarted the jupyter kernel in order to make `matplotlib` importable in the Jupyter notebook. The return value suggests success.
---

AGENT (IPythonRunAction): Let me plot the figure
<execute_ipython>
XXX
</execute_ipython>

ENV (IPythonRunObservation): [figure]

AGENT (AgentMessage): Done! What's next

This is a very long comment ;) - Always happy to discuss more!

# summarize the short term history (events) if necessary
if self.memory_condenser.needs_condense(...):
    summary_response = self.memory_condenser.condense(
        summarize_prompt=prompt, llm=self.llm
    )
Collaborator


I'm actually wondering if we can go with a lazy approach: only involve the condenser if the chat_completion API throws a context window error (e.g., the error message contains "context window" - we then cast the error into a ContextWindowException and invoke the condenser?).

The issue with the litellm tokenizer is that it probably doesn't properly support OSS LLMs, which might have different tokenizers; we only know we exceeded the context window once we've issued the chat_completion request and received an exception.

Collaborator Author


OK! Litellm has a ContextWindowExceededError that it maps for quite a few providers: https://litellm.vercel.app/docs/exception_mapping#custom-mapping-list . It includes Hugging Face and OpenAI-compatible providers, but it doesn't include ollama. What they mean by OpenAI-compatible providers appears to be a fairly large list of local LLMs, served via LM Studio or similar software.

litellm uses the hugging_face tokenizer or the openai tokenizer. And wow, now that I'm looking at it again, just two weeks ago they added support for a custom hugging face tokenizer that the client code can pass in: BerriAI/litellm@2fd2e81

Totally agree anyway; I'm making this happen on error, or not using the tokenizer check by default.


Parameters:
- llm: LLM to be used for summarization.
- default_events: List of default events that should remain unchanged.
Collaborator


Maybe these default_events would be the "prefix" -- i.e., the thing that should never change would be the system message + the initial instruction? Or do you think it is better if we just pass in a list of events and add an additional attribute like can_summarize, set to False for these important messages? That way we can combine default_events and recent_events into one events list and simplify the interface a bit.

Collaborator Author


I see. I'm looking at CodeAct, and these default_events are kind of what CodeAct has as EXAMPLES, and maybe similar to GITHUB_MESSAGE and COMMAND_DOCS. For CodeAct these are included in the system prompt and that's it.

There is one difference though: for Monologue, they are Events:

  • it has two lists of "first lessons"
  • one is parsed as actual events, default_events ("initial thoughts": hey, it's waking up in this docker world 😄)
  • the other is documentation of the actions it knows
  • both are supposed to be in a "prefix" of sorts, that is, in the action prompt at all times.

Maybe this particular Monologue list is not even needed in the summarize prompt, only in the action prompt... I thought it makes sense to give the LLM context with examples here too, but I'm not too sure.

On the other hand, I was also thinking it's possible that we add others, not hard-coded in the prompts but obtained at runtime from another place (an "action library" or "skills" library). That may look like the GITHUB_MESSAGE, with or without an example. I thought that for monologue, this means it could also add some actions/observations to its "real-like" default_events list.

Review threads: opendevin/memory/condenser.py, agenthub/monologue_agent/utils/prompts.py
@xingyaoww
Collaborator

Btw, if this ^^ is completed, I think this PR has the potential to solve #1748

@SmartManoj SmartManoj mentioned this pull request May 19, 2024
@neubig
Contributor

neubig commented May 23, 2024

Hey @enyst , could you ping me when the conflicts are resolved and I'll take a look?

@enyst
Collaborator Author

enyst commented May 23, 2024

@neubig Thank you. I have a WIP to implement the behavior suggested by @xingyaoww and apply it to CodeAct. I think it's best to close this one and open a new PR, from a branch that was restarted almost clean, with a patch applied to main. (I'll also move it to a branch in the main repo.)

This was initially made with monologue in mind, but I don't know; applying the older version won't really work as such, because monologue is also cut off from the condenser at the moment, since we've updated it to not store its own event history in #1863. We can re-attach it when that's done and summarized events/messages are restored in the event stream. (I think?)

This got so interesting when I attempted to apply these insights to how it could work. I feel it deeply changes both the process and even whether the summary is an Action or an Observation 😅. The best part is, CodeAct itself is working on it!

@neubig
Contributor

neubig commented May 23, 2024

Cool, we can close this, but that's exciting!

@neubig neubig closed this May 23, 2024
@enyst enyst mentioned this pull request May 23, 2024