Support Multimodal Input for Agents #1914

Closed
wants to merge 15 commits

Conversation

mczhuge
Contributor

@mczhuge mczhuge commented May 20, 2024

Hi @neubig, @xingyaoww, @li-boxuan, or anyone who is interested,

I am currently preparing for MLAgentBench evaluation and supporting the GAIA evaluation.

@Jiayi-Pan, please see GPTSwarm, where these multimodal readers are crucial for enhancing GAIA's performance.

Need help with

To maintain consistency, could someone familiar with the project use litellm to replace the current OpenAI API call in this file?

Next steps

@Jiayi-Pan, we can benchmark GAIA together. I have the datasets and a primary agent for GAIA available here.

Collaborator

@xingyaoww xingyaoww left a comment

I think we can actually keep a lot of these good tools (e.g., ImageReader, VideoReader) and remove some of them (e.g., the zip reader - the agent can just unzip in a bash shell). This is probably a good time for us to build the "agent skill library".

"""To be overriden by the descendant class"""


class TXTReader(Reader):
Collaborator

Do these mean we need to expose a lot of tools to the model for these evals?
Maybe we could pack them into plugins and prompt the model to use them - maybe only for evaluation for now (before the model is really good at perceiving multimodal inputs directly)?

Collaborator

@li-boxuan li-boxuan May 21, 2024

I feel like (eventually) they should be micro-agents. Prompting the model about the usage of so many tools (and the list will hypothetically continue to grow) would be challenging.

Contributor Author

Based on current usage, the first step is to consider these as interfaces for reading multimodal files. The next step is to integrate them into plugins. However, I personally prefer to create a unified reader action module that is strongly linked to actions and observations (no rush on this). What's your opinion?

# Translate the audio file to English text with OpenAI's Whisper API
transcript = client.audio.translations.create(
    model='whisper-1', file=audio_file
)
return transcript.text
Collaborator

We probably need a good way to track the cost for this too...

Contributor Author

If we change to litellm, it may be easier to track?
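
For what it's worth, a minimal sketch of what litellm-based cost tracking could look like for one of these readers. This is illustrative only: the model choice and the describe_image helper are assumptions, but litellm.completion and litellm.completion_cost are existing litellm functions.

import litellm


def describe_image(b64_image: str) -> tuple[str, float]:
    """Ask a vision model to describe an image and return the text plus the call's USD cost.

    Sketch only, not this PR's implementation.
    """
    response = litellm.completion(
        model='gpt-4o',  # any litellm-supported vision model
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'Describe this image.'},
                {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64_image}'}},
            ],
        }],
    )
    # litellm can estimate the cost of a completion from its token usage
    cost = litellm.completion_cost(completion_response=response)
    return response.choices[0].message.content, cost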

@rbren
Collaborator

rbren commented May 20, 2024

Looks neat! A few things on structure:

  • dependencies should go in pypackage.yml
  • this file should probably go in opendevin/llm/readers.py
  • we should probably make this functionality available to the LLM class somehow

Collaborator

@rbren rbren left a comment

See requests on structure above

@Shimada666
Contributor

Tomorrow I will try to replace the OpenAI API call with litellm.

@mczhuge
Contributor Author

mczhuge commented May 20, 2024

Tomorrow I will try to replace the OpenAI API call with litellm.

Thanks so much!

@mczhuge
Contributor Author

mczhuge commented May 20, 2024

  • dependencies should go in pypackage.yml
  • this file should probably go in opendevin/llm/readers.py
  • we should probably make this functionality available to the LLM class somehow

Yes @rbren, I am still getting familiar with the abstractions of the whole OpenDevin system.

Comment on lines 48 to 50
# TODO: Double-check where the interface for the stored API key is in OpenDevin.
# TODO: Or ask someone familiar to change it with litellm.
OPENAI_API_KEY = 'PUT_YOUR_OPEN_AI_API'
Collaborator

@yufansong yufansong May 20, 2024

You can take here and here as a reference.

Collaborator

Take a look; I prefer to solve it in the future, because we are not sure whether we will change to litellm.

Contributor Author

I agree. I simply use the config to get the API key now.
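
As a rough illustration of that change (hypothetical names only; the actual OpenDevin config accessor may differ), the hardcoded key becomes a lookup:

# Hypothetical sketch: fetch the key from configuration instead of hardcoding it.
# `get_llm_api_key` stands in for whatever accessor the OpenDevin config actually exposes.
import os

from openai import OpenAI


def get_llm_api_key() -> str:
    # Placeholder for the real config lookup; falls back to the environment here.
    return os.environ['LLM_API_KEY']


client = OpenAI(api_key=get_llm_api_key())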

@xingyaoww
Collaborator

An experimental idea: we can include all these tools (e.g., opening PDFs, etc.) in the agentskills library: #1941

Collaborator

@yufansong yufansong left a comment

I did some work to make sure this PR will not block other PRs' progress.

  1. Move the file into opendevin/llm/reader.py
  2. Move the dependency installation into the Poetry pyproject.toml and resolve the conflict with the main branch.
  3. For cost, we also need to add cost tracking in other parts, so I prefer to solve that together in another PR.
  4. We may discuss whether we will shift to litellm.

@xingyaoww @li-boxuan @rbren WDYT

@Shimada666
Contributor

Shimada666 commented May 21, 2024

@mczhuge @yufansong

I did some work to make sure this PR will not block other PRs' progress.

  1. Move the file into opendevin/llm/reader.py
  2. Move the dependency installation into the Poetry pyproject.toml and resolve the conflict with the main branch.
  3. For cost, we also need to add cost tracking in other parts, so I prefer to solve that together in another PR.
  4. We may discuss whether we will shift to litellm.

@xingyaoww @li-boxuan @rbren WDYT

Oh, I've done some work to replace the OpenAI API call with litellm, and I've also done some code optimizations. Please discuss whether we will shift to litellm! And if litellm is confirmed to be needed, I will submit my code after this PR is merged.

@mczhuge mczhuge requested a review from rbren May 21, 2024 14:43
@mczhuge
Contributor Author

mczhuge commented May 21, 2024

@rbren Seems this PR is ready, and @Shimada666 will further improve it after merging.

Collaborator

@li-boxuan li-boxuan left a comment

This code will reside in our core package, so we really need some unit tests. It shouldn't be hard to do for many of the readers here. I am okay with doing it later as long as we do it eventually...

I added a unit test as an example.
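
For reference, such a test could look like the sketch below (pytest style). The import path follows the opendevin/llm/readers.py location proposed above, and the parse method name is an assumption, since the reader interface isn't shown in full here.

# Sketch only: TXTReader's actual interface may differ from what is assumed here.
from opendevin.llm.readers import TXTReader  # assumed module path from the structure suggestions above


def test_txt_reader_reads_plain_text(tmp_path):
    sample = tmp_path / 'hello.txt'
    sample.write_text('hello multimodal world')

    reader = TXTReader()
    # `parse` is an assumed method name for "read this file and return its text"
    assert 'hello multimodal world' in reader.parse(str(sample))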

@xingyaoww xingyaoww added the architecture and evaluation labels May 22, 2024
@Shimada666
Contributor

It seems there are some conflicts that need to be resolved.
I hope this PR can be merged soon. I'm eager to contribute! 😉

@mczhuge
Contributor Author

mczhuge commented May 22, 2024

It seems there are some conflicts that need to be resolved.
I hope this PR can be merged soon. I'm eager to contribute! 😉

Me too. Hope this PR could be merged soon.

@mczhuge
Contributor Author

mczhuge commented May 22, 2024

It seems there are some conflicts that need to be resolved.
I hope this PR can be merged soon. I'm eager to contribute! 😉

Solved the conflicts.

Contributor Author

@mczhuge mczhuge left a comment

Looks good now.

Collaborator

@rbren rbren left a comment

This is really neat functionality! But we need to think through this a bit more thoroughly.

It looks like these readers are reading from paths on the local disk, where the API server is running. They should really be reading files via the runtime, which has access to the user's workspace. Otherwise we'll end up with problems when, e.g., the user is using E2B or another remote runtime. Not to mention security concerns.

For the multimodal cases (e.g. images) we'll need to figure out a better way to stick their output into the EventStream (maybe just the b64 image?), and how to let the agent pass the data to the LLM (which you're getting at with prepare_api_call).

I'm imagining a flow like this:

  • Agent sends a readFile(path="foo.pdf") action
  • We read the bytes of the file from the runtime
  • We do any post-processing of the bytes using these readers (e.g. extracting PDF text)
  • We put the resulting FileReadObservation into the event stream

If there's a FileReadObservation for e.g. a PNG image, the agent would be responsible for doing the work currently in prepare_api_call--we should probably put helper functions in the LLM class (e.g. get_message_for_image(b64: string))
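
Purely as an illustration of that flow: the readFile / FileReadObservation / get_message_for_image names come from the discussion above, but the field names and the runtime.read_bytes call are assumptions, not the project's actual API.

import base64
from dataclasses import dataclass


@dataclass
class FileReadAction:
    path: str  # e.g. 'foo.pdf' or 'diagram.png'


@dataclass
class FileReadObservation:
    path: str
    content: str    # extracted text, or a base64 string for images
    mime_type: str  # lets the agent decide how to hand the content to the LLM


def read_file(action: FileReadAction, runtime) -> FileReadObservation:
    """Read bytes via the runtime (not the API server's local disk), then post-process them."""
    raw: bytes = runtime.read_bytes(action.path)  # hypothetical runtime API
    if action.path.endswith('.png'):
        b64 = base64.b64encode(raw).decode()
        return FileReadObservation(action.path, b64, 'image/png')
    # ...other readers (PDF text extraction, audio transcription, etc.) would go here
    return FileReadObservation(action.path, raw.decode('utf-8', errors='replace'), 'text/plain')


def get_message_for_image(b64: str) -> dict:
    """Helper the LLM class could expose so the agent can pass an image to a vision model."""
    return {
        'role': 'user',
        'content': [
            {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}},
        ],
    }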

@neubig
Contributor

neubig commented May 22, 2024

we'll need to figure out a better way to stick their output into the EventStream (maybe just the b64 image?)

Note that this is already what we do with browser screenshots I believe, so we can probably do the same thing for images read from disk.

@xingyaoww
Collaborator

Another alternative is that we can port these multi-modal readers into the agentskills library: #1941

That is:

It won't take too much work for now (just copy to a different file and tweak the prompt) - then we can systematically refactor the whole agentskills library to be compatible with EventStream (e.g., send structured information to the EventStream from the sandbox code execution) - it is the next step of agentskills anyway. How does that sound? It would help unblock the integration of these GAIA benchmarks.
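
For illustration, an agentskills-style reader could be a plain function that prints what it finds so the sandboxed agent sees the output. The parse_pdf name and the choice of pypdf below are assumptions for this sketch, not the actual contents of #1941 or #2016.

# Sketch of an agentskills-style reader: a plain function executed in the sandbox
# that prints its result so the agent can observe it. The function name and the
# library choice (pypdf) are assumptions for illustration.
from pypdf import PdfReader


def parse_pdf(file_path: str) -> None:
    """Print the text content of a PDF so the agent can read it from stdout."""
    reader = PdfReader(file_path)
    print(f'[Reading PDF from {file_path}]')
    for page_number, page in enumerate(reader.pages, start=1):
        print(f'--- page {page_number} ---')
        print(page.extract_text() or '(no extractable text)')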

@Jiayi-Pan
Contributor

This is awesome! This looks like a good starting point to get multi-modal understanding / gaia agent started.

Somewhat orthogonal to this, I do wonder if we should also consider a general approach in the long run (see discussion #1911 between @frankxu2004, @li-boxuan, and myself).

Basically, once we enable the agent to directly read the browser's pixel observation space, it indirectly solves image, video, and PDF understanding, since the agent can just open the file in the browser and read it.

For other files like DOCX and JSON, since our agent is capable of installing packages and has a Jupyter REPL interface, it can figure out how to read them by itself.

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multimodal understanding skills on its own.
The agent decided to first install the python-docx package and use it within Jupyter to assist with understanding the DOCX document.
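
For reference, reading a DOCX with python-docx really is only a few lines. The snippet below merely illustrates the kind of code the agent produced (the file name is hypothetical); it is not a transcript of the actual session.

# Illustrative only: roughly what the agent can run after `pip install python-docx`.
from docx import Document

doc = Document('report.docx')  # hypothetical file name
for paragraph in doc.paragraphs:
    if paragraph.text.strip():  # skip empty paragraphs
        print(paragraph.text)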

@xingyaoww
Collaborator

@Jiayi-Pan agreed! For the time being (I mean for the NeurIPS submission) we can assume the agent can only perceive multimodal inputs via tools (e.g., those defined in this PR). But as a natural next step, we should allow the agent to directly perceive raw pixels from the browser, which would "solve image, video, and PDF understanding, since the agent can just open the file in the browser and read it," as you described.

@frankxu2004
Collaborator

Agreed. In the long run we can probably rely on an OS or browser interface to handle these observations (audio capture, anyone?), without having to hand-craft too many tools. But to make things practical, creating these programmatic tools is useful for now!

@mczhuge
Contributor Author

mczhuge commented May 23, 2024

Considering the current situation, I think it is a good starting point to first use the current readers, link them into actions and observations, and then improve these functions iteratively.

Regarding the concept of a 'pixel observation space', I believe it can become feasible with the release of GPT-5 (or more powerful LLMs). Currently, translating any modality simply by observing it can lead to uncontrollable hallucinations and loss of information.

@neubig
Contributor

neubig commented May 23, 2024

Thanks for all the discussion everyone! Do we know what the next steps are? Asking just because it's kinda time sensitive.

@Shimada666
Contributor

Shimada666 commented May 23, 2024

@neubig
As @xingyaoww mentioned, this is a good opportunity to build the agent skill library, so I have integrated the capabilities of some readers into agentskills, building on this PR. However, the decision on which readers to retain still requires input from everyone. I hope to get everyone's suggestions! See PR: #2016

@xingyaoww
Collaborator

Thanks @mczhuge for contributing these multi-modal readers! #2016 refactored those and included them into agentskills -- let's build on top of these tools for a strong GAIA agent :)
