Support Multimodal Input for Agents #1914

Closed
wants to merge 15 commits

Conversation

mczhuge
Contributor

@mczhuge mczhuge commented May 20, 2024

Hi @neubig, @xingyaoww, @li-boxuan, or anyone who is interested,

I am currently preparing for MLAgentBench evaluation and supporting the GAIA evaluation.

@Jiayi-Pan, please see GPTSwarm, where these multimodal readers are crucial for enhancing GAIA's performance.

Need help with

To maintain consistency, could someone familiar with the project use litellm to replace the current OpenAI API call in this file?

Next steps

@Jiayi-Pan, we can benchmark GAIA together. I have the datasets and a primary agent for GAIA available here.

Collaborator

@xingyaoww xingyaoww left a comment

I think we can actually keep a lot of these good tools (e.g., ImageReader, VideoReader) and remove some of them (e.g., the zip reader - the agent can just unzip in a bash shell). This is probably a good time for us to build the "agent skill library".

"""To be overriden by the descendant class"""


class TXTReader(Reader):
Collaborator

Do these mean we need to expose a lot of tools to the model for these evals?
Maybe we could pack them into plugins and prompt the model to use them - maybe only for evaluation for now (before the model is really good at perceiving multimodal inputs directly)?

Collaborator

@li-boxuan li-boxuan May 21, 2024

I feel like (eventually) they should be micro-agents. Prompting the model about the usage of so many tools (and the list will hypothetically continue to grow) would be challenging.

Contributor Author

Based on current usage, the first step is to consider these as interfaces for reading multimodal files. The next step is to integrate them into plugins. However, I personally prefer to create a unified reader action module that is strongly linked to actions and observations (no rush on this). What's your opinion?

# Translate the audio file to English text with OpenAI's Whisper API
transcript = client.audio.translations.create(
    model='whisper-1', file=audio_file
)
return transcript.text
Collaborator

We probably need a good way to track the cost for this too...

Contributor Author

If we change to litellm, it may be easier to track?
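
For what it's worth, a minimal sketch of what litellm-based cost tracking could look like for one of these readers. This is illustrative only: the model choice and the describe_image helper are assumptions, but litellm.completion and litellm.completion_cost are existing litellm functions.

import litellm


def describe_image(b64_image: str) -> tuple[str, float]:
    """Ask a vision model to describe an image and return the text plus the call's USD cost.

    Sketch only, not this PR's implementation.
    """
    response = litellm.completion(
        model='gpt-4o',  # any litellm-supported vision model
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'Describe this image.'},
                {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64_image}'}},
            ],
        }],
    )
    # litellm can estimate the cost of a completion from its token usage
    cost = litellm.completion_cost(completion_response=response)
    return response.choices[0].message.content, cost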

@rbren
Collaborator

rbren commented May 20, 2024

Looks neat! A few things on structure:

  • dependencies should go in pypackage.yml
  • this file should probably go in opendevin/llm/readers.py
  • we should probably make this functionality available to the LLM class somehow

Collaborator

@rbren rbren left a comment

See requests on structure above

@Shimada666
Contributor

Tomorrow I will try to replace the OpenAI API call with litellm.

@mczhuge
Contributor Author

mczhuge commented May 20, 2024

Tomorrow I will try to replace the OpenAI API call with litellm.

Thanks so much!

@mczhuge
Contributor Author

mczhuge commented May 20, 2024

  • dependencies should go in pypackage.yml
  • this file should probably go in opendevin/llm/readers.py
  • we should probably make this functionality available to the LLM class somehow

Yes @rbren, I am still getting familiar with the abstractions of the whole OpenDevin system.

Comment on lines 48 to 50
# TODO: Double-check where the interface for the stored API key is in OpenDevin.
# TODO: Or ask someone familiar to change it with litellm.
OPENAI_API_KEY = 'PUT_YOUR_OPEN_AI_API'
Collaborator

@yufansong yufansong May 20, 2024

You can take here and here as a reference.

Collaborator

Take a look; I prefer to solve it in the future, because we are not sure whether we will change to litellm.

Contributor Author

I agree. I simply use the config to get the API key now.
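
As a rough illustration of that change (hypothetical names only; the actual OpenDevin config accessor may differ), the hardcoded key becomes a lookup:

# Hypothetical sketch: fetch the key from configuration instead of hardcoding it.
# `get_llm_api_key` stands in for whatever accessor the OpenDevin config actually exposes.
import os

from openai import OpenAI


def get_llm_api_key() -> str:
    # Placeholder for the real config lookup; falls back to the environment here.
    return os.environ['LLM_API_KEY']


client = OpenAI(api_key=get_llm_api_key())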

@xingyaoww
Collaborator

An experimental idea: we can include all these tools (e.g., opening PDFs, etc.) in the agentskills library: #1941

Collaborator

@yufansong yufansong left a comment

I did some work to make sure this PR will not block other PRs' progress.

  1. Move the file into opendevin/llm/reader.py
  2. Move the dependency installation into the Poetry pyproject.toml and resolve the conflict with the main branch.
  3. For cost, we also need to add cost tracking in other parts, so I prefer to solve that together in another PR.
  4. We may discuss whether we will shift to litellm.

@xingyaoww @li-boxuan @rbren WDYT

@Shimada666
Contributor

Shimada666 commented May 21, 2024

@mczhuge @yufansong

I did some work to make sure this PR will not block other PRs' progress.

  1. Move the file into opendevin/llm/reader.py
  2. Move the dependency installation into the Poetry pyproject.toml and resolve the conflict with the main branch.
  3. For cost, we also need to add cost tracking in other parts, so I prefer to solve that together in another PR.
  4. We may discuss whether we will shift to litellm.

@xingyaoww @li-boxuan @rbren WDYT

Oh, I've done some work to replace the OpenAI API call with litellm, and I've also done some code optimizations. Please discuss whether we will shift to litellm! And if litellm is confirmed to be needed, I will submit my code after this PR is merged.

@mczhuge mczhuge requested a review from rbren May 21, 2024 14:43
@mczhuge
Contributor Author

mczhuge commented May 21, 2024

@rbren Seems this PR is ready, and @Shimada666 will further improve it after merging.

Collaborator

@li-boxuan li-boxuan left a comment

This code will reside in our core package, so we really need some unit tests. It shouldn't be hard to do for many of the readers here. I am okay with doing it later as long as we do it eventually...

I added a unit test as an example.
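
For reference, such a test could look like the sketch below (pytest style). The import path follows the opendevin/llm/readers.py location proposed above, and the parse method name is an assumption, since the reader interface isn't shown in full here.

# Sketch only: TXTReader's actual interface may differ from what is assumed here.
from opendevin.llm.readers import TXTReader  # assumed module path from the structure suggestions above


def test_txt_reader_reads_plain_text(tmp_path):
    sample = tmp_path / 'hello.txt'
    sample.write_text('hello multimodal world')

    reader = TXTReader()
    # `parse` is an assumed method name for "read this file and return its text"
    assert 'hello multimodal world' in reader.parse(str(sample))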

@xingyaoww xingyaoww added the architecture and evaluation labels May 22, 2024
@Shimada666
Contributor

It seems there are some conflicts that need to be resolved.
I hope this PR can be merged soon. I'm eager to contribute! 😉

@mczhuge
Contributor Author

mczhuge commented May 22, 2024

It seems there are some conflicts that need to be resolved.
I hope this PR can be merged soon. I'm eager to contribute! 😉

Me too. Hope this PR could be merged soon.

@mczhuge
Contributor Author

mczhuge commented May 22, 2024

It seems there are some conflicts that need to be resolved.
I hope this PR can be merged soon. I'm eager to contribute! 😉

Solved the conflicts.

Contributor Author

@mczhuge mczhuge left a comment

Looks good now.

Collaborator

@rbren rbren left a comment

This is really neat functionality! But we need to think through this a bit more thoroughly.

It looks like these readers are reading from paths on the local disk, where the API server is running. They should really be reading files via the runtime, which has access to the user's workspace. Otherwise we'll end up with problems when, e.g., the user is using E2B or another remote runtime. Not to mention security concerns.

For the multimodal cases (e.g. images) we'll need to figure out a better way to stick their output into the EventStream (maybe just the b64 image?), and how to let the agent pass the data to the LLM (which you're getting at with prepare_api_call).

I'm imagining a flow like this:

  • Agent sends a readFile(path="foo.pdf") action
  • We read the bytes of the file from the runtime
  • We do any post-processing of the bytes using these readers (e.g. extracting PDF text)
  • We put the resulting FileReadObservation into the event stream

If there's a FileReadObservation for e.g. a PNG image, the agent would be responsible for doing the work currently in prepare_api_call--we should probably put helper functions in the LLM class (e.g. get_message_for_image(b64: string))
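
Purely as an illustration of that flow: the readFile / FileReadObservation / get_message_for_image names come from the discussion above, but the field names and the runtime.read_bytes call are assumptions, not the project's actual API.

import base64
from dataclasses import dataclass


@dataclass
class FileReadAction:
    path: str  # e.g. 'foo.pdf' or 'diagram.png'


@dataclass
class FileReadObservation:
    path: str
    content: str    # extracted text, or a base64 string for images
    mime_type: str  # lets the agent decide how to hand the content to the LLM


def read_file(action: FileReadAction, runtime) -> FileReadObservation:
    """Read bytes via the runtime (not the API server's local disk), then post-process them."""
    raw: bytes = runtime.read_bytes(action.path)  # hypothetical runtime API
    if action.path.endswith('.png'):
        b64 = base64.b64encode(raw).decode()
        return FileReadObservation(action.path, b64, 'image/png')
    # ...other readers (PDF text extraction, audio transcription, etc.) would go here
    return FileReadObservation(action.path, raw.decode('utf-8', errors='replace'), 'text/plain')


def get_message_for_image(b64: str) -> dict:
    """Helper the LLM class could expose so the agent can pass an image to a vision model."""
    return {
        'role': 'user',
        'content': [
            {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}},
        ],
    }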

@neubig
Contributor

neubig commented May 22, 2024

we'll need to figure out a better way to stick their output into the EventStream (maybe just the b64 image?)

Note that this is already what we do with browser screenshots I believe, so we can probably do the same thing for images read from disk.

@xingyaoww
Collaborator

Another alternative is that we can port these multi-modal readers into the agentskills library: #1941

That is:

It won't take too much work for now (just copy to a different file and tweak the prompt) - then we can systematically refactor the whole agentskills library to be compatible with EventStream (e.g., send structured information to the EventStream from the sandbox code execution) - it is the next step of agentskills anyway. How does that sound? It would help unblock the integration of these GAIA benchmarks.
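
For illustration, an agentskills-style reader could be a plain function that prints what it finds so the sandboxed agent sees the output. The parse_pdf name and the choice of pypdf below are assumptions for this sketch, not the actual contents of #1941 or #2016.

# Sketch of an agentskills-style reader: a plain function executed in the sandbox
# that prints its result so the agent can observe it. The function name and the
# library choice (pypdf) are assumptions for illustration.
from pypdf import PdfReader


def parse_pdf(file_path: str) -> None:
    """Print the text content of a PDF so the agent can read it from stdout."""
    reader = PdfReader(file_path)
    print(f'[Reading PDF from {file_path}]')
    for page_number, page in enumerate(reader.pages, start=1):
        print(f'--- page {page_number} ---')
        print(page.extract_text() or '(no extractable text)')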

@Jiayi-Pan
Contributor

This is awesome! This looks like a good starting point to get multi-modal understanding / gaia agent started.

Somewhat orthogonal to this, I do wonder if we should also consider a general approach in the long run (see discussion #1911 between @frankxu2004, @li-boxuan, and myself).

Basically, once we enable the agent to directly read the browser's pixel observation space, it indirectly solves image, video, and PDF understanding, since the agent can just open the file in the browser and read it.

For other files like DOCX and JSON, since our agent is capable of installing packages and has a Jupyter REPL interface, it can figure out how to read them by itself.

One interesting thing I discovered while testing a DOCX understanding question is that OpenDevin's agent has a sufficiently broad action space to develop multimodal understanding skills on its own.
The agent decided to first install the python-docx package and use it within Jupyter to assist with understanding the DOCX document.
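
For reference, reading a DOCX with python-docx really is only a few lines. The snippet below merely illustrates the kind of code the agent produced (the file name is hypothetical); it is not a transcript of the actual session.

# Illustrative only: roughly what the agent can run after `pip install python-docx`.
from docx import Document

doc = Document('report.docx')  # hypothetical file name
for paragraph in doc.paragraphs:
    if paragraph.text.strip():  # skip empty paragraphs
        print(paragraph.text)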

@xingyaoww
Collaborator

@Jiayi-Pan agreed! For the time being (I mean for the NeurIPS submission) we can assume the agent can only perceive multimodal inputs via tools (e.g., those defined in this PR). But as a natural next step, we should allow the agent to directly perceive raw pixels from the browser, which would "solve image, video, and PDF understanding, since the agent can just open the file in the browser and read it," as you described.

@frankxu2004
Collaborator

Agreed. In the long run we can probably rely on an OS or browser interface to handle these observations (audio capture, anyone?), without having to hand-craft too many tools. But to make things practical, creating these programmatic tools is useful for now!

@mczhuge
Contributor Author

mczhuge commented May 23, 2024

Considering the current situation, I think it is a good starting point to first use the current readers, link them into actions and observations, and then improve these functions iteratively.

Regarding the concept of a 'pixel observation space', I believe it can become feasible with the release of GPT-5 (or more powerful LLMs). Currently, translating any modality simply by observing it can lead to uncontrollable hallucinations and loss of information.

@neubig
Contributor

neubig commented May 23, 2024

Thanks for all the discussion everyone! Do we know what the next steps are? Asking just because it's kinda time sensitive.

@Shimada666
Contributor

Shimada666 commented May 23, 2024

@neubig
As @xingyaoww mentioned, this is a good opportunity to build the agent skill library, so I have integrated the capabilities of some readers into agentskills, building on this PR. However, the decision on which readers to retain still requires input from everyone. I hope to get everyone's suggestions! See PR: #2016

@xingyaoww
Collaborator

Thanks @mczhuge for contributing these multi-modal readers! #2016 refactored those and included them into agentskills -- let's build on top of these tools for a strong GAIA agent :)
