support ask_review by llm #959

Open
wants to merge 12 commits into base: code_interpreter

Conversation

@orange-crow (Contributor) commented Mar 5, 2024

Add ask_review by LLM, and adjust the generated code to fit the Jupyter notebook.

Features

  • add 3 review_type options: ["human", "llm", "confirm_all"]

Result

If the generated code executes successfully but does not match the task requirements, the LLM-based ask_review can detect this and suggest modifications. For example:

import pytest

# Import paths below are assumed from the files listed in this PR's Codecov report.
from metagpt.actions.di.ask_review import AskReview
from metagpt.schema import Message


@pytest.mark.asyncio
async def test_ask_review_llm():
    context = [
        Message("Train a model to predict wine class using the training set."),
        Message(
               """
               from sklearn.datasets import load_wine
               wine_data = load_wine()
               plt.hist(wine_data.target, bins=len(wine_data.target_names))
               plt.xlabel('Class')
               plt.ylabel('Number of Samples')
               plt.title('Distribution of Wine Classes')
               plt.xticks(range(len(wine_data.target_names)), wine_data.target_names)
               plt.show()
               """
        ),
    ]
    rsp, confirmed = await AskReview().run(context, review_type="llm")
    assert rsp.startswith(("redo", "change"))   # -> True
    assert not confirmed                        # -> True
    print(rsp)
    # ```
    # redo the task, the provided code only includes data loading and visualization, but does not include any steps related 
    # to training a model.
    # ```


@codecov-commenter
Copy link

codecov-commenter commented Mar 5, 2024

Codecov Report

Attention: Patch coverage is 92.30769%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 82.73%. Comparing base (0271cd7) to head (5ff3cd1).
Report is 17 commits behind head on code_interpreter.

Files Patch % Lines
metagpt/actions/di/ask_review.py 91.30% 2 Missing ⚠️
metagpt/roles/di/data_interpreter.py 75.00% 1 Missing ⚠️
metagpt/strategy/planner.py 93.33% 1 Missing ⚠️


Additional details and impacted files
@@                 Coverage Diff                  @@
##           code_interpreter     #959      +/-   ##
====================================================
+ Coverage             82.70%   82.73%   +0.03%     
====================================================
  Files                   223      223              
  Lines                 13129    13164      +35     
====================================================
+ Hits                  10858    10891      +33     
- Misses                 2271     2273       +2     


@orange-crow orange-crow changed the title support ask_review by llm, and adjust the generated code to fit the j… support ask_review by llm, and adjust the generated code to fit the jupyter notebook. Mar 5, 2024
metagpt/actions/mi/ask_review.py (outdated review thread)
tests/metagpt/actions/mi/test_ask_review.py (outdated review thread)
Comment on lines 113 to 126
if confirmed and task_result and task_result.code:
    review_msg = (
        "The code was executed successfully. Please review the code and execution results to evaluate"
        f"execution results are: {task_result.result}"
        "whether the task was completed."
    )
elif not confirmed and task_result and task_result.code:
    review_msg = (
        "The code execution failed. Please reflect on the reason for the failure based on the above content"
        f"execution results are: {task_result.result}"
        "and try another method to generate new code."
    )
else:
    review_msg = "Please review the above content."
Collaborator

Suggest putting everything into a single prompt; there is no need to branch conditionally in the code. From the context, the LLM should be able to tell whether the preceding code succeeded or failed.
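
A minimal sketch of the consolidated approach being suggested here; the constant name, wording, and helper below are illustrative assumptions, not code from this PR:

# Illustrative only: one review prompt, no success/failure branching in Python.
# The LLM infers from the surrounding context whether the code ran successfully.
CONSOLIDATED_REVIEW_PROMPT = (
    "The above contains the latest task, the code written for it, and its execution result. "
    "Judge from the context whether the execution succeeded or failed, then decide whether "
    "the task was actually completed. If not, start your reply with 'redo' or 'change' and "
    "give concrete modification suggestions; otherwise reply 'confirm'."
)


def build_review_msg(task_result) -> str:
    # Simply append the execution result (if any) and the single prompt;
    # the conditional branches in the reviewed snippet above are no longer needed.
    result_text = task_result.result if task_result else ""
    return f"Execution result:\n{result_text}\n\n{CONSOLIDATED_REVIEW_PROMPT}"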

context: list[Message] = [],
plan: Plan = None,
trigger: str = ReviewConst.TASK_REVIEW_TRIGGER,
review_type: Literal["human", "llm", "confirm_all"] = "human",
Collaborator

confirm_all could be renamed to disabled. Also, a new parameter is added here, but I don't see the upstream planner and interpreter passing it in. You could remove auto_run, use a review_type attribute instead, and propagate it downward.

Contributor Author

Should auto_run in planner and interpreter be removed as well?

Collaborator

Yes, that's right.
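
A rough sketch of the propagation agreed on in this thread; Planner and DataInterpreter are heavily simplified stand-ins for the real metagpt classes, and only the review_type plumbing is shown:

from typing import Literal

from metagpt.actions.di.ask_review import AskReview  # path as listed in this PR's changed files

ReviewType = Literal["human", "llm", "confirm_all"]


class Planner:
    """Simplified stand-in for metagpt.strategy.planner.Planner."""

    def __init__(self, review_type: ReviewType = "human"):
        # review_type replaces the old auto_run flag
        self.review_type = review_type

    async def ask_review(self, context, plan=None):
        # forward the role-level setting all the way down to AskReview.run
        return await AskReview().run(context, plan, review_type=self.review_type)


class DataInterpreter:
    """Simplified stand-in for metagpt.roles.di.data_interpreter.DataInterpreter."""

    def __init__(self, review_type: ReviewType = "human"):
        # the role simply forwards its review_type to its planner
        self.planner = Planner(review_type=review_type)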

@garylin2099 (Collaborator) commented, quoting the PR description:

Add ask_review by LLM, and adjust the generated code to fit the Jupyter notebook.

Features

  • add 3 review_type options: ["human", "llm", "confirm_all"]
  • update DEFAULT_SYSTEM_MSG in BaseWriteAnalysisCode, making it clear that the executor is a Jupyter notebook.

Result

If the generated code executes successfully but does not match the task requirements, the LLM-based ask_review can detect this and suggest modifications. For example:

@pytest.mark.asyncio
async def test_ask_review_llm():
    context = [
        Message("Train a model to predict wine class using the training set."),
        Message(
               """
               from sklearn.datasets import load_wine
               wine_data = load_wine()
               plt.hist(wine_data.target, bins=len(wine_data.target_names))
               plt.xlabel('Class')
               plt.ylabel('Number of Samples')
               plt.title('Distribution of Wine Classes')
               plt.xticks(range(len(wine_data.target_names)), wine_data.target_names)
               plt.show()
               """
        ),
    ]
    rsp, confirmed = await AskReview().run(context, review_type="llm")
    assert rsp.startswith(("redo", "change"))   # -> True
    assert not confirmed                        # -> True
    print(rsp)
    # ```
    # change task current task, split the task into two separate tasks: 
    # 1. Train a model to predict wine class using the training set.
    # 2. Visualize the distribution of wine classes with a histogram.
    # ```


Prefer a more practical example. Review is useful when errors occur. Perhaps give an example showing how it handles errors, such as suggesting "redo" with feedback, or "change" with updated current task instruction?
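
A sketch of the kind of error-handling example being requested here; the failing snippet and the expected "redo"/"change" response are assumptions for illustration, following the same test style as above:

import pytest

from metagpt.actions.di.ask_review import AskReview  # assumed path, as above
from metagpt.schema import Message


@pytest.mark.asyncio
async def test_ask_review_llm_on_error():
    # Hypothetical scenario: the generated code failed at runtime, so the LLM
    # review should ask for a redo/change instead of confirming the task.
    context = [
        Message("Train a model to predict wine class using the training set."),
        Message(
            """
            from sklearn.model_selection import train_test_split
            X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target)
            # NameError: name 'wine' is not defined
            """
        ),
    ]
    rsp, confirmed = await AskReview().run(context, review_type="llm")
    assert rsp.startswith(("redo", "change"))
    assert not confirmed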

support ask_review by llm, and adjust the generated code to fit the jupyter notebook.

- add 3 review_type options: ["human", "llm", "confirm_all"]
- update DEFAULT_SYSTEM_MSG in BaseWriteAnalysisCode, making it clear that the executor is a Jupyter notebook
- Refine sys_msg as a class ReviewConst variable SYS_MSG (a sketch follows these commit notes)
- Use rsp.strip() instead of removing \n explicitly
- Rename _rsp to llm_rsp for clarity
- Correct indentation
- Remove `confirm_all` parameter and use `review_type` instead
- Consolidate prompts into a single one for better context
- Add tests for the changes

This commit addresses the code review comments:

1. Removed the `confirm_all` parameter and introduced the `review_type`
  attribute which is propagated down from the planner and interpreter layers.
  The `auto_run` parameter has been removed as well.

2. Consolidated the prompts into a single one to provide better context
  to the LLM about the state of the previous code execution.
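
Following up on the "Refine sys_msg as a class ReviewConst variable SYS_MSG" bullet above, a minimal sketch of what that change might look like; the SYS_MSG wording and the trigger values shown are assumptions, not the PR's actual text:

class ReviewConst:
    # trigger values are illustrative; TASK_REVIEW_TRIGGER is referenced in the run() signature above
    TASK_REVIEW_TRIGGER = "task"
    CODE_REVIEW_TRIGGER = "code"

    # system message moved out of the method body into a class-level constant
    SYS_MSG = (
        "You are an experienced reviewer. Given the task, the code written for it, and its "
        "execution result, judge whether the task is truly completed. Reply 'confirm' if it "
        "is; otherwise start with 'redo' or 'change' and explain what should be fixed."
    )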
@orange-crow orange-crow changed the title support ask_review by llm, and adjust the generated code to fit the jupyter notebook. support ask_review by llm Mar 19, 2024
Labels: enhancement (New feature or request)

4 participants