Fails with small models including llama3-8b #157

Open
TimeLordRaps opened this issue Apr 21, 2024 · 5 comments

TimeLordRaps commented Apr 21, 2024

Before submitting an issue, make sure you read the FAQ.md

Briefly describe your issue

I modified the code so that I could run this with any chat model in LangChain and test it on several different models. This took a few hours, so I don't recall everything I had to do to make it work, but my diffs indicate:

  1. Replacing self.model_name with the ChatOllama/ChatGroq model object so that it only needs to be loaded once (this matters more for Ollama),
    a. then changing each use of self.model_name to the llm, qa_llm, or whichever other LLM Voyager defines throughout the agent folders,
    b. then adding .invoke to the LLM calls.
  2. Swapping the OpenAI embeddings for snowflake-arctic-embed-l, loaded through HuggingFaceEmbeddings from langchain_community.embeddings.
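The change in step 1 can be sketched as follows, assuming the agent is refactored to accept an already-constructed chat model instead of a model name. `Message`, `ChatLike`, `EchoModel`, and `CurriculumAgent` are hypothetical stand-ins for illustration, not Voyager's actual classes; with LangChain, an object such as `ChatOllama(model="llama3")` would be passed in instead of the dummy model.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Message:
    """Hypothetical stand-in for a chat message (only .content is used)."""
    content: str


class ChatLike(Protocol):
    """Anything with an invoke() method returning an object with .content."""
    def invoke(self, messages: list[Message]) -> Message: ...


class EchoModel:
    """Dummy model so the sketch runs without Ollama or an API key."""
    def __init__(self) -> None:
        self.calls = 0

    def invoke(self, messages: list[Message]) -> Message:
        self.calls += 1
        return Message(content=f"echo: {messages[-1].content}")


class CurriculumAgent:
    """Hypothetical agent that receives model objects, not a model_name string."""
    def __init__(self, llm: ChatLike, qa_llm: ChatLike) -> None:
        self.llm = llm        # constructed once, reused for every call
        self.qa_llm = qa_llm  # may be the same object as llm

    def propose_task(self, observation: str) -> str:
        # The explicit .invoke(...) call replaces calling the model directly.
        return self.llm.invoke([Message(content=observation)]).content


shared = EchoModel()  # assumption: a real run would pass a LangChain chat model
agent = CurriculumAgent(llm=shared, qa_llm=shared)
reply = agent.propose_task("Inventory: empty")
print(reply)
```

Passing one shared object to every agent is what avoids reloading the model for each component.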

The preliminary results on llama3 are shown below. I have also tested the codeqwen and wizardlm2 7b models, and both seem worse than llama3, as expected. codeqwen writes good code, but it fails in seemingly the same way I describe llama3-8b failing below.

Please provide your python, nodejs, Minecraft, and Fabric versions here

python 3.10, node 20.11.0, minecraft 1.20.4, fabric 0.15.9

[If applicable] Please provide the Mineflayer and Minecraft logs, you can find the log under the logs folder

Too many to list, so I will instead note that llama3-8b seemingly fails where llama3-70b succeeds, specifically in spatial common-sense reasoning about task composition. What I mean is that llama3-8b seems to lack a sense of the order in which things need to be done to accomplish most multi-action steps. Its ability to compose skills seems prohibitively weak: after >30 iterations llama3-8b had failed to make a pickaxe or a crafting table, at least as far as I could see in its inventory from the conversation logs. A better illustration is the comparison of completed tasks to failed tasks:

completed tasks: ["Mine 1 wood log"]

failed tasks: ["Mine 1-2 dirt blocks", "Mine 3-4 gravel blocks", "Mine 1 diorite block", "Mine 2-3 copper ore blocks", "Mine 1-2 diorite blocks", "Mine 1-2 stone blocks", "Mine 1-2 stone blocks", "Mine 1-2 copper ore blocks", "Mine 1-2 diorite blocks", "Mine 1-2 diorite blocks", "Mine 1-2 diorite blocks"]

I think the decision space it tries to progress through is too constrained to high-level objectives, and it fails at compositionality, i.e. using smaller tasks to build up a skill library. Without looking at the specific prompts used, it's hard to tell whether this is something the Voyager creators overlooked because gpt4 could simply do it, or whether prompt engineering aimed specifically at composition could improve llama3-8b's chances.

For comparison, here are llama3-70b's tasks after 20 iterations:
completed tasks: ["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs"]

failed tasks: ["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe"]

So llama3-70b is still continuously failing to craft a wooden pickaxe. I'm not sure that will be the case forever, so I'll keep running. The point is that llama3-70b fails while trying to make a pickaxe, whereas llama3-8b fails to even try to make one. It's a very subtle difference, but I think it highlights the gap between the models in task compositionality well.
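To make that comparison concrete, here is a small sketch that tallies the failed tasks and checks whether a run ever attempted a pickaxe. The task lists are copied from the two runs quoted above; `normalize`, `failure_counts`, and `attempted` are illustrative helper names.

```python
from collections import Counter

# Task lists copied from the runs quoted above.
runs = {
    "llama3-8b": {
        "completed": ["Mine 1 wood log"],
        "failed": ["Mine 1-2 dirt blocks", "Mine 3-4 gravel blocks",
                   "Mine 1 diorite block", "Mine 2-3 copper ore blocks",
                   "Mine 1-2 diorite blocks", "Mine 1-2 stone blocks",
                   "Mine 1-2 stone blocks", "Mine 1-2 copper ore blocks",
                   "Mine 1-2 diorite blocks", "Mine 1-2 diorite blocks",
                   "Mine 1-2 diorite blocks"],
    },
    "llama3-70b": {
        "completed": ["Mine 1 wood log", "Craft a crafting_table",
                      "Craft 4 wooden planks", "Obtain 3 oak logs"],
        "failed": ["Craft a wooden pickaxe", "Craft a wooden_pickaxe",
                   "Craft a wooden pickaxe", "Craft a wooden pickaxe"],
    },
}


def normalize(task: str) -> str:
    """Lower-case and treat 'wooden_pickaxe' and 'wooden pickaxe' as one task."""
    return task.lower().replace("_", " ")


def failure_counts(failed: list[str]) -> Counter:
    """Count how often each normalized task appears in the failed list."""
    return Counter(normalize(t) for t in failed)


def attempted(keyword: str, run: dict) -> bool:
    """True if any completed or failed task mentions the keyword."""
    return any(keyword in normalize(t) for t in run["completed"] + run["failed"])


counts_70b = failure_counts(runs["llama3-70b"]["failed"])
tried_8b = attempted("pickaxe", runs["llama3-8b"])
tried_70b = attempted("pickaxe", runs["llama3-70b"])
print(counts_70b, tried_8b, tried_70b)
```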

[If applicable] Please provide the GPT conversations that are printed each round.

Instead, I will note that llama3-8b and seemingly llama3-70b both fail in biomes where oak logs are not immediately available. Specifically, despite there being, say, spruce or jungle logs in nearby blocks, they will still only write code for oak logs.
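One way to avoid hard-coding oak is to pick whichever log type is actually nearby. This is only a Python sketch of the idea (Voyager's generated skills are JavaScript against Mineflayer, so it would need translating), and `pick_log` and the `nearby_blocks` list are hypothetical names, not part of Voyager.

```python
# Overworld log block names in Minecraft 1.20.
LOG_TYPES = ("oak_log", "spruce_log", "birch_log", "jungle_log",
             "acacia_log", "dark_oak_log", "mangrove_log", "cherry_log")


def pick_log(nearby_blocks):
    """Return the first known log type found in nearby_blocks, or None."""
    nearby = set(nearby_blocks)
    for name in LOG_TYPES:
        if name in nearby:
            return name
    return None


# Hypothetical observation from a biome without oak trees.
choice = pick_log(["grass_block", "spruce_log", "dirt", "jungle_log"])
print(choice)
```

The preference order is simply the order of `LOG_TYPES`; any log yields planks, so the choice rarely matters for early crafting tasks.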

The above completed and failed tasks are from relatively simple starting biomes such as forests.

I'll try to update if anything spectacular happens before/around the 100-iteration point and, if Groq will let me, up to the 500- and 1000-iteration points on llama3-70b.

This issue is simply meant to save time for anyone thinking of doing something similar.

update ~40 iterations:
completed tasks: ["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs", "Craft a wooden shovel"]

failed tasks: ["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden door", "Craft a stone pickaxe", "Craft a stone shovel", "Craft a stone shovel", "Craft a wooden sword", "Mine 3 stone blocks"]

@TimeLordRaps
Author

It seems to fail at crafting a wooden pickaxe because of this context, whose answer describes an incorrect recipe:
Context: Question: How to craft a wooden pickaxe in Minecraft?
Answer: To craft a wooden pickaxe in Minecraft, you need to arrange the following items in a crafting table: 3 wooden planks in a diagonal line from top-left to bottom-right, and 2 sticks in the middle row, one in the second column and one in the fourth column. This will give you a wooden pickaxe.
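For reference, the answer above does not match the actual recipe: a wooden pickaxe takes 3 planks across the top row and 2 sticks below them in the middle column of the 3x3 grid (there is no fourth column). A small sketch that encodes that layout and counts the ingredients; the `P`/`S` letters and the `ingredients` helper are just illustrative.

```python
from collections import Counter

# 3x3 crafting grid: P = any wooden plank, S = stick, " " = empty slot.
WOODEN_PICKAXE = [
    ["P", "P", "P"],
    [" ", "S", " "],
    [" ", "S", " "],
]


def ingredients(grid):
    """Count the non-empty cells of a crafting grid by item letter."""
    return Counter(cell for row in grid for cell in row if cell != " ")


needed = ingredients(WOODEN_PICKAXE)
print(needed)
```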

@TimeLordRaps
Author

TimeLordRaps commented Apr 21, 2024

Eventually figured out the wooden pickaxe. I'm going to try to run as many iterations as I can until I keep getting hit with rate limits.

Iteration 108:
Completed tasks:["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs", "Craft a wooden shovel", "Obtain 3 birch logs", "Mine 3 dirt blocks", "Mine 1 coal ore", "Mine 1 copper ore", "Mine 3 cobblestone blocks", "Obtain 2 oak logs", "Mine 3 stone blocks", "Kill 1 salmon"]

Failed Tasks: ["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden door", "Craft a stone pickaxe", "Craft a stone shovel", "Craft a stone shovel", "Craft a wooden sword", "Obtain 3 cobblestone blocks", "Craft a wooden pickaxe", "Mine 1 stone block", "Smelt 4 copper ore", "Smelt 4 copper ore", "Smelt 1 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Obtain a jungle_log", "Smelt 4 copper ore", "Craft a stone pickaxe"]

@TimeLordRaps
Author

iteration 272:
Completed tasks:
["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs", "Craft a wooden shovel", "Obtain 3 birch logs", "Mine 3 dirt blocks", "Mine 1 coal ore", "Mine 1 copper ore", "Mine 3 cobblestone blocks", "Obtain 2 oak logs", "Mine 3 stone blocks", "Kill 1 salmon", "Equip a stone pickaxe", "Mine 3 coal ore", "Obtain 2 birch logs", "Mine 3 birch logs", "Place a chest", "Craft 4 birch planks", "Mine 5 dirt blocks"]

Failed tasks:
["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden door", "Craft a stone pickaxe", "Craft a stone shovel", "Craft a stone shovel", "Craft a wooden sword", "Obtain 3 cobblestone blocks", "Craft a wooden pickaxe", "Mine 1 stone block", "Smelt 4 copper ore", "Smelt 4 copper ore", "Smelt 1 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Obtain a jungle_log", "Smelt 4 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Craft a stone pickaxe", "Kill 1 squid", "Smelt 4 copper ore", "Craft a stone pickaxe", "Craft a stone sword", "Craft a stone pickaxe", "Smelt 1 copper ore", "Smelt 4 copper ore", "Smelt 3 copper ore", "Cook 1 salmon", "Mine 1 iron ore", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a stone axe", "Craft a wooden pickaxe", "Smelt 4 copper ore", "Craft a wooden pickaxe", "Obtain a jungle log", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 1 birch log", "Kill 1 skeleton", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack"]

@TimeLordRaps
Author

Around 320 iterations it continuously runs into rate limits:

Completed tasks: ["Mine 1 wood log", "Craft a crafting_table", "Craft 4 wooden planks", "Obtain 3 oak logs", "Craft a wooden shovel", "Obtain 3 birch logs", "Mine 3 dirt blocks", "Mine 1 coal ore", "Mine 1 copper ore", "Mine 3 cobblestone blocks", "Obtain 2 oak logs", "Mine 3 stone blocks", "Kill 1 salmon", "Equip a stone pickaxe", "Mine 3 coal ore", "Obtain 2 birch logs", "Mine 3 birch logs", "Place a chest", "Craft 4 birch planks", "Mine 5 dirt blocks", "Mine 1 cobblestone block", "Craft 1 Birch Fence"]

Failed tasks:

["Craft a wooden pickaxe", "Craft a wooden_pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a wooden door", "Craft a stone pickaxe", "Craft a stone shovel", "Craft a stone shovel", "Craft a wooden sword", "Obtain 3 cobblestone blocks", "Craft a wooden pickaxe", "Mine 1 stone block", "Smelt 4 copper ore", "Smelt 4 copper ore", "Smelt 1 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Obtain a jungle_log", "Smelt 4 copper ore", "Craft a stone pickaxe", "Craft a stone pickaxe", "Craft a stone pickaxe", "Kill 1 squid", "Smelt 4 copper ore", "Craft a stone pickaxe", "Craft a stone sword", "Craft a stone pickaxe", "Smelt 1 copper ore", "Smelt 4 copper ore", "Smelt 3 copper ore", "Cook 1 salmon", "Mine 1 iron ore", "Craft a wooden pickaxe", "Craft a wooden pickaxe", "Craft a stone axe", "Craft a wooden pickaxe", "Smelt 4 copper ore", "Craft a wooden pickaxe", "Obtain a jungle log", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 1 birch log", "Kill 1 skeleton", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 5 Netherrack", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Mine 5 Netherrack", "Mine 5 Netherrack", "Mine 1 Netherrack block", "Mine 5 Netherrack blocks", "Mine 1 Netherrack block", "Craft a Birch Fence", "Craft a stone pickaxe", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Craft a wooden axe", "Craft a Birch Door", "Craft a stone pickaxe", "Craft a Birch Door", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Mine 5 gravel blocks", "Mine 1 Netherrack block", "Craft a stone pickaxe", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Craft a stone axe", "Open the chest at (24, 72, -5)", "Mine 5 cobblestone blocks", "Mine 5 cobblestone blocks", "Open the chest at (24, 72, -5)", "Mine 5 Netherrack blocks", "Mine 5 Netherrack blocks", "Mine 1 iron ore", "Open the chest at (24, 72, -5)", "Mine 5 Netherrack blocks", "Open the chest at (24, 72, -5)", "Open the chest at (24, 72, -5)", "Craft a stone pickaxe", "Mine 5 Netherrack blocks", "Open the chest at (24, 72, -5)", "Craft a stone pickaxe"]
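For the rate limits mentioned above, a common workaround is to retry with exponential backoff. A minimal sketch, assuming the provider's client raises some exception when the limit is hit: `RateLimitError` below is a locally defined placeholder rather than a real library class, and `call` stands for whatever function wraps the LLM request.

```python
import time


class RateLimitError(Exception):
    """Placeholder for whatever exception the real client raises."""


def with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on RateLimitError wait base_delay * 2**attempt and retry.

    Makes at most max_retries + 1 attempts, then re-raises the last error.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo: a fake request that is rate limited twice before succeeding.
state = {"calls": 0}


def flaky_request():
    state["calls"] += 1
    if state["calls"] <= 2:
        raise RateLimitError("slow down")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = with_backoff(flaky_request, sleep=delays.append)
print(result, delays)
```

The `sleep` parameter is injected only so the demo finishes instantly; a real run would leave the default `time.sleep`.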

I'm currently fine-tuning llama-3, then phi-3, on all the code and descriptions produced by gpt-4 in the longest Voyager checkpoint (255 iterations) and in the three trials, just to see whether either can perform better with fine-tuning.
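A rough sketch of how such a dataset could be assembled, assuming each skill is available as a description plus its code. The prompt/completion JSONL layout, the `to_jsonl` helper, and the single example skill are illustrative placeholders, not the actual checkpoint contents or the format used for the fine-tuning run.

```python
import json


def to_jsonl(skills):
    """Turn {name: {"description", "code"}} into prompt/completion JSONL."""
    lines = []
    for name in sorted(skills):  # sorted for a reproducible file
        entry = skills[name]
        lines.append(json.dumps({
            "prompt": entry["description"],
            "completion": entry["code"],
        }))
    return "\n".join(lines)


# Placeholder skill; real entries would be read from the checkpoint.
skills = {
    "mineWoodLog": {
        "description": "Mine one wood log of any type.",
        "code": "async function mineWoodLog(bot) { /* body elided */ }",
    },
}

dataset = to_jsonl(skills)
print(dataset)
```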


This issue is stale because it has been open for 30 days with no activity.
