How to model a grid env well when it changes frequently? #224

Open
valkryhx opened this issue May 13, 2024 · 8 comments
Labels
discussion Discussion of a typical issue or concept environment New or improved environment

Comments

@valkryhx

Suppose there is a game on a 10 by 10 grid. Each position holds a piece of gold with a random positive value, and an agent does the mining job on this grid. The rule is: when the agent digs at position pos(i, j), it gets the value of the gold at that position, v(i, j), and the gold on the row and column passing through pos(i, j) is destroyed, meaning all gold values on row(i) and col(j) become 0. So each time the agent digs one position, it gets one value, and the corresponding row and column are zeroed out. The agent can FLASH (no need to move step by step; it can teleport) to any position on the grid.

Now we want to train the agent to collect as much value as possible and to avoid stepping on already-dug positions (because a dug position is now worth 0), across different 10 by 10 grids. How should I model the observation of the grid?
Should I pass a single frame of the current grid env, or the last few frames from previous steps? It seems to be a frequently changing env, like Go or the 2048 game. Do you have any advice on modeling an env for this kind of game?

@puyuan1996 puyuan1996 added the discussion Discussion of a typical issue or concept label May 15, 2024
@puyuan1996
Collaborator

To effectively model this game environment and train an agent to maximize the value of gold nuggets obtained from a 10x10 grid, you can take the following steps:

1. Environment Modeling

First, you need to build a simulation environment that reflects the changes in the grid state after each of the agent's operations. Based on your description, this environment conforms to the Markov Decision Process (MDP) model commonly used in reinforcement learning. Specifically, you can modify this base environment; you can refer to the guide on customizing environments. A minimal environment sketch is given after the list below.

State Representation:

  • Grid State: A 10x10 matrix, where each element represents the gold nugget value at that position.
  • Visited Marking: Optionally, a matrix of the same size, marking the positions that have been mined.

Actions:

  • The agent can choose to mine any grid position that has not yet been mined. Each action can be represented as a coordinate (i, j) in the grid.

Rewards:

  • When the agent chooses to mine at position (i, j), the reward is the gold nugget value v(i, j) at that position. Subsequently, all values in that row and column will be set to 0.

Transitions:

  • After choosing a position, update the grid to reflect the state of the destroyed gold nuggets in that row and column.
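For concreteness, here is a minimal sketch of such an environment with a Gymnasium-style interface. The class name GoldGridEnv, the value distribution, and the termination condition (all gold destroyed or a step budget exhausted) are illustrative assumptions; a real LightZero environment would additionally need to implement the BaseEnv interface described in the customization guide linked above.

```python
import numpy as np


class GoldGridEnv:
    """Illustrative sketch of the gold-mining grid MDP (hypothetical, not a LightZero BaseEnv)."""

    def __init__(self, size=10, max_steps=None, seed=None):
        self.size = size
        # Assumption: the episode ends when no gold is left or a step budget is exhausted.
        self.max_steps = max_steps if max_steps is not None else size
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Each cell holds a random positive gold value (here drawn uniformly from (0, 1)).
        self.grid = self.rng.random((self.size, self.size)).astype(np.float32)
        self.steps = 0
        return self.grid.copy()

    def legal_actions(self):
        # Flat action a maps to cell (a // size, a % size); only cells that still hold gold are legal.
        return [a for a in range(self.size * self.size)
                if self.grid[a // self.size, a % self.size] > 0]

    def step(self, action):
        i, j = divmod(action, self.size)
        reward = float(self.grid[i, j])
        # Digging (i, j) collects its value and destroys all gold in row i and column j.
        self.grid[i, :] = 0.0
        self.grid[:, j] = 0.0
        self.steps += 1
        done = (not np.any(self.grid > 0)) or (self.steps >= self.max_steps)
        return self.grid.copy(), reward, done, {}
```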

2. Observation Modeling

For observations, usually providing the current complete grid state is sufficient, as it contains all the information needed for the next decision. This is similar to many classic board games like Go or Chess, where the agent needs to consider the current global state to make decisions.

  • Single Frame: Provide only the current grid state at each step.
  • Historical Information: In scenarios where you need to evaluate the impact of previous decisions on the current state, you could provide the states of the last few steps as additional information to the agent. This can be achieved by stacking the grid states of several recent time steps. (However, this is unnecessary in a setting that satisfies the MDP assumption.) A frame-stacking sketch is given after this list.
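For illustration, stacking the last few grid frames into a single observation could be done as in the snippet below; the number of frames (3) and the channel-first layout are arbitrary choices for this sketch, not requirements.

```python
from collections import deque

import numpy as np


def make_frame_stacker(n_frames=3):
    """Return (reset_fn, push_fn) that maintain a rolling stack of the last n_frames grids."""
    frames = deque(maxlen=n_frames)

    def reset_stack(first_grid):
        frames.clear()
        # Pad with copies of the first frame so the observation shape is fixed from step 0.
        for _ in range(n_frames):
            frames.append(first_grid.copy())
        return np.stack(frames, axis=0)  # shape: (n_frames, H, W)

    def push(next_grid):
        frames.append(next_grid.copy())
        return np.stack(frames, axis=0)

    return reset_stack, push
```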

3. Training Methods

For training the agent, you could start with MuZero. This configuration guide shows how to run the algorithm on a custom environment.
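For orientation only, a rough sketch of what a LightZero-style MuZero config for this environment might look like is shown below. All key names and values are assumptions that should be checked against the configuration guide linked above, and the env id gold_grid_lightzero is purely hypothetical; the companion create_config and training entry point are omitted here.

```python
from easydict import EasyDict

# Hypothetical sketch: the key names mirror common LightZero MuZero configs, but they
# must be verified against the configuration guide for the version you are using.
main_config = EasyDict(dict(
    env=dict(
        env_id='gold_grid_lightzero',  # hypothetical registered env name
        collector_env_num=8,
        evaluator_env_num=3,
        n_evaluator_episode=3,
    ),
    policy=dict(
        model=dict(
            observation_shape=(1, 10, 10),  # a single-frame 10x10 grid observation
            action_space_size=100,          # one action per cell
        ),
        num_simulations=50,     # MCTS simulations per move
        num_unroll_steps=5,
        td_steps=5,
        batch_size=256,
        game_segment_length=5,  # episodes here are only 5 steps long
    ),
))
```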

4. Performance Evaluation

It's important to continually evaluate the agent's performance during development. This can be done by calculating the average score of the agent across multiple independent test environments. Also, monitor any potential issues in the agent's decision-making process, such as an excessive focus on short-term benefits at the expense of long-term strategy. You can refer to this log documentation.
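As a simple illustration of this evaluation idea, the average episode return over several independently seeded test grids can be computed with a loop like the one below; it reuses the hypothetical GoldGridEnv sketch from above, and select_action is a placeholder standing in for the trained agent.

```python
def evaluate(select_action, n_episodes=20, size=10):
    """Average total gold collected by select_action over fresh random grids."""
    returns = []
    for ep in range(n_episodes):
        env = GoldGridEnv(size=size, seed=ep)  # an independent test environment per episode
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = select_action(obs, env.legal_actions())
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)
```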

Summary

It's recommended to start with a simple model, using only the current grid state as the observation, and then gradually increase complexity, such as introducing historical states or improving learning algorithms, to optimize the agent's performance. Through continuous iteration and testing, you can find the most suitable approach for this specific problem.

@puyuan1996 puyuan1996 added the environment New or improved environment label May 15, 2024
@valkryhx
Author

Thank you for your detailed answer.
I have implemented the code above, but I found the agent's performance is weak.
I use a heuristic algorithm as the baseline. This policy is naive and greedy: always choose the position with the maximum value on the current grid. So this policy does not take into account the long-term effect of the current choice, and the greedy action can indeed hurt the potential to reach the maximum total score.
Even though it is a naive policy, it performs better than the agent trained with the MuZero algorithm.
When I initialize the 10x10 grid with random values in (0, 1), the greedy agent gets a total score of around 4+, but the MuZero/EfficientZero agent only gets 2 to 3 in total, or even less. Each agent finishes the game in exactly 5 steps, and the episode is then done.
The observation I use is the last 3 frames (including the current state) of the grid. The number of training steps is about 4*10^4.
I don't know how to improve the performance of the agent.
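For illustration, a greedy baseline of this kind can be sketched as follows; it plugs into the hypothetical GoldGridEnv/evaluate sketches above and is only meant to mirror the baseline described here, not the exact code used.

```python
def greedy_select_action(obs, legal_actions):
    """Pick the not-yet-destroyed cell with the highest gold value on the current grid."""
    flat = obs.reshape(-1)
    # Restrict the argmax to legal (still positive) cells.
    return max(legal_actions, key=lambda a: flat[a])

# Example usage with the evaluation sketch above:
# mean_return = evaluate(greedy_select_action, n_episodes=100)
```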

@puyuan1996
Collaborator

Hello! According to your description, first please confirm: is the episode length fixed at 5 steps? If MuZero is performing poorly and there's no issue with the environment part you've written, one possible reason could be that the configuration settings are not suitable. Since the episodes in your environment are very short, many hyperparameters may need corresponding adjustments. Could you please provide your complete configuration file and training loss records? You might want to refer to and modify the configuration file in this link.

@valkryhx
Author

This is the config. When the grid size is fixed at 10, I set the number of action steps to always be 5, because 5 properly chosen positions (that do not conflict with or repeat already-dug positions) are enough to complete the game.

@puyuan1996
Collaborator

Hello, I recommend starting by adjusting MuZero rather than EfficientZero, as the latter adds complexity, particularly in predicting the value prefix, which may not be advantageous in your environment. To better diagnose the issue, it would be helpful if you could provide specific log records on TensorBoard. Currently, your action space is set at 100, which could lead to insufficient exploration and potential convergence to local optima. Furthermore, the length of each episode is very short, only 5, with default settings of num_unroll_steps=5 and td_steps=5. You might need to debug to ensure that these data boundaries are being handled appropriately. Our code might not have been adequately tested with such short episodes previously. Thank you.

@valkryhx
Author

Thank you for your reply!
I have some questions that I do not know how to solve:
1. The grid size is 10 and 100 positions can be chosen, so I simply use an action space of size 100. But just as you said, this makes it more difficult for the agent to learn the proper action for a given observation. How can I reduce the size of the action space, then? Unlike a basic agent with 4 left/right/up/down moves on a grid, my env is more like a Go board on which a stone is placed.
2. The length of each episode is very short, only 5, with the default settings num_unroll_steps=5 and td_steps=5. This is just the case for the 10 by 10 grid; when the grid size becomes bigger, the value 5 can be bigger as well. For now, when I set it to 5, is the value invalid?

@puyuan1996
Collaborator

Hello, the action space is fixed at 100, which is considerably smaller compared to the maximum action space of 19*19 in Go. Therefore, theoretically, MuZero should manage this scale effectively. If the learning performance is currently suboptimal, it could be attributed to the action space. This hypothesis can be verified by monitoring metrics such as loss and policy_entropy in TensorBoard. If MuZero is indeed encountering a local optimum, consider increasing the temperature parameter or employing the epsilon-greedy strategy for adjustments. As for the boundary condition when the episode length is set to 5, LightZero should handle it proficiently, but you still need to perform debugging and verification locally to confirm.
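For context on what the temperature controls: in MuZero-style agents, the action-selection policy is derived from the MCTS root visit counts as pi(a) ∝ N(a)^(1/T), so a higher temperature T flattens the distribution and encourages exploration, while a lower T is closer to greedy. The small illustrative computation below is not tied to any particular LightZero API; where exactly the temperature is configured should be checked in the policy config and documentation.

```python
import numpy as np


def visit_count_policy(visit_counts, temperature=1.0):
    """pi(a) proportional to N(a)^(1/T): higher T gives a flatter, more exploratory policy."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    scaled = counts ** (1.0 / temperature)
    return scaled / scaled.sum()

# With visit counts [50, 30, 20]:
# T = 0.25 gives a near-greedy distribution, while T = 2.0 is much flatter.
print(visit_count_policy([50, 30, 20], temperature=0.25))
print(visit_count_policy([50, 30, 20], temperature=2.0))
```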

@valkryhx
Author

Thanks for your help! Where can I set the temperature?
