How to model a grid env well when it changes frequently? #224
To effectively model this game environment and train an agent to maximize the total value of gold collected from a 10x10 grid, you can take the following steps:

1. Environment Modeling

First, build a simulation environment that reflects how the grid state changes after each of the agent's actions. Based on your description, this environment fits the Markov Decision Process (MDP) model commonly used in reinforcement learning. Specifically, you can modify this base environment; you can also refer to customize environments.

State Representation: the current 10x10 grid of gold values (already-boomed rows and columns hold 0).
Actions: a discrete choice among the 100 grid positions to dig.
Rewards: the value v(i,j) of the gold at the dug position.
Transitions: after digging pos(i,j), all values on row(i) and col(j) become 0.

A minimal sketch of such an environment is given after this list.
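For concreteness, here is a minimal sketch of this MDP as a standalone Python class. The gym-style `reset`/`step` API, the class name, and the value range are assumptions for illustration only; LightZero's actual `BaseEnv` interface differs, so adapt accordingly.

```python
import numpy as np

class GoldGridEnv:
    """Sketch of the gold-mining MDP described above (not LightZero's BaseEnv API)."""

    def __init__(self, size=10, seed=None):
        self.size = size
        self.rng = np.random.default_rng(seed)
        self.grid = None

    def reset(self):
        # Every cell starts with a random positive gold value.
        self.grid = self.rng.uniform(0.1, 1.0, size=(self.size, self.size))
        return self.grid.copy()

    def step(self, action):
        # The action is a flat index in [0, size*size); decode it to (row, col).
        i, j = divmod(int(action), self.size)
        reward = float(self.grid[i, j])
        # Digging (i, j) booms row i and column j: all their gold becomes 0.
        self.grid[i, :] = 0.0
        self.grid[:, j] = 0.0
        # Termination (e.g., a fixed 5-step horizon) is left to the caller in this sketch.
        done = False
        return self.grid.copy(), reward, done, {}
```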
2. Observation Modeling

For observations, providing the current complete grid state is usually sufficient, as it contains all the information needed for the next decision. This is similar to many classic board games like Go or Chess, where the agent needs to consider the current global state to make decisions. One common way to package such an observation is sketched below.
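As an illustration, this helper packages the grid into the dict format with `observation`, `action_mask`, and `to_play` keys that LightZero environments commonly use; verify the exact convention against your LightZero version. The action mask also addresses the "avoid dug positions" requirement by marking zero-valued cells illegal.

```python
import numpy as np

def make_obs(grid: np.ndarray) -> dict:
    # Mask out already-dug cells so the agent cannot pick zero-valued positions.
    action_mask = (grid.reshape(-1) > 0).astype(np.int8)
    return {
        'observation': grid[None, ...].astype(np.float32),  # add a channel dim: (1, 10, 10)
        'action_mask': action_mask,                         # 100 entries, 1 = legal dig
        'to_play': -1,                                      # single-player convention
    }
```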
3. Training Methods

For training the agent, you could start with MuZero. This configuration guide shows how to run the algorithm on a custom environment.

4. Performance Evaluation

It's important to continually evaluate the agent's performance during development. This can be done by calculating the average score of the agent across multiple independent test environments; a simple evaluation loop is sketched below. Also monitor for potential issues in the agent's decision-making, such as an excessive focus on short-term gains at the expense of long-term strategy. You can refer to this log documentation.

Summary

It's recommended to start with a simple model, using only the current grid state as the observation, and then gradually increase complexity, such as introducing historical states or improving the learning algorithm, to optimize the agent's performance. Through continuous iteration and testing, you can find the approach best suited to this specific problem.
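A minimal evaluation sketch, assuming the illustrative `GoldGridEnv` class from the earlier sketch (it is not part of LightZero). The random baseline gives a floor score to compare the trained agent against.

```python
import numpy as np

def evaluate(policy_fn, n_episodes=100, horizon=5, seed=0):
    """Average episode return of `policy_fn` over independent random grids."""
    returns = []
    for ep in range(n_episodes):
        env = GoldGridEnv(seed=seed + ep)  # the sketch class from step 1
        grid = env.reset()
        total = 0.0
        for _ in range(horizon):
            action = policy_fn(grid)
            grid, reward, _, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# A random baseline that only digs positions that still hold gold.
def random_policy(grid, rng=np.random.default_rng(0)):
    legal = np.flatnonzero(grid.reshape(-1) > 0)
    return rng.choice(legal)

print(evaluate(random_policy))
```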
Thank you for your detailed answer.
Hello! According to your description, please first confirm: is the episode length fixed at 5 steps? If MuZero is performing poorly and there is no issue with the environment code you've written, one possible reason is that the configuration settings are unsuitable. Since the episodes in your environment are very short, many hyperparameters may need corresponding adjustments. Could you please provide your complete configuration file and training-loss records? You might want to refer to and modify the configuration file in this link.
This is the config. When the grid size is fixed at 10, I set the number of action steps to always be 5, because 5 properly chosen positions (not conflicting with or repeating an already-dug position) are enough to complete the game.
Hello, I recommend starting by tuning MuZero rather than EfficientZero; the latter adds complexity, particularly in predicting the value prefix, which may not be advantageous in your environment. To better diagnose the issue, it would help if you could provide the specific log records from TensorBoard. Currently, your action space is set at 100, which could lead to insufficient exploration and convergence to a local optimum. Furthermore, each episode is very short, only 5 steps, while the default settings are num_unroll_steps=5 and td_steps=5; you may need to debug to ensure these data boundaries are handled appropriately, as our code may not have been adequately tested with such short episodes before. A sketch of the kind of adjustments meant here follows. Thank you.
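For illustration, a hedged sketch of such adjustments. The key names (`num_unroll_steps`, `td_steps`, `game_segment_length`, `discount_factor`) follow LightZero-style MuZero configs, but the values and the presence of each key should be verified against your installed version.

```python
# Hypothetical config adjustments for 5-step episodes; check key names
# against your LightZero version before using.
policy = dict(
    num_unroll_steps=5,     # do not unroll past the episode length
    td_steps=5,             # with 5-step episodes this bootstraps from the terminal state
    game_segment_length=5,  # assumption: align the segment length with the episode length
    discount_factor=1.0,    # assumption: short finite-horizon task, so no discounting
)
```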
Thank you for your reply! |
Hello, the action space is fixed at 100, which is considerably smaller than the maximum action space of 19*19 in Go. Therefore, theoretically, MuZero should manage this scale effectively, and suboptimal learning performance is unlikely to be caused by the action space itself. Whether MuZero is stuck in a local optimum can be checked by monitoring metrics such as loss and policy_entropy in TensorBoard. If it is indeed stuck, consider increasing the temperature parameter or employing an epsilon-greedy strategy for exploration; an illustrative sketch of these settings follows. As for the boundary condition when the episode length is set to 5, LightZero should handle it properly, but you still need to debug and verify locally to confirm.
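A hedged sketch of the exploration knobs mentioned above. The key names (`manual_temperature_decay`, `fixed_temperature_value`, and the `eps` dict with `eps_greedy_exploration_in_collect`) are taken from LightZero-style MuZero policy configs; treat them as assumptions and verify against your installed version.

```python
# Hypothetical exploration settings; verify every key against your
# LightZero version before relying on it.
policy = dict(
    # Visit-count temperature: higher values flatten the MCTS action
    # distribution during collection, encouraging exploration.
    manual_temperature_decay=False,
    fixed_temperature_value=1.0,  # assumption: raised from a smaller default
    # Epsilon-greedy exploration during data collection.
    eps=dict(
        eps_greedy_exploration_in_collect=True,
        type='linear',
        start=1.0,
        end=0.05,
        decay=int(1e4),
    ),
)
```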
Thanks for your help! Where can I set the temperature? |
Suppose there is a game on a 10 by 10 grid. Each position holds a piece of gold with a random positive value, and an agent does the mining on this grid. The rule is: when the agent digs a position pos(i,j), it gets the value of the gold at that position, v(i,j), and the gold on the row and column through pos(i,j) is boomed, meaning all gold values on row(i) and col(j) become 0. So each time the agent digs one position, it gains one value, and the corresponding row and column are zeroed out (a short demo of this rule appears after this question). The agent can FLASH (teleport, with no need to move step by step) to any position on the grid.
Now we want to train the agent to collect as much value as possible, and to avoid stepping on already-dug positions (because a dug position is now worth 0), across different 10 by 10 grids. How should the observation of the grid be modeled?
Should I pass a single frame of the current grid environment, or the last few frames? It seems to be a frequently changing environment, like Go or the 2048 game. Do you have any advice on modeling this kind of environment?
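For reference, a tiny pure-numpy demo of the dig rule described in the question above; the cell values and the chosen position are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = rng.integers(1, 10, size=(10, 10)).astype(float)  # random positive gold values

i, j = 3, 7           # the agent FLASHes to pos(3, 7)
gained = grid[i, j]   # it collects v(3, 7)
grid[i, :] = 0.0      # row 3 is boomed: all its gold becomes 0
grid[:, j] = 0.0      # column 7 is boomed: all its gold becomes 0

print(f"gained {gained}; row {i} and column {j} are now zero")
```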