
If AlphaZero plays against a TicTacToe beginner, how does it ensure it never loses? (a convergence question) #252

Open
zhengrongpeter opened this issue Sep 17, 2021 · 4 comments

Comments

@zhengrongpeter

zhengrongpeter commented Sep 17, 2021

What I mean is that after a few iterations of self-play and training, I think MCTS will converge and always place the first piece in the center of the 3×3 board (starting in the center gives the highest probability of winning). From then on, the games generated by self-play will always start from the center.
Since the early games are removed from the replay buffer (due to its maximum size limit), the dataset used for training will eventually contain only expert-level games. Because the neural network may forget the early training patterns once that data is gone, I think the trained agent will forget how to deal with a beginner who does not put the first piece in the center.
If I place the first piece in a corner and AlphaZero is the second player, is it guaranteed to never lose?
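
To make the concern concrete, here is a minimal sketch of a fixed-size buffer in which old self-play games eventually drop out (the names are hypothetical, not identifiers from this repo):

```python
from collections import deque

# Hypothetical fixed-size buffer of self-play games: the oldest games are
# dropped automatically once the maximum length is reached.
MAX_BUFFER_GAMES = 20

buffer = deque(maxlen=MAX_BUFFER_GAMES)

def selfplay_game(iteration):
    # Placeholder for the (board, policy, value) examples of one game.
    return f"game from iteration {iteration}"

for it in range(100):
    buffer.append(selfplay_game(it))

# After 100 iterations only the 20 most recent (near-optimal) games remain,
# so positions from early, beginner-like openings are no longer trained on.
print(list(buffer)[0])   # -> "game from iteration 80"
```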

@goshawk22
Contributor

The alpha-zero agent can learn simple games such as TicTacToe very quickly, to the point where it has solved them after only a few iterations. In the case of TicTacToe, I found that the agent learns that, against an optimal player, it can never win. This leads to the agent playing only moves that would result in a draw, and it doesn't seem to try to win.
One solution to this, in games where the first or second player is guaranteed a win with optimal play, such as Connect4, is to introduce a decay in the reward. This would mean that the longer the agent holds out without losing, the better it scores (and the faster it wins, the more it is rewarded).
Another solution, in the case of games guaranteed to end in a draw under optimal play, is to train the agent against a sub-optimal or even random player. However, this would be a divergence from the Alpha-Zero paper.
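
A rough sketch of that reward decay (an illustration only, not code from this repo): scale the terminal result by a per-move discount, so a quick win scores more than a slow one and a slow loss is penalised less than a quick one.

```python
def decayed_reward(result, num_moves, gamma=0.98):
    """result is +1 for a win, -1 for a loss, 0 for a draw (current player's view)."""
    # Each extra move shrinks the magnitude of the terminal reward.
    return result * (gamma ** num_moves)

print(decayed_reward(+1, 10))  # fast win:   ~0.82
print(decayed_reward(+1, 40))  # slow win:   ~0.45
print(decayed_reward(-1, 40))  # slow loss: ~-0.45, better than a fast loss (~-0.82)
```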

@zhengrongpeter
Author

zhengrongpeter commented Sep 18, 2021

Yeah, the predicted state value for the first player is really high, while the value for the second player is really low; that's enough to say the second player cannot win if the first player plays perfectly. AlphaZero might effectively decide the game is too one-sided and that there is nothing left to learn... Also, after many iterations, AlphaZero's TicTacToe self-play games will mostly be draws, so once the earlier training data is removed, the model might forget the positions that come up against beginners. I think it does matter when a beginner is the first player and AlphaZero is the second, since the second player has only trained against an expert who always opens with a certain sequence of moves... I think playing against a random player is a simple but valid solution. For Go or Chess, it probably won't be a problem.
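
A rough sketch of mixing games against a random opponent into the training data (the helper names here are hypothetical placeholders, not functions from this repo):

```python
import random

RANDOM_OPPONENT_FRACTION = 0.2  # e.g. 1 in 5 training games vs. a random player

def selfplay_episode():
    # Placeholder for the usual AlphaZero self-play episode.
    return "examples from a self-play game"

def vs_random_episode():
    # Placeholder for a game where one side plays uniformly random moves,
    # which keeps sub-optimal ("beginner") openings in the training data.
    return "examples from a game against a random player"

def generate_episode():
    if random.random() < RANDOM_OPPONENT_FRACTION:
        return vs_random_episode()
    return selfplay_episode()

print(generate_episode())
```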

@goshawk22
Contributor

Yes, maybe a good training method for TicTacToe would be to play fewer games per iteration (maybe around 20?), since there are far fewer possible states. I don't think it would take more than 20 iterations for the model to solve the game, even with this smaller amount of data, so the model would still be training on "beginner" data from the early iterations.
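
For example, something along these lines (the argument names below are illustrative and may not match this repo's exact training arguments):

```python
# Hypothetical TicTacToe training settings reflecting the suggestion above.
args = {
    'numIters': 20,           # the game should be solved well within ~20 iterations
    'numEps': 20,             # far fewer self-play games per iteration than the default
    'maxlenOfQueue': 200000,  # buffer large enough that early "beginner" games are kept
}
```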

@Bobingstern

How about adding a little bit of noise to the first few moves of each game to randomize the openings? That might lead to it knowing how to play all types of openings, not just the optimal ones.
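
For reference, the AlphaZero setup already does something similar: Dirichlet noise is added to the root priors, and early self-play moves are sampled in proportion to MCTS visit counts rather than played greedily. A minimal sketch of that temperature idea, assuming the root visit counts are available:

```python
import numpy as np

EXPLORATION_MOVES = 2  # randomise roughly the first couple of plies

def choose_move(visit_counts, move_number):
    counts = np.asarray(visit_counts, dtype=np.float64)
    if move_number < EXPLORATION_MOVES:
        # Sample proportionally to visit counts so openings vary between games.
        probs = counts / counts.sum()
        return int(np.random.choice(len(counts), p=probs))
    # After the opening, play the most-visited move as usual.
    return int(np.argmax(counts))

# Example: 9 TicTacToe squares, with the centre (index 4) visited most often.
print(choose_move([5, 3, 5, 3, 40, 3, 5, 3, 5], move_number=0))
```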
