AlphaZero+MCTS: Visit probabilities for invalid actions can be non-zero #5

Open
bubble-07 opened this issue Nov 19, 2023 · 3 comments

@bubble-07 commented:

Hey there - I've drafted an implementation of a custom environment for the (approximate) matrix semigroup reachability problem, which I've added to my fork here: https://github.com/bubble-07/turbozero. One thing that's currently puzzling me is that the environment consistently assigns non-zero visit probabilities to actions that are declared invalid. I've added a few debug prints and a sys.exit() to core/train/trainer.py in my fork to track this down, and I can reproduce the issue with:

python3 turbozero.py --verbose --gpu --mode=train --config=./example_configs/asmr_mini.yaml --logfile=./asmr_mini.log --debug

Output

torch.Size([6])
Populating Replay Memory...: 100%|██████████████████████████████| 4/4 [00:00<00:00,  8.11it/s]
Collecting self-play episodes...:  25%|██████▎                  | 1/4 [00:00<00:01,  2.03it/s]tensor([[-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
        [-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
        [-1.0662e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38, -1.0292e+00],
        [-7.0420e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -2.6667e-01]],
       device='cuda:0', grad_fn=<AddBackward0>)
tensor([[0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
        [0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
        [0.0100, 0.9800, 0.0000, 0.0000, 0.0100],
        [0.1700, 0.0000, 0.0000, 0.0000, 0.8300]], device='cuda:0')
tensor(inf, device='cuda:0', grad_fn=<MulBackward0>)

The first tensor printed is the neural net's policy logits (after masking out invalid entries), the second is the empirical visit probabilities, and the third is the policy cross-entropy loss for the iteration. While the neural net's policy is correctly masked, the empirical visit probabilities apparently aren't, judging from the second column. I tried to pick apart where this goes wrong in the MCTS routine, but unfortunately couldn't get far - do you have any handy ways to debug this?

lowrollr self-assigned this Nov 19, 2023
@lowrollr (Owner) commented Nov 19, 2023:

I was able to reproduce, then correct the behavior you are describing.

First, in 16d2f3f, you move the legal actions assignment to before env.step in mcts.py, meaning that legal actions are calculated using the previous state of the environment rather than the current state, which results in invalid actions being taken. You'd need to revert this change.
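To make the ordering problem concrete, here's a minimal self-contained sketch (`ToyEnv` and its methods are made up for illustration, not the actual turbozero API); the legality mask has to be read from the state produced by `env.step`, not from the parent state:

```python
import torch

class ToyEnv:
    """Stand-in environment used only to illustrate the ordering issue."""
    def __init__(self, num_actions: int = 5):
        self.used = torch.zeros(num_actions, dtype=torch.bool)

    def step(self, action: int) -> None:
        self.used[action] = True  # taking an action consumes it

    def get_legal_actions(self) -> torch.Tensor:
        return ~self.used  # an action is legal only if it hasn't been taken

env = ToyEnv()

# Ordering introduced in 16d2f3f: the mask is read from the parent state,
# so the action just taken still looks legal for the child node.
stale_mask = env.get_legal_actions()
env.step(0)

# Reverted (correct) ordering: step first, then read the mask.
fresh_mask = env.get_legal_actions()
print(stale_mask)  # tensor([True, True, True, True, True])
print(fresh_mask)  # tensor([False, True, True, True, True])
```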

This is not the only source of invalid actions, however. I also realized that my previously implemented environments all assume rewards/evaluations are positive. Your custom environment can assign negative rewards, which breaks some of the action-selection logic built on that assumption. It should be straightforward to allow negative rewards/evaluations - I'll link a commit to this issue that resolves this problem.
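To sketch why the positive-only assumption matters here (the tensors are made up, and this shows the failure mode rather than the exact selection code): if illegal actions are masked by multiplying scores with a 0/1 mask, a legal action's negative evaluation loses the argmax to an illegal action's zero.

```python
import torch

# Hypothetical per-action selection scores; action 1 is illegal.
scores = torch.tensor([-0.9, -0.3, -1.2, -0.6])
legal = torch.tensor([True, False, True, True])

# If scores were guaranteed non-negative, zeroing illegal entries would be
# enough. With negative evaluations, the illegal action's 0 wins the argmax.
masked_mul = scores * legal
print(masked_mul.argmax())  # tensor(1) -> illegal action chosen

# Filling illegal entries with -inf works regardless of the sign of scores.
masked_inf = scores.masked_fill(~legal, float('-inf'))
print(masked_inf.argmax())  # tensor(3) -> best legal action chosen
```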

Finally, I believe the inclusion of label smoothing in the cross-entropy loss causes the policy loss values to become extremely large: the policy logits corresponding to invalid actions are set to very large negative numbers rather than zero, so the smoothed target places a sliver of probability mass on each of them and an enormous loss accumulates across those logits. I will explore ways to support label smoothing as well, but the current implementation does not allow for it.
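Roughly, the mechanism looks like this (an illustrative call with made-up numbers in the spirit of the output above, not the repo's exact loss code):

```python
import torch
import torch.nn.functional as F

mask_value = torch.finfo(torch.float32).min  # ~ -3.4e38, used for invalid actions
logits = torch.tensor([[-0.89, mask_value, mask_value, mask_value, -0.60]])
visit_probs = torch.tensor([[0.55, 0.0, 0.0, 0.0, 0.45]])  # empirical targets

# Without smoothing, the masked actions carry zero target mass and contribute
# nothing, so the cross entropy stays finite.
print(F.cross_entropy(logits, visit_probs, label_smoothing=0.0))

# With smoothing, a sliver of target mass lands on each masked logit, whose
# log-probability is around -3.4e38, so the reported loss becomes
# astronomically large (and overflows to inf as such terms accumulate).
print(F.cross_entropy(logits, visit_probs, label_smoothing=0.1))
```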

When I removed label smoothing and addressed the other two issues above (and also lowered the learning rate in your provided config file), I saw reasonable loss values and did not detect any invalid actions.

On the topic of debugging, I use the VSCode debugger within a Jupyter notebook and set breakpoints; for this issue I set up breakpoints to detect when an illegal action was chosen.

I think ideally there should be some built-in assertions to detect this exact situation, since this is usually how such issues manifest. Will look into that more as well.
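One possible shape for such an assertion (a hypothetical helper, not existing turbozero code): after search, the empirical visit distribution should carry zero mass wherever the legality mask marks an action invalid.

```python
import torch

def assert_visits_legal(visit_probs: torch.Tensor, legal_mask: torch.Tensor) -> None:
    """Fail loudly if MCTS assigned any visits to an illegal action.

    visit_probs: (batch, num_actions) empirical visit distribution
    legal_mask:  (batch, num_actions) boolean mask, True where an action is legal
    """
    illegal_mass = visit_probs.masked_fill(legal_mask, 0.0)
    assert torch.all(illegal_mass == 0), (
        f"non-zero visit probability on illegal actions: {illegal_mass}"
    )
```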

Thank you for your patience and for pointing out this issue! Will have these problems resolved within the next day or two.

@lowrollr (Owner) commented:

87fd4d8 allows for negative rewards/evaluations.

I'll keep this open until I address label smoothing as well, and perhaps debug asserts for detecting invalid actions in MCTS. Let me know if you run into any other issues in the meantime!

@bubble-07 (Author) commented:

Thanks for the attention to this issue - cherry-picking my environment on top of the most recent change-sets completely resolves the problem of negative reward values leading to invalid actions! I'm also seeing training dynamics stabilize with a lower learning rate, so I guess I'm off to the races.

My apologies for swapping around the legal-actions assignment logic - I only made that change as a last-resort, "throw spaghetti and see if it sticks" approach to debugging, and I'm sorry if it complicated the investigation at all.

Adding built-in assertions to check the integrity of MCTS invariants could be useful - maybe enabling them only for debug=true configs so that performance doesn't take a hit?
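A minimal sketch of that gating, reusing the assert_visits_legal helper sketched earlier in this thread (config.debug and the tensor names are assumptions, not existing turbozero attributes):

```python
# Skip the extra per-step tensor check entirely in normal training runs;
# only pay for it when the config explicitly asks for debug mode.
if config.debug:
    assert_visits_legal(visit_probs, legal_mask)
```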

I'm content with the resolution here, but I'll leave the issue open for now, given that you have some other things you want to tackle before declaring it closed.

Thanks again!
