AlphaZero+MCTS: Visit probabilities for invalid actions can be non-zero #5

Open
bubble-07 opened this issue Nov 19, 2023 · 3 comments

@bubble-07 commented:

Hey there - I've drafted an implementation of a custom environment for the (approximate) matrix semigroup reachability problem, which I've added to my fork here: https://github.com/bubble-07/turbozero. One thing that's currently puzzling me is that the environment consistently assigns non-zero visit probabilities to actions that are declared invalid. I've added a few debug prints and a sys.exit() to core/train/trainer.py in my fork to track this down, and I can reproduce the issue with:

python3 turbozero.py --verbose --gpu --mode=train --config=./example_configs/asmr_mini.yaml --logfile=./asmr_mini.log --debug

Output

torch.Size([6])
Populating Replay Memory...: 100%|██████████████████████████████| 4/4 [00:00<00:00,  8.11it/s]
Collecting self-play episodes...:  25%|██████▎                  | 1/4 [00:00<00:01,  2.03it/s]tensor([[-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
        [-8.9384e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -6.0272e-01],
        [-1.0662e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38, -1.0292e+00],
        [-7.0420e-01, -3.4028e+38, -3.4028e+38, -3.4028e+38, -2.6667e-01]],
       device='cuda:0', grad_fn=<AddBackward0>)
tensor([[0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
        [0.2525, 0.5455, 0.0000, 0.0000, 0.2020],
        [0.0100, 0.9800, 0.0000, 0.0000, 0.0100],
        [0.1700, 0.0000, 0.0000, 0.0000, 0.8300]], device='cuda:0')
tensor(inf, device='cuda:0', grad_fn=<MulBackward0>)

The first tensor printed is the neural net's policy logits (after masking out invalid entries), the second is the empirical visit probabilities, and the third is the policy cross-entropy loss for the iteration. While the neural net's policy is correctly masked, the empirical visit probabilities apparently aren't, judging from the second column. I tried to pick apart where this goes wrong in the MCTS routine, but unfortunately couldn't get far - do you have any handy ways to debug this?

lowrollr self-assigned this Nov 19, 2023
@lowrollr (Owner) commented Nov 19, 2023:

I was able to reproduce, then correct the behavior you are describing.

First, in 16d2f3f, you move the legal actions assignment to before env.step in mcts.py, meaning that legal actions are calculated using the previous state of the environment rather than the current state, which results in invalid actions being taken. You'd need to revert this change.
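To make the ordering problem concrete, here's a minimal self-contained sketch (`ToyEnv` and its methods are made up for illustration, not the actual turbozero API); the legality mask has to be read from the state produced by `env.step`, not from the parent state:

```python
import torch

class ToyEnv:
    """Stand-in environment used only to illustrate the ordering issue."""
    def __init__(self, num_actions: int = 5):
        self.used = torch.zeros(num_actions, dtype=torch.bool)

    def step(self, action: int) -> None:
        self.used[action] = True  # taking an action consumes it

    def get_legal_actions(self) -> torch.Tensor:
        return ~self.used  # an action is legal only if it hasn't been taken

env = ToyEnv()

# Ordering introduced in 16d2f3f: the mask is read from the parent state,
# so the action just taken still looks legal for the child node.
stale_mask = env.get_legal_actions()
env.step(0)

# Reverted (correct) ordering: step first, then read the mask.
fresh_mask = env.get_legal_actions()
print(stale_mask)  # tensor([True, True, True, True, True])
print(fresh_mask)  # tensor([False, True, True, True, True])
```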

This is not the only source of invalid actions, however. I also realized that my previously implemented environments all assume rewards/evaluations are positive. Your custom environment can assign negative rewards, which breaks some of the action-selection logic built on that assumption. It should be straightforward to allow negative rewards/evaluations - I'll link a commit to this issue that resolves this problem.
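To sketch why the positive-only assumption matters here (the tensors are made up, and this shows the failure mode rather than the exact selection code): if illegal actions are masked by multiplying scores with a 0/1 mask, a legal action's negative evaluation loses the argmax to an illegal action's zero.

```python
import torch

# Hypothetical per-action selection scores; action 1 is illegal.
scores = torch.tensor([-0.9, -0.3, -1.2, -0.6])
legal = torch.tensor([True, False, True, True])

# If scores were guaranteed non-negative, zeroing illegal entries would be
# enough. With negative evaluations, the illegal action's 0 wins the argmax.
masked_mul = scores * legal
print(masked_mul.argmax())  # tensor(1) -> illegal action chosen

# Filling illegal entries with -inf works regardless of the sign of scores.
masked_inf = scores.masked_fill(~legal, float('-inf'))
print(masked_inf.argmax())  # tensor(3) -> best legal action chosen
```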

Finally, I believe the inclusion of label smoothing in the cross-entropy loss causes the policy loss values to become extremely large: the policy logits corresponding to invalid actions are set to very large negative numbers rather than zero, so the smoothed target places a sliver of probability mass on each of them and an enormous loss accumulates across those logits. I will explore ways to support label smoothing as well, but the current implementation does not allow for it.
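Roughly, the mechanism looks like this (an illustrative call with made-up numbers in the spirit of the output above, not the repo's exact loss code):

```python
import torch
import torch.nn.functional as F

mask_value = torch.finfo(torch.float32).min  # ~ -3.4e38, used for invalid actions
logits = torch.tensor([[-0.89, mask_value, mask_value, mask_value, -0.60]])
visit_probs = torch.tensor([[0.55, 0.0, 0.0, 0.0, 0.45]])  # empirical targets

# Without smoothing, the masked actions carry zero target mass and contribute
# nothing, so the cross entropy stays finite.
print(F.cross_entropy(logits, visit_probs, label_smoothing=0.0))

# With smoothing, a sliver of target mass lands on each masked logit, whose
# log-probability is around -3.4e38, so the reported loss becomes
# astronomically large (and overflows to inf as such terms accumulate).
print(F.cross_entropy(logits, visit_probs, label_smoothing=0.1))
```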

When I removed label smoothing and addressed the other two issues above (and also lowered the learning rate in your provided config file), I saw reasonable loss values and did not detect any invalid actions.

On the topic of debugging, I use the VSCode debugger within a Jupyter notebook and set breakpoints; for this issue I set up breakpoints to detect when an illegal action was chosen.

I think ideally there should be some built-in assertions to detect this exact situation, since this is usually how such issues manifest. Will look into that more as well.
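One possible shape for such an assertion (a hypothetical helper, not existing turbozero code): after search, the empirical visit distribution should carry zero mass wherever the legality mask marks an action invalid.

```python
import torch

def assert_visits_legal(visit_probs: torch.Tensor, legal_mask: torch.Tensor) -> None:
    """Fail loudly if MCTS assigned any visits to an illegal action.

    visit_probs: (batch, num_actions) empirical visit distribution
    legal_mask:  (batch, num_actions) boolean mask, True where an action is legal
    """
    illegal_mass = visit_probs.masked_fill(legal_mask, 0.0)
    assert torch.all(illegal_mass == 0), (
        f"non-zero visit probability on illegal actions: {illegal_mass}"
    )
```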

Thank you for your patience and for pointing out this issue! Will have these problems resolved within the next day or two.

@lowrollr (Owner) commented:

87fd4d8 allows for negative rewards/evaluations.

I'll keep this open until I address label smoothing as well, and perhaps debug asserts for detecting invalid actions in MCTS. Let me know if you run into any other issues in the meantime!

@bubble-07 (Author) commented:

Thanks for the attention to this issue - cherry-picking my environment on top of the most recent change-sets completely resolves the problem of negative reward values leading to invalid actions! I'm also seeing training dynamics stabilize with a lower learning rate, so I guess I'm off to the races.

My apologies for swapping around the legal-actions assignment logic - I only made that change as a last-resort, "throw spaghetti and see if it sticks" approach to debugging, and I'm sorry if it complicated the investigation at all.

Adding built-in assertions to check the integrity of MCTS invariants could be useful - maybe enabling them only for debug=true configs so that performance doesn't take a hit?
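A minimal sketch of that gating, reusing the assert_visits_legal helper sketched earlier in this thread (config.debug and the tensor names are assumptions, not existing turbozero attributes):

```python
# Skip the extra per-step tensor check entirely in normal training runs;
# only pay for it when the config explicitly asks for debug mode.
if config.debug:
    assert_visits_legal(visit_probs, legal_mask)
```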

I'm content with the resolution here, but I'll leave the issue open for now, given that you have some other things you want to tackle before declaring it closed.

Thanks again!
