
Reward loss targets don't account for episodes that finish within n steps #3

Open
srinivr opened this issue Aug 16, 2018 · 6 comments


srinivr commented Aug 16, 2018

Hi,

Great work. I enjoyed reading the paper and I replicated your work independently.

I noticed a minor performance difference between the two implementations, and on closer inspection it seems that while computing the target for the reward loss, you aren't accounting for episodes that finish within the n-step window (whereas the Q loss handles this correctly).

Specifically, `proc_seq.append(seq[env, t+offset:t+offset+depth, :])` on line 32 of treeqn_utils.py doesn't check the done flags.

If an episode finishes at time t=3, this error makes the reward target for d=1 at t=3 the reward at t=4, which belongs to the next episode and is therefore wrong. Can you please clarify whether my understanding is correct?
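
To illustrate what I mean, here's a minimal sketch (the function name, shapes, and layout are mine, not the repo's) of how checking done flags could cut the reward-target window off at episode boundaries:

```python
import numpy as np

def build_reward_targets(rewards, dones, depth):
    """Illustrative only: n-step reward targets that respect episode ends.

    rewards: (T,) rewards from one environment
    dones:   (T,) episode-termination flags
    depth:   number of future rewards predicted per timestep
    """
    T = len(rewards)
    targets = np.zeros((T, depth))
    for t in range(T):
        for d in range(depth):
            if t + d >= T:
                break
            targets[t, d] = rewards[t + d]
            if dones[t + d]:
                # the terminal step's reward is real, but anything after
                # it belongs to the next episode, so stop here
                break
    return targets
```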

Greg-Farquhar (Collaborator) commented

Hi,
Thanks a lot, that definitely looks wrong! I think I was handling this correctly in an earlier version and broke it when "simplifying" for the code release 😅. I'll try to fix this if I find a little time, or you're welcome to make a pull request.

zacwellmer (Contributor) commented Sep 9, 2018

#4

Here's what I think is a quick fix. There could probably be a faster, vectorized implementation, though.
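
Something along these lines, maybe (a rough sketch with made-up names, assuming the targets are laid out as (T, depth) windows, which is not necessarily how the repo stores them):

```python
import torch

def mask_after_done(window_rewards, window_dones):
    """Rough sketch: zero rewards that fall after an episode boundary.

    window_rewards, window_dones: (T, depth) tensors where row t holds
    the `depth` future steps starting at timestep t.
    """
    not_done = 1.0 - window_dones
    # alive[t, d] is 1 until a done flag appears strictly before step d,
    # so the terminal step's own reward is kept.
    alive = torch.cumprod(
        torch.cat([torch.ones_like(not_done[:, :1]), not_done[:, :-1]], dim=1),
        dim=1,
    )
    return window_rewards * alive
```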

Greg-Farquhar (Collaborator) commented

Thanks for the bug spot and fix; sorry for being so slow -- merged now!

zacwellmer (Contributor) commented

@Greg-Farquhar I think the np -> torch commit introduced a new bug. In `make_seq_mask`, the `mask` variable is updated in place, which will cause our `tmp_masks` variable to change in the `build_sequences` function:

`mask[int(max_i):].fill_(1)`

I also think that I initially padded `tmp_masks` the wrong way. I forgot that PyTorch pads starting from the last dimension, and we should also be padding with 1s, since they are flipped to zeros in `make_seq_mask`. If both of these changes sound correct, I'll submit a PR.
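
Roughly what I have in mind for the first point (a sketch only; the real function's signature and surroundings differ):

```python
import torch

def make_seq_mask(mask, max_i):
    """Sketch of the fix; not the repo's exact signature.

    mask: (depth, 1) tensor of done flags for one window.
    """
    # The in-place version mutates the caller's tmp_masks slice:
    #   mask[int(max_i):].fill_(1)
    # Cloning first keeps the update local to this function.
    mask = mask.clone()
    mask[int(max_i):] = 1
    return 1 - mask  # flip: 1 = keep, 0 = past the episode end
```

On the padding point: `torch.nn.functional.pad` takes its pad widths starting from the last dimension, so for a 2-D tensor, `F.pad(x, (0, 0, 0, k), value=1)` appends k rows of 1s along the first dimension.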

Greg-Farquhar (Collaborator) commented

Ah, thanks. Just pushed, let me know if that fixes it correctly!

Greg-Farquhar reopened this on Oct 3, 2018

zacwellmer (Contributor) commented

looks good!
