
Reward loss targets don't account for episodes that finish within n steps #3

Open
srinivr opened this issue Aug 16, 2018 · 6 comments


srinivr commented Aug 16, 2018

Hi,

Great work. I enjoyed reading the paper and I replicated your work independently.

I noticed a minor performance difference between the two implementations, and on closer inspection it seems that while computing the target for the reward loss, you aren't accounting for episodes that finish within the n-step window (whereas the Q loss handles this correctly).

Specifically, `proc_seq.append(seq[env, t+offset:t+offset+depth, :])` on line 32 of treeqn_utils.py doesn't check the done flags.

If an episode finishes at time t=3, this error makes the reward target for d=1 at t=3 the reward at t=4, which belongs to the next episode and is therefore wrong. Can you please clarify whether my understanding is correct?
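
To illustrate what I mean, here's a minimal sketch (the function name, shapes, and layout are mine, not the repo's) of how checking done flags could cut the reward-target window off at episode boundaries:

```python
import numpy as np

def build_reward_targets(rewards, dones, depth):
    """Illustrative only: n-step reward targets that respect episode ends.

    rewards: (T,) rewards from one environment
    dones:   (T,) episode-termination flags
    depth:   number of future rewards predicted per timestep
    """
    T = len(rewards)
    targets = np.zeros((T, depth))
    for t in range(T):
        for d in range(depth):
            if t + d >= T:
                break
            targets[t, d] = rewards[t + d]
            if dones[t + d]:
                # the terminal step's reward is real, but anything after
                # it belongs to the next episode, so stop here
                break
    return targets
```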

Greg-Farquhar (Collaborator) commented

Hi,
Thanks a lot, that definitely looks wrong! I think I was handling this correctly in an earlier version and broke it when "simplifying" for the code release 😅. I'll try to fix this if I find a little time, or you're welcome to make a pull request.

zacwellmer (Contributor) commented Sep 9, 2018

#4

Here's what I think is a quick fix. There could probably be a faster, vectorized implementation, though.
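
Something along these lines, maybe (a rough sketch with made-up names, assuming the targets are laid out as (T, depth) windows, which is not necessarily how the repo stores them):

```python
import torch

def mask_after_done(window_rewards, window_dones):
    """Rough sketch: zero rewards that fall after an episode boundary.

    window_rewards, window_dones: (T, depth) tensors where row t holds
    the `depth` future steps starting at timestep t.
    """
    not_done = 1.0 - window_dones
    # alive[t, d] is 1 until a done flag appears strictly before step d,
    # so the terminal step's own reward is kept.
    alive = torch.cumprod(
        torch.cat([torch.ones_like(not_done[:, :1]), not_done[:, :-1]], dim=1),
        dim=1,
    )
    return window_rewards * alive
```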

Greg-Farquhar (Collaborator) commented

Thanks for the bug spot and fix; sorry for being so slow -- merged now!

zacwellmer (Contributor) commented

@Greg-Farquhar I think the np -> torch commit introduced a new bug. In `make_seq_mask`, the `mask` variable is updated in place, which will cause our `tmp_masks` variable to change in the `build_sequences` function:

`mask[int(max_i):].fill_(1)`

I also think that I initially padded `tmp_masks` the wrong way. I forgot that PyTorch pads starting from the last dimension, and we should also be padding with 1s, since they are flipped to zeros in `make_seq_mask`. If both of these changes sound correct, I'll submit a PR.
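
Roughly what I have in mind for the first point (a sketch only; the real function's signature and surroundings differ):

```python
import torch

def make_seq_mask(mask, max_i):
    """Sketch of the fix; not the repo's exact signature.

    mask: (depth, 1) tensor of done flags for one window.
    """
    # The in-place version mutates the caller's tmp_masks slice:
    #   mask[int(max_i):].fill_(1)
    # Cloning first keeps the update local to this function.
    mask = mask.clone()
    mask[int(max_i):] = 1
    return 1 - mask  # flip: 1 = keep, 0 = past the episode end
```

On the padding point: `torch.nn.functional.pad` takes its pad widths starting from the last dimension, so for a 2-D tensor, `F.pad(x, (0, 0, 0, k), value=1)` appends k rows of 1s along the first dimension.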

Greg-Farquhar (Collaborator) commented

Ah, thanks. Just pushed, let me know if that fixes it correctly!

Greg-Farquhar reopened this on Oct 3, 2018

zacwellmer (Contributor) commented

looks good!
