Question on Chapter 18 - loss functions #604
Comments
Hi @jab2727, thanks for your question. You are correct: we are indeed pretending that whatever action the model chose was the correct one, and we're saving the corresponding gradients. Later in the notebook, we determine whether the action was actually good or not, and based on that info we follow the gradient vector in one direction or the other. Hope this helps!
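The "determine whether the action was actually good" step is the reward-discounting pass. Here is a minimal standalone sketch of that idea (the function name `discount_rewards` follows the book's notebook, but this self-contained version and its sample numbers are an illustration, not the exact book code):

```python
import numpy as np

def discount_rewards(rewards, discount_factor):
    """Walk backwards through an episode, crediting each step with the
    (discounted) sum of all rewards that followed it."""
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

# Each saved "pretend it was correct" gradient is later scaled by its
# step's (normalized) return: positive returns reinforce the action,
# negative returns push the policy the opposite way.
print(discount_rewards([10, 0, -50], 0.8))  # [-22. -40. -50.]
```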
Ok, thanks so much for the quick response, very helpful. On page 710 we have the following DQN loss function:
Please correct me if I'm wrong, but this example calculates the loss in a very different way. We're not assuming the action was correct and then determining how many points were earned in the discounting step. We're estimating how many points can be earned in the future with target_Q_values, comparing that to what was actually earned, and feeding those two values into the loss function. If that's correct, I'm reading through the book's explanation of what's happening in the code, but I'm having trouble understanding what's going on from the mask down. The mask appears to zero out the Q-values, but I'm not clear on how it's only selecting the "ones we do not want". Also, instead of computing the Q-value for every state, would it be possible to compute only the Q-value for the single state that produced the max_next_Q_values? Thank you again!
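To make the masking step concrete, here is a minimal NumPy sketch of the selection trick being asked about (the array values are invented for illustration; the book's code does the equivalent with tf.one_hot and tf.reduce_sum):

```python
import numpy as np

# Hypothetical mini-batch: 3 experiences, 2 possible actions (0 = left, 1 = right).
# Each row holds the model's predicted Q-value for BOTH actions in that state.
all_Q_values = np.array([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])
actions = np.array([1, 0, 1])  # the action actually taken in each experience

# One-hot mask: a 1 in the column of the taken action, 0 everywhere else.
mask = np.eye(2)[actions]

# Multiplying zeroes out the Q-values of the actions that were NOT taken;
# summing along axis 1 then keeps only the taken action's Q-value per row.
Q_values = (all_Q_values * mask).sum(axis=1)
print(Q_values)  # [2. 3. 6.]
```

So the mask doesn't select the unwanted Q-values directly; it zeroes them, and the row-wise sum is what leaves only the Q-value of the action that was actually played.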
Greetings, I'm working through the cartpole example on page 695 of the third edition, and I have a question about the code presented:
I'm confused about y_target, and why it's an input into the loss function. If the action is False (0), y_target is 1. If the action is True (1), y_target is 0. It seems like we are effectively saying that the model should have been more confident in whatever its output was. Is that the correct way to think about what y_target is accomplishing? If so, is there something happening in a later step where we're determining if the action recommended by the model was beneficial?
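The y_target relationship can be sketched in a few lines (a standalone illustration with made-up numbers and hypothetical helper names; the notebook computes this inside a tf.GradientTape with Keras's binary cross-entropy loss):

```python
import numpy as np

def pretend_correct_target(action):
    """Target probability of going left, assuming the chosen action
    (0 = left, 1 = right) was the correct one."""
    return 1.0 - action

def bce(y_target, left_proba):
    """Binary cross-entropy between the target and the model's
    predicted probability of going left."""
    return -(y_target * np.log(left_proba)
             + (1 - y_target) * np.log(1 - left_proba))

left_proba = 0.7  # suppose the model leans toward "left"
# If "left" was sampled, lowering this loss means raising left_proba;
# if "right" was sampled, lowering it means lowering left_proba.
# Either way the gradient says "be more confident in what you did" --
# whether it is applied as-is or reversed is decided later, once the
# discounted rewards reveal if the action was actually beneficial.
print(bce(pretend_correct_target(0), left_proba))  # ~0.357
print(bce(pretend_correct_target(1), left_proba))  # ~1.204
```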
I have similar questions about the loss function presented on page 710, but if I can get some clarification on this earlier example, perhaps I'll understand the more challenging Q-value example.
Thank you!