Question on Chapter 18 - loss functions #604

Open
jab2727 opened this issue Feb 18, 2023 · 2 comments

Comments

@jab2727

jab2727 commented Feb 18, 2023

Greetings! I'm working through the CartPole example on page 695 of the third edition, and I have a question about the code presented:

def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        # Probability of going left, predicted by the policy network
        left_proba = model(obs[np.newaxis])
        # Sample the action: False (0) = left, True (1) = right
        action = (tf.random.uniform([1, 1]) > left_proba)
        # Pretend the sampled action was the correct one: target probability of going left
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))

    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, truncated, info = env.step(int(action))
    return obs, reward, done, truncated, grads

I'm confused about y_target and why it's an input to the loss function. If the action is False (0), y_target is 1. If the action is True (1), y_target is 0. It seems like we are effectively saying that the model should have been more confident in whatever its output was. Is that the correct way to think about what y_target is accomplishing? If so, is there something happening in a later step where we determine whether the action recommended by the model was beneficial?
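
For instance, plugging in a made-up left_proba of 0.8:

import tensorflow as tf

left_proba = tf.constant([[0.8]])            # made-up probability of going left
for action in (False, True):                 # False = left, True = right
    y_target = tf.constant([[1.]]) - tf.cast([[action]], tf.float32)
    print(action, y_target.numpy())          # False -> [[1.]], True -> [[0.]]

So whichever action gets sampled, the target seems to say "that action should have had probability 1".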

I have similar questions about the loss function presented on page 710, but if I can get some clarification on this earlier example, perhaps I'll understand the more challenging Q-value example.

Thank you!

@ageron
Owner

ageron commented Feb 18, 2023

Hi @jab2727 ,

Thanks for your question. You are correct: we are indeed pretending that whatever action the model chose was the correct one, and we're saving the corresponding gradients. Later in the notebook, we determine whether the action was actually good or not, and based on that info we follow the gradient vector in one direction or the other.
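
In pseudocode, that later step does something like this (simplified sketch, not the exact notebook code):

# After playing several episodes, all_rewards holds the rewards and all_grads the
# gradients saved at each step. Weight each step's gradients by its normalized
# discounted return, average them over all steps, and apply the result.
all_final_rewards = discount_and_normalize_rewards(all_rewards, discount_factor)
all_mean_grads = []
for var_index in range(len(model.trainable_variables)):
    mean_grads = tf.reduce_mean(
        [final_reward * all_grads[episode_index][step][var_index]
         for episode_index, final_rewards in enumerate(all_final_rewards)
         for step, final_reward in enumerate(final_rewards)], axis=0)
    all_mean_grads.append(mean_grads)
optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))

A below-average (negative) normalized return flips the sign of the saved gradients, which pushes the probability of the chosen action down instead of up.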

Hope this helps!

@jab2727
Author

jab2727 commented Feb 20, 2023

OK, thanks so much for the quick response; that's very helpful. On page 710 we have the following DQN training step:

def training_step(batch_size):
    # Sample a batch of past experiences from the replay buffer
    experiences = sample_experiences(batch_size)
    states, actions, rewards, next_states, dones, truncateds = experiences
    # Estimate the Q-values of the next states and keep the best one per state
    next_Q_values = model.predict(next_states, verbose=0)
    max_next_Q_values = next_Q_values.max(axis=1)
    runs = 1.0 - (dones | truncateds)  # 0 if the episode ended or was truncated, else 1
    # Target: immediate reward plus discounted best future Q-value (zero if the episode is over)
    target_Q_values = rewards + runs * discount_factor * max_next_Q_values
    target_Q_values = target_Q_values.reshape(-1, 1)
    print("The target_Q_values are: ")
    print(target_Q_values)
    mask = tf.one_hot(actions, n_outputs)  # one-hot encode the actions that were taken
    with tf.GradientTape() as tape:
        all_Q_values = model(states)
        # Keep only the Q-value of the action taken in each state
        Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
        print("The actual Q values are: ")
        print(Q_values)
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

Please correct me if I'm wrong, but this example is calculating the loss in a very different way. We're not assuming the action was correct and then determining how many points were earned in the discounting step. We're estimating how many points can be earned in the future with target_Q_values, comparing that to what was actually earned, and feeding those two values into the loss function.
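
(With made-up numbers: for a single transition with reward = 1.0, discount_factor = 0.95, and max_next_Q_value = 10.0, the target would be 1.0 + 0.95 * 10.0 = 10.5; if the episode ended, runs = 0 and the target collapses to just the reward, 1.0.)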

If that's correct, I'm reading through the book's explanation of what's happening in the code, but I'm having trouble understanding what's going on from the mask down. The mask appears to zero out Q-values, but I'm not clear on how it's selecting only the "ones we do not want". Also, instead of computing the Q-values for every state, would it be possible to compute only the Q-value for the single state that produced the max_next_Q_values?
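
To make my mental model concrete, here's a tiny standalone example with made-up numbers (assuming n_outputs = 2):

import tensorflow as tf

actions = tf.constant([0, 1, 1])              # actions taken in a batch of 3 states
all_Q_values = tf.constant([[1.0, 2.0],
                            [3.0, 4.0],
                            [5.0, 6.0]])      # stand-in for model(states)
mask = tf.one_hot(actions, 2)                 # [[1, 0], [0, 1], [0, 1]]
Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
print(Q_values)                               # [[1.], [4.], [6.]]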

Thank you again!
