Question on Chapter 18 - loss functions #604

Open
jab2727 opened this issue Feb 18, 2023 · 2 comments

Comments

@jab2727

jab2727 commented Feb 18, 2023

Greetings! I'm working through the CartPole example on page 695 of the third edition, and I have a question about the code presented:

def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        # Probability of going left, predicted by the policy network
        left_proba = model(obs[np.newaxis])
        # Sample the action: False (0) = left, True (1) = right
        action = (tf.random.uniform([1, 1]) > left_proba)
        # Pretend the sampled action was the correct one: target probability of going left
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))

    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, truncated, info = env.step(int(action))
    return obs, reward, done, truncated, grads

I'm confused about y_target and why it's an input to the loss function. If the action is False (0), y_target is 1. If the action is True (1), y_target is 0. It seems like we are effectively saying that the model should have been more confident in whatever its output was. Is that the correct way to think about what y_target is accomplishing? If so, is there something happening in a later step where we determine whether the action recommended by the model was beneficial?
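
For instance, plugging in a made-up left_proba of 0.8:

import tensorflow as tf

left_proba = tf.constant([[0.8]])            # made-up probability of going left
for action in (False, True):                 # False = left, True = right
    y_target = tf.constant([[1.]]) - tf.cast([[action]], tf.float32)
    print(action, y_target.numpy())          # False -> [[1.]], True -> [[0.]]

So whichever action gets sampled, the target seems to say "that action should have had probability 1".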

I have similar questions about the loss function presented on page 710, but if I can get some clarification on this earlier example, perhaps I'll understand the more challenging Q-value example.

Thank you!

@ageron
Owner

ageron commented Feb 18, 2023

Hi @jab2727 ,

Thanks for your question. You are correct: we are indeed pretending that whatever action the model chose was the correct one, and we're saving the corresponding gradients. Later in the notebook, we determine whether the action was actually good or not, and based on that info we follow the gradient vector in one direction or the other.
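
In pseudocode, that later step does something like this (simplified sketch, not the exact notebook code):

# After playing several episodes, all_rewards holds the rewards and all_grads the
# gradients saved at each step. Weight each step's gradients by its normalized
# discounted return, average them over all steps, and apply the result.
all_final_rewards = discount_and_normalize_rewards(all_rewards, discount_factor)
all_mean_grads = []
for var_index in range(len(model.trainable_variables)):
    mean_grads = tf.reduce_mean(
        [final_reward * all_grads[episode_index][step][var_index]
         for episode_index, final_rewards in enumerate(all_final_rewards)
         for step, final_reward in enumerate(final_rewards)], axis=0)
    all_mean_grads.append(mean_grads)
optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))

A below-average (negative) normalized return flips the sign of the saved gradients, which pushes the probability of the chosen action down instead of up.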

Hope this helps!

@jab2727
Author

jab2727 commented Feb 20, 2023

OK, thanks so much for the quick response; that's very helpful. On page 710 we have the following DQN training step:

def training_step(batch_size):
    # Sample a batch of past experiences from the replay buffer
    experiences = sample_experiences(batch_size)
    states, actions, rewards, next_states, dones, truncateds = experiences
    # Estimate the Q-values of the next states and keep the best one per state
    next_Q_values = model.predict(next_states, verbose=0)
    max_next_Q_values = next_Q_values.max(axis=1)
    runs = 1.0 - (dones | truncateds)  # 0 if the episode ended or was truncated, else 1
    # Target: immediate reward plus discounted best future Q-value (zero if the episode is over)
    target_Q_values = rewards + runs * discount_factor * max_next_Q_values
    target_Q_values = target_Q_values.reshape(-1, 1)
    print("The target_Q_values are: ")
    print(target_Q_values)
    mask = tf.one_hot(actions, n_outputs)  # one-hot encode the actions that were taken
    with tf.GradientTape() as tape:
        all_Q_values = model(states)
        # Keep only the Q-value of the action taken in each state
        Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
        print("The actual Q values are: ")
        print(Q_values)
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

Please correct me if I'm wrong, but this example is calculating the loss in a very different way. We're not assuming the action was correct and then determining how many points were earned in the discounting step. We're estimating how many points can be earned in the future with target_Q_values, comparing that to what was actually earned, and feeding those two values into the loss function.
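
(With made-up numbers: for a single transition with reward = 1.0, discount_factor = 0.95, and max_next_Q_value = 10.0, the target would be 1.0 + 0.95 * 10.0 = 10.5; if the episode ended, runs = 0 and the target collapses to just the reward, 1.0.)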

If that's correct, I'm reading through the book's explanation of what's happening in the code, but I'm having trouble understanding what's going on from the mask down. The mask appears to zero out Q-values, but I'm not clear on how it's selecting only the "ones we do not want". Also, instead of computing the Q-values for every state, would it be possible to compute only the Q-value for the single state that produced the max_next_Q_values?
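
To make my mental model concrete, here's a tiny standalone example with made-up numbers (assuming n_outputs = 2):

import tensorflow as tf

actions = tf.constant([0, 1, 1])              # actions taken in a batch of 3 states
all_Q_values = tf.constant([[1.0, 2.0],
                            [3.0, 4.0],
                            [5.0, 6.0]])      # stand-in for model(states)
mask = tf.one_hot(actions, 2)                 # [[1, 0], [0, 1], [0, 1]]
Q_values = tf.reduce_sum(all_Q_values * mask, axis=1, keepdims=True)
print(Q_values)                               # [[1.], [4.], [6.]]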

Thank you again!
