
Weights are not updating #39

Open
vmarar opened this issue Jan 5, 2021 · 8 comments
vmarar commented Jan 5, 2021

Hi! My weights don't seem to be changing.

I printed out the grad for each of the layers and it comes back as `None`; the grad of the hidden layer is `None` as well.

In the reinforcement file, the only thing I changed was `rl_loss -= (log_probs[0, top_i].cpu().detach().numpy() * discounted_reward)`, and I then tried `rl_loss -= (log_probs[0, top_i].item() * discounted_reward)`, due to an issue with CUDA device 0.

Would you know of any reason why `grad` would come up as `None` for all of the layers? I believe this is why `optimizer.step()` is not working and the weights are not updating.

@Mariewelt (Collaborator) commented
Hi @vmarar
I'd say this is expected behavior. Later in the code, `rl_loss` is backpropagated by calling `rl_loss.backward()`, but your change detaches `rl_loss` from the computational graph, so there is nothing to backpropagate. Is there a specific reason why you need to do this? What kind of issues/errors do you have with your CUDA device?
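[Editor's note] A minimal sketch of the detachment issue described above, using a toy stand-in for the policy rather than the repo's actual code:

```python
import torch

# Tiny stand-in for the policy: a single trainable weight vector.
w = torch.ones(3, requires_grad=True)
log_probs = torch.log_softmax(w, dim=0)
discounted_reward = 0.9

# Detached: .detach().numpy() (or .item()) yields a plain Python number,
# so this subtraction never enters the computational graph.
rl_loss_detached = -float(log_probs[0].detach()) * discounted_reward

# Connected: keep the log-probability as a tensor so backward() reaches w.
rl_loss = -log_probs[0] * discounted_reward
rl_loss.backward()
print(w.grad)  # a real gradient tensor, not None
```

Calling `.backward()` on the detached float is impossible (it is not a tensor at all), which is why no gradients ever reach the weights.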


vmarar commented Jan 6, 2021

Would this be the case using `rl_loss -= (log_probs[0, top_i].item() * discounted_reward)` as well? Would there still be a detachment?
The error is: `can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first`

I currently have CUDA visible devices set to 0. Is there a way to bypass this error?

Thank you!

@Mariewelt (Collaborator) commented

> Would this be the case using `rl_loss -= (log_probs[0, top_i].item() * discounted_reward)` as well? Would there still be a detachment?

Yes, that is exactly what `.item()` does: it returns just the plain Python value of a tensor, with no autograd history.
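[Editor's note] A quick illustration of that point, using made-up toy values:

```python
import torch

t = torch.tensor([2.0, 3.0], requires_grad=True)
s = t.sum()

value = s.item()    # plain Python float, carries no autograd history
print(type(value))  # <class 'float'>

# The tensor s itself is still attached to the graph and can backpropagate:
s.backward()
print(t.grad)       # tensor([1., 1.])
```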

> The error is: `can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first`

Which line triggers this error? Could you please provide a traceback?


vmarar commented Jan 6, 2021

[Screenshot from 2021-01-06: traceback of the error]

This is the traceback; it comes directly from the `rl_loss` line.

Our predictive model returns a NumPy array, and `discounted_reward` is also a NumPy value rather than a tensor; I think that is the issue. I'll try converting the reward from NumPy to a tensor. Would that still lead to a disconnect in the graph?


vmarar commented Jan 6, 2021

I tried converting from NumPy to a tensor and then moving it to `cuda:0`. There is no error, but the model weights are still not changing.

@Mariewelt (Collaborator) commented

> I tried converting from NumPy to tensor and then pointing it towards cuda:0, no error but model weights are still not changing

Yes, that wouldn't help, because you are first detaching the loss and then creating a new node that is not connected to the existing computational graph. In this scenario the loss is not backpropagated, which is why the weights are not changing.
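[Editor's note] A sketch of the failure mode being described here: re-wrapping an existing loss with `torch.tensor(...)` creates a fresh leaf that autograd cannot trace back through (toy example, not the repo's code):

```python
import torch

x = torch.ones(2, requires_grad=True)
loss = (3 * x).sum()

# Re-wrapping in torch.tensor() creates a NEW leaf node with no link
# back to x, so backpropagating it never reaches x.
new_loss = torch.tensor(loss.item(), requires_grad=True)
new_loss.backward()
print(x.grad)  # None: the gradient chain stops at new_loss

# Backpropagating the original, still-connected loss does reach x.
loss.backward()
print(x.grad)  # tensor([3., 3.])
```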

Let me investigate the issue. Could you please provide your environment details? The versions of PyTorch, NVIDIA driver, and what GPU you are using.

@vmarar
Copy link
Author

vmarar commented Jan 6, 2021

Here are the env details:
GPU: GeForce RTX 2080
PyTorch version: 1.6.0
NVIDIA-SMI 450.57, Driver Version: 450.57, CUDA Version: 11.0

Details:
I have a scoring script that I am using in lieu of a predictive model; it outputs a NumPy array. Currently I am using `output_final = torch.tensor(output, device='cuda:0')` to convert it to a tensor, which got rid of the `rl_loss` error. Please let me know if there are any other details I can provide.
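[Editor's note] For context: the reward coming out of a scoring script does not itself need gradients; what matters is that the log-probabilities stay attached to the graph. A minimal sketch with illustrative names (not the repo's actual variables):

```python
import torch

w = torch.zeros(4, requires_grad=True)   # stand-in policy weights
log_probs = torch.log_softmax(w, dim=0)

# Reward from an external scoring script: a plain float is fine here,
# since gradients never need to flow through the reward itself.
discounted_reward = 0.97

rl_loss = torch.zeros(())                # scalar tensor accumulator
rl_loss = rl_loss - log_probs[2] * discounted_reward
rl_loss.backward()
print(w.grad)  # non-None: gradients reach the policy weights
```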

Thanks for your help!


vmarar commented Jan 7, 2021

After trying some solutions, I found that the weights do seem to backpropagate once I (a) convert the NumPy output to a tensor inside my scoring script, and (b) address the warning about using a source tensor to create a copy by simply commenting out the `rl_loss = torch.tensor(rl_loss, requires_grad=True)` line.

Is there any reason it could be wrong to comment out that line, so that instead of creating a new tensor from the existing `rl_loss` tensor I backpropagate the gradient from `rl_loss = rl_loss / n_batch` directly? It is working, but I want to make sure I am not backpropagating useless values.

Is it possible that, by using a script in lieu of a predictive model, autograd only sees the output of the script, which is in the form of a tensor, and therefore there are no breaks in the computational graph because autograd is only dealing with that output?
