optimizer load_state_dict() problem? #2830
Could you provide a full script to reproduce the problem?
Maybe you can try it like this:
Sorry, I missed the reply email. I'm afraid I am unable to provide a reproducer right now. This is work I am doing for the OpenNMT-py project (https://github.com/OpenNMT/OpenNMT-py), trying to resume training from a saved checkpoint. I've tried several methods, including tricks like what @hefeicyp suggests, but it still happens. Per my analysis, it is because the previous training was done on GPU, so the optimizer state saved in the checkpoint consists of CUDA tensors.
Try moving the optimizer state to GPU memory manually after loading it from the checkpoint:

```python
optimizer = optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()
```

I agree that having an `optimizer.cuda()` method would be handy.
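A device-agnostic variant of that loop may be worth keeping around (a sketch; the helper name `optimizer_state_to_device` is mine, not part of PyTorch, and `.to(device)` is used instead of `.cuda()` so the same code also works when resuming on CPU):

```python
import torch
from torch import nn, optim

def optimizer_state_to_device(optimizer, device):
    # Move every tensor in the optimizer's state (exp_avg, exp_avg_sq, ...)
    # to the target device, in place.
    for state in optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)

# Minimal usage: take one training step so Adam actually has state, then move it.
model = nn.Linear(4, 2)
opt = optim.Adam(model.parameters())
model(torch.randn(3, 4)).sum().backward()
opt.step()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
optimizer_state_to_device(opt, device)
```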
@dogancan, thanks. My work was suspended due to other problems; when it resumes, I will try your method.
I'm afraid @dogancan's solution won't work. It will make the error go away, but your optimizer will no longer be training the model. You should recreate optimizers after casting modules to a different type or device.
@apaszke, yep, your method is what I currently use, and it works. But I will wait for upstream to fix this problem. Thanks for your great work!
@apaszke Ah, my bad. I forgot to update the line where the optimizer is recreated. But otherwise, the following should do the job, right?

```python
model = Model()
model.load_state_dict(checkpoint['model'])
model.cuda()

optimizer = optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()
```
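For reference, that recipe can be wired into a runnable round trip (a sketch on a tiny `nn.Linear`; note that recent PyTorch versions already cast loaded optimizer state to the parameters' device inside `load_state_dict`, so the final loop is belt-and-braces there):

```python
import torch
from torch import nn, optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# --- training run: take one step so Adam has state, then build a checkpoint
model = nn.Linear(4, 2).to(device)
optimizer = optim.Adam(model.parameters())
model(torch.randn(3, 4, device=device)).sum().backward()
optimizer.step()
checkpoint = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}

# --- resume: move the model to the device FIRST, then recreate the optimizer
model2 = nn.Linear(4, 2)
model2.load_state_dict(checkpoint["model"])
model2.to(device)
optimizer2 = optim.Adam(model2.parameters())
optimizer2.load_state_dict(checkpoint["optimizer"])
for state in optimizer2.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)

# a further step must now run without a device-mismatch error
model2(torch.randn(3, 4, device=device)).sum().backward()
optimizer2.step()
```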
Ah, right. That should work 😊
Except that you should use …
I had a similar problem. When I save the optimizer state from a GPU other than GPU 0 and then load the state, it still loads everything onto GPU 0. Specifying …
Hi guys, I have a very similar problem to the one in this thread; here's my code:

And then once I resume, I get KeyErrors on my optimizer:

Do you know how to fix this issue? BTW, I'm using 8 GPUs; I'm guessing this issue might be because of that?
@CodArs-van, were you able to solve your issue with multiple GPUs?
@rafaelvalle Thanks for asking. Yeah, I was able to; it turns out the issue was that I used an early version of PyTorch. After I updated it, everything works like a charm!
Just a comment: this problem is caused by the `deepcopy` in `Optimizer.load_state_dict`:

```python
def load_state_dict(self, state_dict):
    ...
    # deepcopy, to be consistent with module API
    state_dict = deepcopy(state_dict)
    ...
```
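That `deepcopy` can be observed directly: after loading, mutating the new optimizer's state leaves the original checkpoint dictionary untouched, because `load_state_dict` copied it rather than aliasing it (a small sketch):

```python
import torch
from torch import nn, optim

model = nn.Linear(2, 1)
opt = optim.Adam(model.parameters())
model(torch.ones(1, 2)).sum().backward()
opt.step()

ckpt = opt.state_dict()                 # shares tensors with opt's live state
opt2 = optim.Adam(nn.Linear(2, 1).parameters())
opt2.load_state_dict(ckpt)              # internally deep-copies ckpt

# Zero out the loaded optimizer's first moments...
for state in opt2.state.values():
    state["exp_avg"].zero_()

# ...the checkpoint's tensors are unaffected, since load_state_dict copied them.
weight_state = ckpt["state"][0]         # state for the first parameter (weight)
print(float(weight_state["exp_avg"].abs().sum()))  # still non-zero
```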
pytorch/pytorch#2830

1. Recreating the optimizer, using the model parameters.
2. Loading the optimizer state saved in the checkpoint into the optimizer.

modified: onmt/Optim.py
modified: train.py
The fix turned out not to be correct. It is still necessary to (re-)create the optimizer at all times, using the state information. But when loading an optimizer from a checkpoint, in a second stage the saved optimizer state dictionary must be used with the re-created optimizer to set the `optimizer.state` field. In the case of Adam, for example, this is what restores the parameter history from the previous epoch, which was previously lost because the second step was not done. As one last thing for this fix to work, if the GPU is used, the relevant restored optimizer state variables must be converted to their CUDA counterparts. Note that this fix was inspired by a fix for a similar problem, discussed at pytorch/pytorch#2830.

modified: train.py
Hi @lzcn, how do you know the specific GPU location of different tensors in advance?
Would a feature where all torch.save() calls automatically make use of a generated CPU version be feasible?
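No such flag exists in `torch.save` itself as far as I know, but the idea can be approximated by copying the optimizer state to CPU before saving (a sketch; the helper name `cpu_state_dict` is made up, and an in-memory buffer stands in for a checkpoint file):

```python
import copy
import io
import torch
from torch import nn, optim

def cpu_state_dict(optimizer):
    """Deep-copy an optimizer's state_dict with every tensor moved to CPU."""
    out = copy.deepcopy(optimizer.state_dict())
    for state in out["state"].values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.cpu()
    return out

model = nn.Linear(3, 1)
opt = optim.Adam(model.parameters())
model(torch.randn(2, 3)).sum().backward()
opt.step()

buf = io.BytesIO()                       # stands in for a file on disk
torch.save({"optimizer": cpu_state_dict(opt)}, buf)
buf.seek(0)
ckpt = torch.load(buf)
```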
I met a similar problem. Following @dogancan's solution, I recreated the Adam optimizer (without any optimizer.cuda()) after reloading the model, calling model.cuda(), and wrapping with DataParallel(model).
Thanks, it works!
If you don't configure this string of code, you will get an error when you iterate over the update from 4000_checkpoint.tar:

```python
encoder_optimizer.step()
```

Error message:

```
exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: expected backend CPU and dtype Float but got backend CUDA and dtype Float
```

Fix it (pytorch/pytorch#2830):

```python
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)  # missing line from original code
        labels = labels.to(device)  # missing line from original code
        images = images.reshape(-1, 28 * 28)
        out = model(images)
        _, predicted = torch.max(out.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
```
Same error, other fix (pytorch/pytorch#2830):

```python
model = Model()
model.load_state_dict(checkpoint['model'])
model.cuda()
optimizer = optim.Adam(model.parameters())
optimizer.load_state_dict(checkpoint['optimizer'])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()
```
@apaszke

```python
model = Model()
model.cuda()
optimizer = optim.Adam(model.parameters())
for d, gt in trn_dataloader:
    # train
    ...
    optimizer.step()
model.cpu()   # move to cpu
# eval or do other things
...
model.cuda()  # but finally, move back
```

Does the optimizer run as expected? Also, if doing …
After loading an optimizer originally saved on GPU, there seems to be a device mismatch issue. Solution has been adapted from [here](pytorch/pytorch#2830 (comment))
@apaszke Is there a problem if you switch the order to something like this? Meaning, moving the model to 'cuda' but only loading its state dict from the checkpoint after first loading the optimizer's state dict?
To summarize the problem: the optimizer's state will be loaded onto the same device as the model. You must load the model onto the GPU first, and only then load the optimizer's state, so that both the model and the optimizer's state end up on the GPU.
Instead of moving the optimizer to CUDA after loading it on the CPU, you could load the checkpoint directly onto CUDA:

```python
model.to(device)
ckpt = torch.load(<model_path>, map_location=device)
model.load_state_dict(ckpt['state_dict'])
optimizer.load_state_dict(ckpt['optimizer'])
scheduler.load_state_dict(ckpt['scheduler'])
del ckpt
```
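The same `map_location` argument also covers the reverse case (checkpoint saved on GPU, resumed on a CPU-only machine), since it remaps every tensor at load time. A self-contained sketch, using an in-memory buffer in place of a file path:

```python
import io
import torch
from torch import nn, optim

model = nn.Linear(3, 2)
optimizer = optim.Adam(model.parameters())
model(torch.randn(1, 3)).sum().backward()
optimizer.step()

buf = io.BytesIO()                       # stands in for a checkpoint file
torch.save({"state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict()}, buf)
buf.seek(0)

device = torch.device("cpu")             # e.g. torch.device("cuda:0") on a GPU box
ckpt = torch.load(buf, map_location=device)
model.load_state_dict(ckpt["state_dict"])
optimizer.load_state_dict(ckpt["optimizer"])
```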
I've independently rediscovered that this works :) I should read to the end of the thread next time 😅
I find my code still has the problem. I tried my best to arrange the modules as in the examples shown above. Can anyone give me some hints?

I also tried this:
…t loading when training on cuda (pytorch/pytorch#2830 (comment))
For me, I create the optimizer, load the state (with map_location to CUDA), and pass it to the train loop, where the model is pushed to the device (though I assume that's not needed if the loaded model is already pushed to CUDA by map_location). Specifically, I would get stuck at …
Hi, I encountered this bug:

The code skeleton is like:

It seems the loaded `param_groups` are `torch.cuda.FloatTensor`, and I've tried a workaround to move `optimizer.param_groups` to `cpu`, but it still has the same bug.