
[torch.export] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! #126674

Open
siahuat0727 opened this issue May 20, 2024 · 2 comments
Labels
module: export oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@siahuat0727

siahuat0727 commented May 20, 2024

🐛 Describe the bug

The exported model fails to run inference on CUDA.

import torch
ep = torch.export.load('retina.pt2')
gm = ep.module()
gm(torch.rand(1, 3, 800, 1216))  # success

gm = ep.module().cuda()
gm(torch.rand(1, 3, 800, 1216).cuda())  # failed

retina.pt2

Maybe related to #121761, but the solution provided in this comment doesn't work.
@angelayi Could you have a look at this? Thank you.

(Sorry, I'm not providing the export code for now because it's a bit complicated.)

Versions

Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
Clang version: Could not collect
CMake version: version 3.10.2
Libc version: glibc-2.27

Python version: 3.8.19 (default, Mar 20 2024, 19:58:24) 

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang

@angelayi
Contributor

Can you please share the error message and describe what went wrong when you tried applying the suggestion in the comment?

@xmfan xmfan added needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user and removed needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user labels May 20, 2024
@siahuat0727
Author

Sure. Thanks for looking at this.

Traceback (most recent call last):
  File "/home/chensf/git/mmdeploy_export_onnx/reproduce_cuda_error_bug.py", line 7, in <module>
    gm(torch.rand(1, 3, 800, 1216).cuda())  # failed
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/fx/graph_module.py", line 737, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/fx/graph_module.py", line 317, in __call__
    raise e
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/fx/graph_module.py", line 304, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "<eval_with_key>.7", line 817, in forward
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/_ops.py", line 595, in __call__
    return self_._op(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The error message when I tried to apply the suggestion is the same.

Traceback (most recent call last):
  File "/home/chensf/git/mmdeploy_export_onnx/reproduce_cuda_error_bug.py", line 20, in <module>
    gm(torch.rand(1, 3, 800, 1216).cuda())  # failed
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/fx/graph_module.py", line 737, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/fx/graph_module.py", line 317, in __call__
    raise e
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/fx/graph_module.py", line 304, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "<eval_with_key>.7", line 817, in forward
  File "/root/miniconda3/envs/open-mmlab/lib/python3.10/site-packages/torch/_ops.py", line 595, in __call__
    return self_._op(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

And this is the code I used to try the suggestion:

import torch
ep = torch.export.load('retina.pt2')
gm = ep.module()
gm(torch.rand(1, 3, 800, 1216))  # success

for node in ep.graph.nodes:
    if "device" in node.kwargs:
        kwargs = node.kwargs.copy()
        kwargs["device"] = "cuda"
        node.kwargs = kwargs

# Move state dict tensors to cuda
for k, v in ep.state_dict.items():
    if isinstance(v, torch.nn.Parameter):
        ep._state_dict[k] = torch.nn.Parameter(v.cuda())
    else:
        ep._state_dict[k] = v.cuda()

gm = ep.module()
gm(torch.rand(1, 3, 800, 1216).cuda())  # failed

Also note that a link to this model is provided in the issue referenced above.

@mlazos mlazos added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 22, 2024