
"addmm_cuda" not implemented for 'Long' #1009

Closed
meanderingstream opened this issue Dec 12, 2022 · 6 comments
Closed

"addmm_cuda" not implemented for 'Long' #1009

meanderingstream opened this issue Dec 12, 2022 · 6 comments
Labels
area:torchx Applies to Torchx

Comments

@meanderingstream

I have the Axon VAE notebook, fashionmnist_vae.livemd, running under Torchx CPU. I can regularly get the notebook to fail when executing the Enum.at line in the following:

{input_batch, target_batch} = Enum.at(train_data, 0)

It also fails with Enum.take(train_data, 0).

The same notebook runs just fine using XLA.

Mix.install(
  [
    # {:exla, "~> 0.4.0"},
    # {:exla, "~> 0.4.1"},
    {:torchx, "~> 0.4.1"},
    # {:nx, "~> 0.4.0", override: true},
    {:nx, "~> 0.4.1"},
    {:axon, "~> 0.3.0"},
    {:req, "~> 0.3.1"},
    {:kino, "~> 0.7.0"},
    {:scidata, "~> 0.1.9"},
    {:stb_image, "~> 0.5.2"},
    {:kino_vega_lite, "~> 0.1.6"},
    {:vega_lite, "~> 0.1.6"},
    {:table_rex, "~> 3.1.1"}
  ],
  # system_env: %{"XLA_TARGET" => "cuda111"}
  system_env: %{"LIBTORCH_TARGET" => "cu116"}
)

alias VegaLite, as: Vl

# This speeds up all our Nx operations without having to use defn
# Nx.global_default_backend(EXLA.Backend)
Nx.global_default_backend(Torchx.Backend)

I have CUDA Toolkit 11.8 and cuDNN installed.

terminate called after throwing an instance of 'c10::Error'
what(): "addmm_cuda" not implemented for 'Long'
Exception raised from operator() at ../aten/src/ATen/native/cuda/Blas.cpp:311 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6b (0x7fac60c452eb in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xce (0x7fac60c40cbe in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libc10.so)
frame #2: + 0x2e7ba41 (0x7fac0f27ba41 in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cuda_cu.so)
frame #3: at::native::structured_mm_out_cuda::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&) + 0x53 (0x7fac0f27bcc3 in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cuda_cu.so)
frame #4: + 0x2bc09ac (0x7fac0efc09ac in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cuda_cu.so)
frame #5: + 0x2bc0a63 (0x7fac0efc0a63 in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cuda_cu.so)
frame #6: + 0x1e5be32 (0x7fac3685be32 in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cpu.so)
frame #7: at::_ops::mm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) + 0x76 (0x7fac3685c3b6 in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cpu.so)
frame #8: + 0x3297ebf (0x7fac37c97ebf in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cpu.so)
frame #9: + 0x3298d46 (0x7fac37c98d46 in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cpu.so)
frame #10: at::_ops::mm::call(at::Tensor const&, at::Tensor const&) + 0xdf (0x7fac368a621f in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cpu.so)
frame #11: at::native::tensordot(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef) + 0xaff (0x7fac35f6a4df in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cpu.so)
frame #12: + 0x229402b (0x7fac36c9402b in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cpu.so)
frame #13: at::_ops::tensordot::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef) + 0x1a8 (0x7fac36692888 in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/libtorch/libtorch_cpu.so)
frame #14: tensordot(enif_environment_t*, int, unsigned long const*) + 0x9f9 (0x7fad00871789 in /home/ml3/.cache/mix/installs/elixir-1.14.2-erts-13.1.2/75737351c79772535db35cdfe6072671/_build/dev/lib/torchx/priv/torchx.so)
frame #15: erts_call_dirty_nif + 0x1ec (0x560f5f300b0c in /home/ml3/.asdf/installs/elixir/1.14.2-otp-25/.mix/escripts/livebook)
frame #16: erts_dirty_process_main + 0x20b (0x560f5f17770b in /home/ml3/.asdf/installs/elixir/1.14.2-otp-25/.mix/escripts/livebook)
frame #17: + 0x6a815 (0x560f5f0cc815 in /home/ml3/.asdf/installs/elixir/1.14.2-otp-25/.mix/escripts/livebook)
frame #18: + 0x360520 (0x560f5f3c2520 in /home/ml3/.asdf/installs/elixir/1.14.2-otp-25/.mix/escripts/livebook)
frame #19: + 0x94b43 (0x7fad4623cb43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: + 0x126a00 (0x7fad462cea00 in /lib/x86_64-linux-gnu/libc.so.6)
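
For anyone trying to isolate this outside the notebook: based on the trace above (tensordot → mm → addmm on CUDA), a plausible minimal reproduction is a matrix product on 64-bit integer tensors with the Torchx CUDA device. The snippet below is only a sketch under that assumption; the shapes and values are made up and are not taken from the notebook:

Nx.default_backend({Torchx.Backend, device: :cuda})

# :s64 in Nx maps to LibTorch's Long dtype
a = Nx.iota({2, 3}, type: :s64)
b = Nx.iota({3, 2}, type: :s64)

# Goes through Torchx's tensordot NIF into at::mm on the GPU and
# should raise: "addmm_cuda" not implemented for 'Long'
Nx.dot(a, b)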

@josevalim
Collaborator

I wonder how Python handles such cases. Do they check for the device and use separate operations?

@meanderingstream
Author

Since I can regularly repeat this at the same section of the notebook, I don't think it is really about the device per se. This is the first point in the notebook where data is retrieved from a stream.

@meanderingstream
Author

Since XLA executes the notebook just fine, it is something specific to Torchx.

@meanderingstream
Author

My matmul notebook using Torchx 0.4.1 and cu116 works just fine: https://github.com/meanderingstream/dl_foundations_in_elixir/blob/main/01h_matmul_Torchx_gpu.livemd. That notebook doesn't use streams.

@josevalim
Collaborator

Sorry, in this case I meant :gpu/:cuda as the device. The operation is not implemented at the low level for CUDA, so they have to handle it elsewhere, usually by downcasting/upcasting before/after performing it.
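
As a concrete illustration of that cast-around approach (a hedged sketch reusing the a/b tensors from the reproduction above, not code from the notebook or from Torchx itself):

# Cast to a float type before the product so a CUDA kernel exists for
# the dtype, then cast back if an integer result is actually needed.
Nx.dot(Nx.as_type(a, :f32), Nx.as_type(b, :f32))
|> Nx.as_type(:s64)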

josevalim added the area:torchx label on Dec 13, 2022
josevalim changed the title from "Runtime node terminated unexpectedly - no connection" to "addmm_cuda" not implemented for 'Long' on Jan 26, 2023
@josevalim
Collaborator

I will go ahead and close this as a LibTorch bug. You either need to use a policy to downcast to a lower precision, or LibTorch has to implement the relevant operation on CUDA.
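
For the notebook itself, one way to apply that advice (a sketch with an assumed variable name; `train_images` here stands for whatever integer pixel tensor Scidata returns and is not necessarily what the livemd calls it) is to move the data to a float type while normalizing, before any batching or matrix products reach the Torchx CUDA device:

# Hypothetical normalization step: casting the u8/integer pixels to :f32
# and scaling to [0, 1] keeps every later matmul on float CUDA kernels.
train_images =
  train_images
  |> Nx.as_type(:f32)
  |> Nx.divide(255)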

josevalim closed this as not planned on May 12, 2024