This repository has been archived by the owner on Mar 6, 2023. It is now read-only.

Illegal instruction (core dumped) on Raspberry Pi 4B #8

Open
unwind opened this issue Jul 7, 2021 · 11 comments

Comments

@unwind

unwind commented Jul 7, 2021

When running with the latest (1.9.0) wheel from here, as per the installation instructions, my project's Torch code crashes every time with an illegal instruction exception.

The top 10 stack levels looked like this:

(gdb) where
#0  0x0000ffffd286dfc8 in exec_blas ()
   from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#1  0x0000ffffd283f150 in gemm_driver ()
   from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#2  0x0000ffffd283fbd0 in sgemm_thread_nn ()
   from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#3  0x0000ffffd28385bc in sgemm_ () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#4  0x0000ffffcfc38b8c in at::native::cpublas::gemm(at::native::cpublas::TransposeType, at::native::cpublas::TransposeType, long, long, long, float, float const*, long, float const*, long, float, float*, long) ()
   from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#5  0x0000ffffcfce5c48 in at::native::addmm_impl_cpu_(at::Tensor&, at::Tensor const&, at::Tensor, at::Tensor, c10::Scalar const&, c10::Scalar const&) () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#6  0x0000ffffcfce68d0 in at::native::mm_cpu_out(at::Tensor const&, at::Tensor const&, at::Tensor&) ()
   from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#7  0x0000ffffcfce6a34 in at::native::mm_cpu(at::Tensor const&, at::Tensor const&) ()
   from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#8  0x0000ffffd056c784 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_mm>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) ()
   from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#9  0x0000ffffd039b464 in at::redispatch::mm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) ()
   from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
#10 0x0000ffffd1b5659c in torch::autograd::VariableType::(anonymous namespace)::mm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from lib/python3.8/site-packages/torch/lib/libtorch_cpu.so

Looking at the disassembly at the indicated location I got:

(gdb) disassemble
Dump of assembler code for function exec_blas:
   0x0000ffffd286df70 <+0>:     adrp    x2, 0xffffd40c6000
   0x0000ffffd286df74 <+4>:     stp     x29, x30, [sp, #-80]!
   0x0000ffffd286df78 <+8>:     mov     x29, sp
   0x0000ffffd286df7c <+12>:    ldr     x3, [x2, #2376]
   0x0000ffffd286df80 <+16>:    mov     x2, x0
   0x0000ffffd286df84 <+20>:    stp     x19, x20, [sp, #16]
   0x0000ffffd286df88 <+24>:    mov     x20, x1
   0x0000ffffd286df8c <+28>:    ldr     w0, [x3]
   0x0000ffffd286df90 <+32>:    cbz     w0, 0xffffd286e000 <exec_blas+144>
   0x0000ffffd286df94 <+36>:    cmp     x2, #0x0
   0x0000ffffd286df98 <+40>:    ccmp    x20, #0x0, #0x4, gt
   0x0000ffffd286df9c <+44>:    b.eq    0xffffd286dff0 <exec_blas+128>  // b.none
   0x0000ffffd286dfa0 <+48>:    adrp    x19, 0xffffd4150000 <memory+1984>
   0x0000ffffd286dfa4 <+52>:    add     x1, sp, #0x38
   0x0000ffffd286dfa8 <+56>:    add     x4, x19, #0x4f0
   0x0000ffffd286dfac <+60>:    mov     w0, #0x1                        // #1
   0x0000ffffd286dfb0 <+64>:    add     x4, x4, #0x40
   0x0000ffffd286dfb4 <+68>:    nop
   0x0000ffffd286dfb8 <+72>:    nop
   0x0000ffffd286dfbc <+76>:    nop
   0x0000ffffd286dfc0 <+80>:    strb    wzr, [sp, #56]
   0x0000ffffd286dfc4 <+84>:    mov     w3, #0x0                        // #0
=> 0x0000ffffd286dfc8 <+88>:    casalb  w3, w0, [x4]
   0x0000ffffd286dfcc <+92>:    cbnz    w3, 0xffffd286dfc0 <exec_blas+80>
   0x0000ffffd286dfd0 <+96>:    adrp    x0, 0xffffd286d000 <inner_thread+2192>
   0x0000ffffd286dfd4 <+100>:   stp     x2, x20, [sp, #56]
   0x0000ffffd286dfd8 <+104>:   add     x0, x0, #0xbcc
   0x0000ffffd286dfdc <+108>:   str     xzr, [sp, #72]

This seems to indicate that the culprit is the CASALB instruction, which as far as I can tell is ARMv8.1, while the Raspberry Pi 4 has an ARMv8.0-compliant core.
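For what it's worth, CASALB belongs to the ARMv8.1 Large System Extensions (LSE), which the Linux kernel advertises on aarch64 as the "atomics" flag in the Features line of /proc/cpuinfo. A minimal sketch of a check (the helper names here are my own, not anything from PyTorch):

```python
# Check whether the CPU advertises ARMv8.1 LSE atomics (the extension
# that provides CASALB). On aarch64 Linux the kernel lists it as the
# "atomics" flag in /proc/cpuinfo; the Pi 4's Cortex-A72 (ARMv8.0)
# does not have it, so a binary using CASALB dies with SIGILL there.

def has_lse_atomics(features_line: str) -> bool:
    """Return True if a cpuinfo 'Features' line lists LSE atomics."""
    return "atomics" in features_line.split()

def cpu_supports_lse(path: str = "/proc/cpuinfo") -> bool:
    """Scan /proc/cpuinfo for the aarch64 'atomics' feature flag."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("Features"):
                    return has_lse_atomics(line.partition(":")[2])
    except OSError:
        pass
    return False

# Example: a Cortex-A72 Features line lacks "atomics".
print(has_lse_atomics("fp asimd evtstrm aes sha1 sha2 crc32 cpuid"))  # False
```

Running this on the Pi 4 should print False, which matches the crash above.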

I hope this can be fixed, since building Torch myself seems daunting (and also since, assuming I'm right above, this is not the intended behavior).

Thanks for making this available.

@KumaTea
Owner

KumaTea commented Jul 7, 2021

Hi.

Currently PyTorch wheels for Python 3.6 - 3.9 are installed from the official PyPI source.
It's very likely that the PyTorch team built the wheels on enterprise cloud VMs with ARM CPUs, which are ARMv8.2 based, so the official wheels don't disable ARMv8.1 instructions.

Do you have any sample code?
I don't know much C, but I'll try to compile from source if the problem reproduces, in a few days (my board is not available this week).

Thanks!

@unwind
Author

unwind commented Jul 7, 2021

Hi.

Thanks for the rapid response. I'm not sure I have code I can share; perhaps I can stitch something together, but it will be a while. This is my last week before going on vacation, and there are other things to do in the project.

Thanks!

@unwind
Author

unwind commented Jul 7, 2021

Hi again.

Okay, here's an attempt at a reproduction case:

#!/usr/bin/env python3
import torch
w_bn = torch.randn(64,64)
w_conv = torch.randn(64,108)
w = torch.randn(64, 12, 3, 3)
w.copy_(torch.mm(w_bn, w_conv).view(w.size()))

This crashes with a core dump every time I run it. Apologies for the random-seeming dimensions; it's just what our project seemed to be using (I'm not the author of the PyTorch-using code in our project, so I lack deeper understanding).

I did not trace this down to the core dump, but I would say chances are pretty good it's the same crash. I now understand that the "mm" in the trace above refers to matrix multiplication, and this line of code (the same as in our project, except of course that the data has been replaced with random matrices) calls mm() and never returns.

Good luck!

@KumaTea
Owner

KumaTea commented Jul 7, 2021

Hi, thank you for your replies!

I just tried your sample code on Python 3.8 (official wheel) and Python 3.10 (wheel from this repo). The results are:

On Python 3.8, bash reported Illegal instruction (core dumped) and exited,
and fish reported fish: Job 1, “python3” terminated by signal SIGILL (Illegal instruction) then exited.

On Python 3.10, it printed

tensor([[[[ 1.0143e+01, -1.2882e+01, -1.1660e+00],
          [ 1.1609e+01,  7.4942e+00,  3.3680e-01],
          [ 8.7291e+00,  1.7029e+01, -1.6758e+01]],

         ...

successfully.

I don't really understand what the code means, but I think this supports the assumption above.


I'll build wheels for 3.6 - 3.9 asap. Thank you again!

@unwind
Author

unwind commented Jul 7, 2021

Okay great, feel free to drop me a line when you have wheels available and hopefully I can test, too.

Thanks!

@KumaTea
Owner

KumaTea commented Jul 11, 2021

Hi @unwind, the wheels are updated. You may try them and see if it works (link for Python 3.8 here)!

@KannebTo

KannebTo commented Aug 2, 2021

Hi,
I have the same problem on my Raspberry Pi 4B with python 3.8.10.
Sadly after installing the updated wheel, the situation is the same.

@KumaTea
Owner

KumaTea commented Aug 3, 2021

> Hi,
> I have the same problem on my Raspberry Pi 4B with python 3.8.10.
> Sadly after installing the updated wheel, the situation is the same.

Hi, could you provide your error report(s) and sample code?

I've tried the wheel with the code above, and it worked normally. Could it be some difference in your code that causes the problem?

Thanks!

@KannebTo

KannebTo commented Aug 3, 2021

Hmm. I tested the same code and it is the problematic CASALB instruction. The official Torch version 1.8.1 is working.

Oh, I see. After upgrading again it is working. Maybe the problem was not uninstalling the official 1.9.0 before installing this version.
Thanks!

@unwind
Author

unwind commented Aug 20, 2021

Hi!

It does seem to resolve the issue for me on my Raspberry target. I had to (as you say) download your wheel manually and pip install it directly from the file, but that was expected and worked well.

Thanks!

@AbdelsalamHaa

AbdelsalamHaa commented Dec 13, 2022

Hi, I have the same issue here, but it happens when I try to feed an image to my model. I'm using one of the existing models in PyTorch.

I tried to download the wheel manually and install it, but got an error indicating that
torch-1.9.0-cp310-cp310-linux_aarch64.whl is not a supported wheel on this platform. I tried with cp36 linux and cp36 manylinux, but all gave the same result.

I have tried multiple models; one of them is as follows:
net = models.quantization.mobilenet_v2(pretrained=True)

@unwind perhaps you can let me know which wheel file you tried. Thanks!

I'm using Python 3.9.2.
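A note on the "not a supported wheel" error: pip matches the cpXY tag embedded in the wheel filename against the running interpreter, so a cp310 (Python 3.10) or cp36 wheel can never install on Python 3.9; a cp39 wheel is needed. A simplified sketch of that check (the parsing helper is my own, based on the standard wheel filename layout, not pip's actual code):

```python
# Wheel filenames follow name-version(-build)?-pythontag-abitag-platform.whl.
# pip refuses a wheel whose python tag doesn't match the interpreter,
# which is why a cp310 wheel fails on Python 3.9 with
# "... is not a supported wheel on this platform".
import sys

def wheel_python_tag(filename: str) -> str:
    """Extract the python tag (e.g. 'cp310') from a wheel filename."""
    # The python tag is always the third field from the end.
    return filename[:-len(".whl")].split("-")[-3]

def matches_interpreter(filename: str) -> bool:
    """True if the wheel's python tag matches this interpreter."""
    this = f"cp{sys.version_info.major}{sys.version_info.minor}"
    return wheel_python_tag(filename) == this

print(wheel_python_tag("torch-1.9.0-cp310-cp310-linux_aarch64.whl"))  # cp310
```

So on Python 3.9.2 the file to look for would carry a cp39 tag.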
