Illegal instruction 4 in 1.3, OSX & GPU #27627
Comments
Seems bad. We should repro and fix this.
Unfortunately, a fix for this is not making it into v1.3.0.
Any way I can help reproduce or diagnose this?
Just want to double check: the error also occurs if you don't use valgrind, is that right? (We had some issues with valgrind in the past where valgrind didn't understand an instruction but it was fine.) @elbamos is it possible to do a bisect while cherry-picking the other commit?
Yes. I’m compiling with debug info and running valgrind to provide the issue with something to go on.
Removing triaged as this wasn't triaged.
@ezyang I tried to bisect this. It's coming from the named tensor code; that's why it wasn't showing up until named tensors were permanently activated. Trying to go further back and compiling with named tensors enabled, it's present in 44bd63c at the end of August. Going back further than that, we get to a place in the code where there are layers of other errors that prevent compilation or loading. What else can I do to help you track this down?
I find it hard to believe that named tensor would cause this. In any case, at this point, I think one of us has to repro it.
What can I do to convince you?
I can't reproduce this locally with the built binary (does that crash for you, @elbamos?). I can't reproduce this while building from source, but I don't have the same configuration (I'm using Apple LLVM version 10.0.1, Python 3.6.7, and no CUDA).
No, it doesn't crash unless it's been compiled with GPU support.
That isn't surprising: from the valgrind output, the error occurs when the CUDA libraries are being loaded. If I include more of the valgrind trace, you can see which of the libraries loaded successfully. Would that be helpful to you?
Here's the full valgrind output...
I wish we had an OS X machine with a GPU locally; this would make it a lot easier XD. (I guess there's no chance of this happening on Linux, is there...)
I reviewed the valgrind output, and while I agree that there are CUDA-related errors in the output, the actual illegal instruction is, IMO, unrelated. The backtrace for the illegal instruction is clearly related to our vectorization support. Since we're having trouble reproducing, what I would like to see is whether, when you do a build with CUDA disabled, the same error occurs. I think it will! If it doesn't, that would be very interesting information.
You're correct - I do get the same error without CUDA. (Valgrind output attached.) I also tried the distribution on conda, and get the same error from that. So, it has to be conflicting with something installed somewhere on my system. If you have any advice for tracking it down, I'd appreciate it. Regarding an OS X machine with a GPU, those are pretty cheap these days, because they're mostly 7+ years old. In fact, I have an old Mac Pro I'd contribute to the cause, but you'd have to put a GPU in it.
I looked a little further. I note this line from valgrind:
Bingo. We do detection of CPU capabilities using the cpuinfo library: https://github.com/pytorch/cpuinfo. You might be able to work out what the bug in the detection here is. If you need a cheap and cheerful workaround, you could edit aten/src/ATen/native/DispatchStub.h to just never consider AVX2.
What more can I do to be helpful?
Ping :)
I'm curious whether, if you check out that commit and begin deleting parts of it, you can see what exactly would cause the error (crash on import torch) to go away. One thing to try is to delete the std::all_of call.
I tried deleting the call to
From the valgrind output, the error occurs in One candidate could be
In there we see one global variable being initialised: pytorch/aten/src/ATen/Dimname.h, line 30 (at 8ffcbfb).
@elbamos could you try running from peterbell10@1a2b2de to see if it changes anything?
@peterbell10 Yup, that did clear it out. Torch loads properly when I replace Dimname.h and .cpp with the versions in that commit.
Applying @peterbell10's approach to v1.3.0, I'm able to get a good, working install.
Very nice catch! A static variable in the header is definitely wrong. @peterbell10 are you going to PR your fix?
I can submit a PR but I'm not entirely convinced this is the whole story and not just a work-around. A static variable in a header is odd, but I don't think it should be causing a SIGILL. To say for sure, @elbamos could try running the dimname tests to see if it fails on
@peterbell10 I ran the dimname tests. I agree with you that this can't be the whole story. In particular, I don't see why the inclusion of that symbol should cause compilation to use AVX instructions. For one thing, that static variable isn't doing anything that should implicate AVX. For another, the build should be detecting that the CPU doesn't support AVX, as it does for the remainder of the codebase. To me, that's the mystery: why the build is not correctly detecting AVX support for this part of the codebase.
Okay, I think I see what's going on now. |
That sounds like the rest of the story! Please let me know if you want me to test another patch.
@peterbell10 Running from your branch, I get the onnx-protobuf errors reported in #26945. Applying the files in #29384 to v1.3.0, I get a clean compile and a working binary.
@ezyang Do you still have an interest in the Mac box? My wife wants it out of our kitchen anyway, so all you have to do is send an Uber to pick it up.
Summary: Fixes pytorch/pytorch#27627. The variable being declared in a header as `static` meant that the global variable is initialized in every source file that includes it. This is particularly problematic when included in AVX source files, as it generates SIGILL on older hardware.
Pull Request resolved: pytorch/pytorch#29384
Differential Revision: D18380379
Pulled By: zou3519
fbshipit-source-id: 0dcd87db01c468a5c9ddb2c695528b85ed2e1504
haha, I'll defer that to @soumith XD |
haha, we should probably pass, considering how many unused boxes are under the common desk |
I can't get it working by simply installing #29967. I don't need GPU support, just a way to install it. I tried with every version, even 0.4.1, but I always get the same error. Do I have to build it?
Dear all, I still have the problem with pytorch-1.4.0:
It works fine with a conda-forge install. The pip and conda versions seem broken.
@Christux there is no conda-forge package for pytorch 1.4.0, so that doesn't quite make sense. Can you please provide reproducible instructions of how to trigger this, including your exact install commands? |
**step 1: classic install**
(see next comment for details)
Thanks @Christux, that helps. Email replies don't support Markdown on GitHub, so I'll copy and reformat your reply here:
step 1: classic install
Output:
step 2: Migration to conda forge
Output:
Okay that clears up one issue:
looks like a bug in the logging output of Then, you don't give the output from
for me. Side note: that changing from |
🐛 Bug
Upon `import torch`, I get Illegal instruction 4. This issue appears to have arisen sometime after 1.2 was released, but it was hard to nail down because of the issue resolved in #27583. For the same reason, I haven't been able to track it to a particular PR.
Compiling with XCode 9.4.1, CUDA 10, Cudnn 7.6.4, Python 3.6.
To Reproduce
Steps to reproduce the behavior:
import torch
Relevant portion of valgrind output:
Expected behavior
Should not terminate because of an illegal instruction.
Environment
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
OS: Mac OSX 10.13.6
GCC version: Could not collect
CMake version: version 3.15.4
Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.0.130
GPU models and configuration: Could not collect
Nvidia driver version: 1.1.0
cuDNN version: Probably one of the following:
/usr/local/cuda/lib/libcudnn.7.dylib
/usr/local/cuda/lib/libcudnn_static.a
Versions of relevant libraries:
[pip] 19.2.3
[conda] mkl 2019.4 233
[conda] mkl-service 2.3.0 py36hfbe908c_0
[conda] mkl_fft 1.0.14 py36h5e564d8_0
[conda] mkl_random 1.1.0 py36ha771720_0
Additional context
cc @ezyang @gchanan @zou3519 @VitalyFedyunin