Segmentation Fault when running with HSA #4217
Can you run with gdb and share the backtrace (py-bt)?
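If gdb's Python extensions are not available, a lighter-weight way to see which Python frame was active at the crash is the standard-library `faulthandler` module. The following is a minimal sketch, not a replacement for the gdb backtrace: the helper name `enable_crash_traceback` and the log file name are made up for illustration.

```python
import faulthandler
import os
import tempfile


def enable_crash_traceback(log_path):
    """Ask CPython to dump the Python traceback to log_path on a fatal signal."""
    # The file has to stay open for the life of the process: faulthandler
    # keeps its file descriptor and writes to it from the signal handler.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location; any writable path would do.
log_path = os.path.join(tempfile.gettempdir(), "segv_traceback.log")
crash_log = enable_crash_traceback(log_path)

# ... run the failing tinygrad snippet after this point ...
```

The same handler can be switched on without touching the script by running `python -X faulthandler ...`, which writes the traceback to stderr.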
Yeah, looks like a conda issue. Here is the stack trace:
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...
Starting program: /home/prairie/anaconda3/envs/tiny/bin/python -c from\ tinygrad\ import\ Tensor\;'
'N\ =\ 1024\;\ a,\ b\ =\ Tensor.rand\(N,\ N\),\ Tensor.rand\(N,\ N\)\;'
'c\ =\ \(a.reshape\(N,\ 1,\ N\)\ \*\ b.T.reshape\(1,\ N,\ N\)\).sum\(axis=2\)\;'
'print\(\(c.numpy\(\)\ -\ \(a.numpy\(\)\ @\ b.numpy\(\)\)\).mean\(\)\)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3fff640 (LWP 11711)]
[New Thread 0x7ffff37fe640 (LWP 11712)]
[New Thread 0x7ffff0ffd640 (LWP 11713)]
[New Thread 0x7fffee7fc640 (LWP 11714)]
[New Thread 0x7fffebffb640 (LWP 11715)]
[New Thread 0x7fffe97fa640 (LWP 11716)]
[New Thread 0x7fffe6ff9640 (LWP 11717)]
[New Thread 0x7fffe47f8640 (LWP 11718)]
[New Thread 0x7fffe1ff7640 (LWP 11719)]
[New Thread 0x7fffdf7f6640 (LWP 11720)]
[New Thread 0x7fffdcff5640 (LWP 11721)]
[New Thread 0x7fffda7f4640 (LWP 11722)]
[New Thread 0x7fffd7ff3640 (LWP 11723)]
[New Thread 0x7fffd57f2640 (LWP 11724)]
[New Thread 0x7fffd2ff1640 (LWP 11725)]
[New Thread 0x7fffce7f0640 (LWP 11726)]
[New Thread 0x7fffcbfef640 (LWP 11727)]
[New Thread 0x7fffcb7ee640 (LWP 11728)]
[New Thread 0x7fffc8fed640 (LWP 11729)]
[New Thread 0x7fffc67ec640 (LWP 11730)]
[New Thread 0x7fffc3feb640 (LWP 11731)]
[New Thread 0x7fffc17ea640 (LWP 11732)]
[New Thread 0x7fffbefe9640 (LWP 11733)]
[New Thread 0x7fffbc7e8640 (LWP 11734)]
[New Thread 0x7fffb9fe7640 (LWP 11735)]
[New Thread 0x7fffb77e6640 (LWP 11736)]
[New Thread 0x7fffb4fe5640 (LWP 11737)]
[New Thread 0x7fffb27e4640 (LWP 11738)]
[New Thread 0x7fffaffe3640 (LWP 11739)]
[New Thread 0x7fffad7e2640 (LWP 11740)]
[New Thread 0x7fffa8fe1640 (LWP 11741)]
[New Thread 0x7fffa67e0640 (LWP 11742)]
[New Thread 0x7fffa5fdf640 (LWP 11743)]
[New Thread 0x7fffa37de640 (LWP 11744)]
[New Thread 0x7fffa0fdd640 (LWP 11745)]
[New Thread 0x7fff9e7dc640 (LWP 11746)]
[New Thread 0x7fff9bfdb640 (LWP 11747)]
[New Thread 0x7fff977da640 (LWP 11748)]
[New Thread 0x7fff96fd9640 (LWP 11749)]
[New Thread 0x7fff947d8640 (LWP 11750)]
[New Thread 0x7fff91fd7640 (LWP 11751)]
[New Thread 0x7fff8f7d6640 (LWP 11752)]
[New Thread 0x7fff8cfd5640 (LWP 11753)]
[New Thread 0x7fff8a7d4640 (LWP 11754)]
[New Thread 0x7fff87fd3640 (LWP 11755)]
[New Thread 0x7fff837d2640 (LWP 11756)]
[New Thread 0x7fff82fd1640 (LWP 11757)]
[New Thread 0x7fff807d0640 (LWP 11758)]
[New Thread 0x7fff7bfcf640 (LWP 11759)]
[New Thread 0x7fff7b7ce640 (LWP 11760)]
[New Thread 0x7fff78fcd640 (LWP 11761)]
[New Thread 0x7fff747cc640 (LWP 11762)]
[New Thread 0x7fff71fcb640 (LWP 11763)]
[New Thread 0x7fff717ca640 (LWP 11764)]
[New Thread 0x7fff6efc9640 (LWP 11765)]
[New Thread 0x7fff6c7c8640 (LWP 11766)]
[New Thread 0x7fff69fc7640 (LWP 11767)]
[New Thread 0x7fff697c6640 (LWP 11768)]
[New Thread 0x7fff64fc5640 (LWP 11769)]
[New Thread 0x7fff627c4640 (LWP 11770)]
[New Thread 0x7fff5ffc3640 (LWP 11771)]
[New Thread 0x7fff5b7c2640 (LWP 11772)]
[New Thread 0x7fff5afc1640 (LWP 11773)]
[New Thread 0x7fff4bfff640 (LWP 11774)]
[New Thread 0x7fff4b7fe640 (LWP 11775)]
[Thread 0x7fff4b7fe640 (LWP 11775) exited]
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
P_set (ptr=0x7ffe3ae00000, value=<optimized out>, size=<optimized out>) at /usr/local/src/conda/python-3.11.8/Modules/_ctypes/cfield.c:1463
1463 /usr/local/src/conda/python-3.11.8/Modules/_ctypes/cfield.c: No such file or directory.
Is there a way to use it with conda, so that it works within an env?
I ran into this issue outside conda, just using pip install git+https....
Copyright (C) 2022 Free Software Foundation, Inc.
(gdb) run
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
Installing version 0.8.0 with pip works.
@nimlgen I have been looking through the codebase to see how it is generating the source path "/usr/local/src/conda/python-3.11.8/Modules/_ctypes/cfield.c", but have been unsuccessful. Any ideas on where to start debugging? I would prefer to use the main branch of tinygrad rather than 0.8.0.
Do you have an integrated GPU? If so, you can manage the visible GPUs with ROCR_VISIBLE_DEVICES: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html#rocr-visible-devices
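For anyone who wants to try the GPU isolation suggestion from Python, here is a rough sketch of building an environment with `ROCR_VISIBLE_DEVICES` set before launching the reproduction in a child process. The helper name `env_with_visible_gpus` is made up, and the device index `0` is only an assumption; the right index depends on how the ROCm runtime enumerates the GPUs on your machine.

```python
import os


def env_with_visible_gpus(device_indices, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs to ROCr."""
    # Copy so that the caller's environment (or os.environ) is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # ROCR_VISIBLE_DEVICES takes a comma-separated list of device indices.
    env["ROCR_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_indices)
    return env


# Assumption: index 0 is the dedicated GPU on this machine.
env = env_with_visible_gpus([0])
print(env["ROCR_VISIBLE_DEVICES"])
```

The returned mapping can be passed as `env=` to `subprocess.run([sys.executable, "-c", "..."], env=env)`, so the variable is in place before the HSA runtime is loaded in the child process.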
kernargs isn't zero for me. I have a dedicated GPU, a 7900 XTX.
(tiny) prairie@TRX40:~/Projects$ DEBUG=3 python3 -c "from tinygrad import Tensor;
N = 1024; a, b = Tensor.rand(N, N), Tensor.rand(N, N);
c = (a.reshape(N, 1, N) * b.T.reshape(1, N, N)).sum(axis=2);
print((c.numpy() - (a.numpy() @ b.numpy())).mean())"
opening device METAL from pid:8864
opening device HSA from pid:8864
opening device NPY from pid:8864
*** CUSTOM 1 custom_random arg 1 mem 0.00 GB
*** CUSTOM 2 custom_random arg 1 mem 0.01 GB
0 ━┳ STORE MemBuffer(idx=0, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1024, 1024, 1), strides=(1024, 1, 0), offset=0, mask=None, contiguous=True),)))
1 ┗━┳ SUM ((2,), dtypes.float)
2 ┗━┳ MUL
3 ┣━━ LOAD MemBuffer(idx=1, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1024, 1024, 1024), strides=(1024, 0, 1), offset=0, mask=None, contiguous=False),)))
4 ┗━━ LOAD MemBuffer(idx=2, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1024, 1024, 1024), strides=(0, 1, 1024), offset=0, mask=None, contiguous=False),)))
> /home/prairie/Projects/tinygrad/tinygrad/runtime/ops_hsa.py(88)__call__()
-> kernargs = self.device.alloc_kernargs(self.kernargs_segment_size)
(Pdb) n
> /home/prairie/Projects/tinygrad/tinygrad/runtime/ops_hsa.py(89)__call__()
-> args_st = self.args_struct_t.from_address(kernargs)
(Pdb) p kernargs
130574266138624
(Pdb)
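Some background on why the crash shows up in cfield.c: as far as I understand CPython's ctypes, `Structure.from_address` only records the address, and memory is touched only when a field is read or assigned. `P_set` is the setter ctypes uses for pointer-sized fields such as `c_void_p`, so a fault there would be consistent with the write to the kernargs address failing, rather than the address being null. The sketch below shows the same pattern against ordinary host memory; `ArgsStruct` and its two fields are hypothetical stand-ins, not tinygrad's actual argument layout.

```python
import ctypes


class ArgsStruct(ctypes.Structure):
    # Hypothetical layout: two buffer pointers standing in for kernel arguments.
    _fields_ = [("buf0", ctypes.c_void_p), ("buf1", ctypes.c_void_p)]


# Host memory that is known to be mapped and writable.
backing = ctypes.create_string_buffer(ctypes.sizeof(ArgsStruct))
address = ctypes.addressof(backing)

# from_address just wraps the raw address; nothing is read or written yet.
args = ArgsStruct.from_address(address)

# Assigning a c_void_p field writes pointer-sized data at that address.
# If the address were not writable from the CPU, this is where a SIGSEGV
# would be raised.
args.buf0 = 0x1000
args.buf1 = 0x2000

# The write landed in the backing buffer itself, not in a copy.
stored = ctypes.c_void_p.from_buffer(backing).value
print(hex(stored))
```

With host memory the assignment succeeds; the point is only that a valid-looking, non-zero address from `alloc_kernargs` does not by itself show that the CPU can write to it.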
I could be wrong, but it just seems like that pointer for the function definition of
Running the following with a fresh conda environment with Python 3.11. CPU: 3970X Threadripper. GPU: 7900 XTX. OS: Ubuntu 22.04.1.
Results in.