New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Calling np.exp(0) causes many lapack calls to return NaNs on CPUs with AVX512 and MKL #20356
Comments
Small update: Changing |
@seiko2plus any thoughts? |
Very confusing, I wonder if there could be any relation to compiler versions? Is it possible that OpenBLAS was missing some initializations in its kernels and we bumped the OpenBLAS version? Or is this with a self-compiled NumPy where the BLAS would be identical? |
I installed both 1.21.2 and 1.21.4 using "conda install numpy==...." and my versions were as follows: numpy == 1.21.2$ conda list -n np1212 | grep -i 'blas\|numpy'
blas 1.0 mkl
numpy 1.21.2 py39h20f2e39_0
numpy-base 1.21.2 py39h79a1101_0 numpy == 1.21.4
So I do indeed have very different versions of BLAS installed in the two cases. Perhaps this means that the real difference is the two versions of BLAS and not the two numpy versions. |
hmm, so maybe a mkl bug? |
The example works fine with numpy 1.20.3 and mkl: numpy == 1.20.3 and mkl blas
So if it is an mkl issue, it's an issue that wasn't triggered by 1.20.3. |
Interesting, because if this is MKL, it does look similar to: gh-20233 |
@hodgestar I couldn't replicate the bug. I am running this on a Intel SKX processor with AVX512 and same packages as you mention:
Am I missing something? |
@r-devulap Are you also using Python 3.9.7 (default, Sep 16 2021, 13:09:58) and GCC 7.5.0? In case it makes a difference, I'm running it on Linux machine. |
Funny, looks like it doesn't fail on Python 3.8.12. Here are the steps I followed:
Run your script and it doesn't fail. Repeat the same with Python 3.9.7:
And I see the error:
|
@r-devulap I'm glad you could reproduce it! This is definitely a tricky bug. :| |
A colleague encountered a perhaps related error some time ago. The minimal reproducer he found was: # eig.py
import numpy as np
import scipy.linalg
x = np.random.rand(25, 25) + 1j*np.random.rand(25, 25)
scipy.linalg.eig(x) This appeared on macOS (but maybe elsewhere too) and sometimes segfaulted. When run with debug malloc like so, it would sometimes produce a warning:
And valgrind lke so:
Would sometimes give an error. See qutip/qutip#1160 for full details (search for the "Minimal reproducer" heading). I'm not 100% sure it's related, but it is another data point. |
Maybe it is time to push this to MKL? I suppose a pure We are now at similar issues with different calls and two different interfaces (since I don't think scipy/numpy share a code here?). |
@seberg I would like the possibility that this is a numpy bug taken more seriously. Yes, it definitely could be entirely MKL doing something badly, but it works fine in numpy 1.20 (I should have perhaps made that a bit clearer in my bug report) and MKL works fine without the call to The fact that the Python version makes a difference also argues for MKL not doing something egregiously wrong and the error being more complex than that. I'm not asking you to fix this -- I'm just asking for you not to just declare this MKL's problem. I'm happy to report this to MKL too though if someone points me in the right direction for reporting things there. |
@hodgestar, please could you disable dispatching export NPY_DISABLE_CPU_FEATURES="AVX2 FMA3 AVX512F AVX512_SKX" If the error disappeared then it belongs to us, and probably fixed by #19529 (just a guess).
if there's something that needs to be cleaned when AVX is involved would be EDIT: extra info related to cleanup SIMD/AVX kernels |
@seiko2plus Thank you! I ran the experiment you requested add these are the results: With AVX512/AVX2 (same environment as described above):
Without AVX512/AVX2:
The script Is there a released version of numpy that includes #19529 that I could try out? Or could someone else try it out if it easy for them? |
@hodgestar the project does not make MKL builds, that would have to come from conda-forge or a local build. |
Yes, but I still need to know which release to try out. |
@seiko2plus is gh-19529 missing a backport than? I had not thought that those non-inline things could mess up AVX registers in ways breaking later code. |
I'm afraid that the issue may not be related to a certain version of NumPy but to the compiler and flags that have been used during the build.
@seberg, after checking the tags and rethinking, the bug may still even exist after #19529. any non-inline function that holds AVX instructions should end with
My guess was based on an old OpenCV issue, discovered within opencv/gh-12964, and if my guess is correct then we need to add |
woooow, OK... That could easily explain all of this and probably some more. I assume if the compiler does correct cleanup, it will end up as a no-op anyway. Sounds like we have to prioritize fixing that! |
@seberg, it's hard to go blind especially the reason behind using |
Well, unless MKL should be doing some additional cleanup and only gets away with it as long as nobody else uses the registers? In that case adding One thing to remember is that on the other issue valgrind reported an invalid write inside MKL. Now, I suppose that could be a false positive (depending on where the write is). But if it is not, I am not sure how uncleared registers would play in. |
I tried to add In order to produce this bug, you need to build NumPy against gcc7 and with flags And the following patch fix it: --- a/numpy/core/src/umath/loops_exponent_log.dispatch.c.src
+++ b/numpy/core/src/umath/loops_exponent_log.dispatch.c.src
@@ -833,11 +833,11 @@ AVX512F_exp_DOUBLE(npy_double * op,
op += num_lanes;
num_remaining_elements -= num_lanes;
}
- if (overflow_mask) {
+ if (npyv_tobits_b64(overflow_mask)) {
npy_set_floatstatus_overflow();
}
- if (underflow_mask) {
+ if (npyv_tobits_b64(underflow_mask)) {
npy_set_floatstatus_underflow();
}
} I have to analyze the generated instructions in order to understand what's going on, but it seems to me that |
@seiko2plus Wow! Nice digging! 🙇 |
@seiko2plus impressive. I didn't notice a PR to do this, did you make one? |
no, I didn't. I wanted to trace FP env vars through gdb before analyzing the generated instructions but I had an issue with logging into my amazon-aws account since I don't have a local CPU with however, here is my gdb trace(check register $ftag, $st0): gdb --args python test_hodgestar_bug.py
(gdb) break zgetrf_
Function "zgetrf_" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (zgetrf_) pending.
(gdb) run
Starting program: /home/ubuntu/miniconda3/envs/py39/bin/python test_hodgestar_bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, 0x00007ffff6f16840 in zgetrf_ () from /home/ubuntu/miniconda3/envs/py39/lib/libmkl_rt.so.1
(gdb) info register $ftag # expect to be 0xFFFF
ftag 0xa016 40982
(gdb) info register $st0 # expect to be 1.490116119384765625e-08
st0 <invalid float value> (raw 0xffff0000000000000000)
(gdb)
(gdb) c
Continuing.
Breakpoint 1, 0x00007fffed77d3e0 in zgetrf_ () from /home/ubuntu/miniconda3/envs/py39/lib/libmkl_intel_lp64.so.1
(gdb) c
Continuing.
/home/ubuntu/.local/lib/python3.9/site-packages/numpy-1.22.0.dev0+1862.g444a721ce-py3.9-linux-x86_64.egg/numpy/linalg/linalg.py:2146: RuntimeWarning: invalid value encountered in det
r = _umath_linalg.det(a, signature=signature)
[Inferior 1 (process 1497) exited normally
(gdb) # now lets fix the FP env vars before call zgetrf_()
(gdb) run
Starting program: /home/ubuntu/miniconda3/envs/py39/bin/python test_hodgestar_bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, 0x00007ffff6f16840 in zgetrf_ () from /home/ubuntu/miniconda3/envs/py39/lib/libmkl_rt.so.1
(gdb) set $ftag = 0xffff
(gdb) set $st0 = 1.490116119384765625e-08
(gdb) c
Continuing.
Breakpoint 1, 0x00007fffed77d3e0 in zgetrf_ ()
from /home/ubuntu/miniconda3/envs/py39/lib/libmkl_intel_lp64.so.1
(gdb) c
Continuing.
[Inferior 1 (process 1510) exited normally]
Final conclusion, generated instructions according to #20356 (comment) within |
Don't count on the compiler for cast bettween mask and int registers. On gcc7 with flags `-march>=nocona -O3` can causes FP stack overflow which may lead to putting NaN into certain HW/FP calculations. For more details, please check the comments in: - numpy#20356
Don't count on the compiler for cast between mask and int registers. On gcc7 with flags `-march>=nocona -O3` can cause FP stack overflow which may lead to putting NaN into certain HW/FP calculations. For more details, please check the comments in: - numpy#20356
Don't count on the compiler for cast between mask and int registers. On gcc7 with flags `-march>=nocona -O3` can cause FP stack overflow which may lead to putting NaN into certain HW/FP calculations. For more details, please check the comments in: - numpy#20356
@hodgestar, Thank you for your persistence that allowed us to discover such a tricky bug like this. |
Don't count on the compiler for cast between mask and int registers. On gcc7 with flags `-march>=nocona -O3` can cause FP stack overflow which may lead to putting NaN into certain HW/FP calculations. For more details, please check the comments in: - numpy#20356
oh, I forget to mention that, here is a simple reproducer that doesn't require intel MKL: # I'm using intel SDE to emulate AVX512 registers/instructions
/opt/sde/sde64 -- python
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> # will cause FP stack overflow leads to deinitialize x87 FPU
>>> # you can also realize that FMT format extended to fp80
>>> np.exp(0)
1e+00
>>> # sqrt instruction requires to fetch the delta constant from register
>>> # $st0 which is NaN as I mentioned at my gdb trace above
>>> np.sqrt(np.longdouble(1))
<stdin>:1: RuntimeWarning: invalid value encountered in sqrt
1.0
>>> |
I seem to remember a discussion on conda-forge about gcc versions, but cannot find it right now. My typical conda-forge install uses gcc9.3.0, I wonder how @hodgestar is getting a gcc-7 based conda environment? Is that from anaconda (default channel) or conda-forge? |
@mattip, sorry, I meant |
Indeed, that reports
I opened an issue with anaconda. |
Don't count on the compiler for cast between mask and int registers. On gcc7 with flags `-march>=nocona -O3` can cause FP stack overflow which may lead to putting NaN into certain HW/FP calculations. For more details, please check the comments in: - numpy#20356
Thank you @seiko2plus & @mattip! |
@seiko2plus Not sure if you already know this, but this sort of error happens with gcc-10 as well. I encountered a very similar error while trying to fix #20891 (the fix is here: r-devulap@76f8584). The bug manifests when we use a lot of kmask variables
The write to |
In general, GCC isn't correctly optimizing mask operations on my guess is your situation here is similar but in optimizing bitwise operations of numpy/numpy/core/src/common/simd/avx512/operators.h Lines 188 to 202 in 8f8e14e
not sure but last time I checked the bug tracker was already had many bugs related to SKX and properly get fixed. |
Describe the issue:
The following small script returns
nan+nanj
forx
:Commenting out the
np.exp(0)
or runningnp.linalg.det(a)
again returns the correct result, i.e.1+j0
.I've encountered this in numpy 1.21.2 with CPUs that support AVX512 and MKL installed.
It appears to be fixed in numpy 1.21.4 but I couldn't figure out which commit fixed it, so I reporting it here in case it was only fixed accidentally and other cases are lurking out there. Running numpy without MKL appears to avoid the bug.When it fails, the call to
np.linalg.det
prints out:Reproduce the code example:
Error message:
NumPy/Python version information:
1.21.2 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
The text was updated successfully, but these errors were encountered: