
BUG: Calling np.exp(0) causes many lapack calls to return NaNs on CPUs with AVX512 and MKL #20356

Closed
hodgestar opened this issue Nov 12, 2021 · 35 comments · Fixed by #20405

@hodgestar

hodgestar commented Nov 12, 2021

Describe the issue:

The following small script returns nan+nanj for x:

import numpy as np

a = np.diag([1+0j, 1])
np.exp(0)
x = np.linalg.det(a)

Commenting out the np.exp(0), or running np.linalg.det(a) a second time, returns the correct result, i.e. 1+0j.

I've encountered this in numpy 1.21.2 with CPUs that support AVX512 and with MKL installed. It appears to be fixed in numpy 1.21.4, but I couldn't figure out which commit fixed it, so I'm reporting it here in case it was only fixed accidentally and other cases are lurking out there. Running numpy without MKL appears to avoid the bug.

When it fails, the call to np.linalg.det prints out:

2158: RuntimeWarning: invalid value encountered in det
    r = _umath_linalg.det(a, signature=signature)

Reproduce the code example:

import numpy as np

a = np.diag([1+0j, 1])
np.exp(0)
x = np.linalg.det(a)

assert x == 1

Error message:

2158: RuntimeWarning: invalid value encountered in det
  r = _umath_linalg.det(a, signature=signature)
Traceback (most recent call last):
  File "npexp.py", line 15, in <module>
    assert x == 1
AssertionError

NumPy/Python version information:

1.21.2 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0]
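For reference, the version line above can be regenerated with a couple of lines of Python (a sketch; the printed values will of course vary by environment):

```python
import sys
import numpy as np

# Prints the NumPy version followed by the Python build string,
# matching the "NumPy/Python version information" format above.
print(np.__version__, sys.version)
```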

@hodgestar
Author

Small update: Changing np.exp(0) to np.exp(0+0j) also avoids the issue.
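A variant of the reproducer that promotes the warning to a hard error can make the failure easier to catch in an automated test (a sketch: on an affected build the det call raises, on a fixed build the assertion passes):

```python
import warnings
import numpy as np

a = np.diag([1 + 0j, 1])
np.exp(0)  # triggers the corruption on affected builds
with warnings.catch_warnings():
    # Turn "invalid value encountered in det" into a hard error
    warnings.simplefilter("error", RuntimeWarning)
    x = np.linalg.det(a)
assert x == 1
```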

@mattip
Member

mattip commented Nov 12, 2021

@seiko2plus any thoughts?

@seberg
Member

seberg commented Nov 12, 2021

Very confusing, I wonder if there could be any relation to compiler versions? Is it possible that OpenBLAS was missing some initializations in its kernels and we bumped the OpenBLAS version? Or is this with a self-compiled NumPy where the BLAS would be identical?

@hodgestar
Author

I installed both 1.21.2 and 1.21.4 using "conda install numpy==...." and my versions were as follows:

numpy == 1.21.2

$ conda list -n np1212 | grep -i 'blas\|numpy'
blas                      1.0                         mkl  
numpy                     1.21.2           py39h20f2e39_0  
numpy-base                1.21.2           py39h79a1101_0  

numpy == 1.21.4

$ conda list -n np1214 | grep -i 'blas\|numpy'
libblas                   3.9.0           11_linux64_openblas    conda-forge
libcblas                  3.9.0           11_linux64_openblas    conda-forge
liblapack                 3.9.0           11_linux64_openblas    conda-forge
libopenblas               0.3.17          pthreads_h8fe5266_1    conda-forge
numpy                     1.21.4           py39hdbf815f_0    conda-forge

So I do indeed have very different versions of BLAS installed in the two cases. Perhaps this means the real difference lies between the two BLAS libraries and not between the two numpy versions.
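As a quick way to confirm which BLAS/LAPACK a given NumPy build is linked against (and so which of the two environments above one is actually in), NumPy's own np.show_config() prints the build-time library info:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build was compiled and
# linked against, distinguishing the MKL and OpenBLAS environments.
np.show_config()
```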

@mattip
Member

mattip commented Nov 12, 2021

hmm, so maybe an MKL bug?

@hodgestar
Author

The example works fine with numpy 1.20.3 and mkl:

numpy == 1.20.3 and mkl blas

$ conda list -n np1203 | grep -i 'blas\|numpy'
blas                      1.0                         mkl  
numpy                     1.20.3           py39hf144106_0  
numpy-base                1.20.3           py39h74d4b33_0

So if it is an MKL issue, it's one that wasn't triggered by 1.20.3.

@seberg
Member

seberg commented Nov 12, 2021

Interesting, because if this is MKL, it does look similar to: gh-20233

@r-devulap
Member

@hodgestar I couldn't replicate the bug. I am running this on an Intel SKX processor with AVX512 and the same packages as you mention:

>> conda list | grep -i 'blas\|numpy'
blas                      1.0                         mkl
numpy                     1.21.2           py38h20f2e39_0
numpy-base                1.21.2           py38h79a1101_0

Am I missing something?

@hodgestar hodgestar changed the title BUG: Calling np.exp(0) causes many lapack calls to return NaNs on CPUs with AVX512 in numpy 1.21.2 BUG: Calling np.exp(0) causes many lapack calls to return NaNs on CPUs with AVX512 in numpy >= 1.21.2 Nov 15, 2021
@hodgestar hodgestar changed the title BUG: Calling np.exp(0) causes many lapack calls to return NaNs on CPUs with AVX512 in numpy >= 1.21.2 BUG: Calling np.exp(0) causes many lapack calls to return NaNs on CPUs with AVX512 Nov 15, 2021
@hodgestar hodgestar changed the title BUG: Calling np.exp(0) causes many lapack calls to return NaNs on CPUs with AVX512 BUG: Calling np.exp(0) causes many lapack calls to return NaNs on CPUs with AVX512 and MKL Nov 15, 2021
@hodgestar
Author

@r-devulap Are you also using Python 3.9.7 (default, Sep 16 2021, 13:09:58) and GCC 7.5.0?

In case it makes a difference, I'm running it on a Linux machine.

@r-devulap
Member

Funny, looks like it doesn't fail on Python 3.8.12. Here are the steps I followed:

conda create --name py38 python=3.8
conda activate py38
conda install numpy=1.21.2

Run your script and it doesn't fail. Repeat the same with Python 3.9.7:

conda create --name py39 python=3.9
conda activate py39
conda install numpy=1.21.2

And I see the error:

/home/raghuveer/anaconda3/envs/py39/lib/python3.9/site-packages/numpy/linalg/linalg.py:2158: RuntimeWarning: invalid value encountered in det
  r = _umath_linalg.det(a, signature=signature)

@hodgestar
Author

@r-devulap I'm glad you could reproduce it! This is definitely a tricky bug. :|

@hodgestar
Author

A colleague encountered a perhaps related error some time ago. The minimal reproducer he found was:

# eig.py
import numpy as np
import scipy.linalg

x = np.random.rand(25, 25) + 1j*np.random.rand(25, 25)
scipy.linalg.eig(x)

This appeared on macOS (but maybe elsewhere too) and sometimes segfaulted. When run with debug malloc like so, it would sometimes produce a warning:

$ PYTHONMALLOC=pymalloc_debug python -Xfaulthandler,tracemalloc eig.py

And when run under valgrind like so:

$ valgrind --suppressions=valgrind-python.supp python -Xfaulthandler eig.py

it would sometimes give an error.

See qutip/qutip#1160 for full details (search for the "Minimal reproducer" heading).

I'm not 100% sure it's related, but it is another data point.

@seberg
Member

seberg commented Nov 15, 2021

Maybe it is time to push this to MKL? I suppose a pure C reproducer would be nice for that and for confirmation (maybe in valgrind, since that seems to work).

We are now at similar issues with different calls and two different interfaces (since I don't think scipy/numpy share code here?).

@hodgestar
Author

@seberg I would like the possibility that this is a numpy bug to be taken more seriously. Yes, it definitely could be entirely MKL doing something badly, but it works fine in numpy 1.20 (I should perhaps have made that a bit clearer in my bug report), MKL works fine without the call to np.exp, and calling the MKL function again clears the error (so it seems possible that MKL is cleaning something up that it expects other code to clean up, or perhaps just setting something too late).

The fact that the Python version makes a difference also argues that MKL isn't doing something egregiously wrong and that the error is more complex than that.

I'm not asking you to fix this -- I'm just asking you not to simply declare this MKL's problem. I'm happy to report this to MKL too if someone points me in the right direction for reporting things there.

@seiko2plus
Member

seiko2plus commented Nov 16, 2021

@hodgestar, could you please disable dispatching of AVX512/AVX2 at runtime and try to reproduce this error again?

export NPY_DISABLE_CPU_FEATURES="AVX2 FMA3 AVX512F AVX512_SKX"

If the error disappears, then it belongs to us, and was probably fixed by #19529 (just a guess).

so it seems possible that MKL is cleaning something up that it expects other code to clean up, or perhaps just setting something too late

If there's something that needs to be cleaned up when AVX is involved, it would be VZEROUPPER/VZEROALL, used to avoid the penalty of mixing non-VEX/SSE instructions. As I remember, there was a bug OpenCV faced related to using VZEROUPPER on GCC <= 7.5; it was fixed by replacing it with VZEROALL when AVX flags are passed to the compiler. Something similar may be causing this bug.

EDIT: extra info related to cleanup SIMD/AVX kernels

@hodgestar
Author

@seiko2plus Thank you! I ran the experiment you requested and these are the results:

With AVX512/AVX2 (same environment as described above):

$ python npexp4.py 
/home/simon/miniconda3/envs/np1212/lib/python3.9/site-packages/numpy/linalg/linalg.py:2158: RuntimeWarning: invalid value encountered in det
  r = _umath_linalg.det(a, signature=signature)
Traceback (most recent call last):
  File "/home/simon/np1212/npexp4.py", line 14, in <module>
    assert x == 1
AssertionError

Without AVX512/AVX2:

$ NPY_DISABLE_CPU_FEATURES="AVX2 FMA3 AVX512F AVX512_SKX" python npexp4.py

The script npexp4.py is my reproducer snippet from the top of the bug report.

Is there a released version of numpy that includes #19529 that I could try out? Or could someone else try it out if it is easy for them?
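To double-check that NPY_DISABLE_CPU_FEATURES actually took effect, NumPy exposes a runtime feature dictionary. Note this is a private API (present since roughly NumPy 1.19) and may change between releases; a sketch:

```python
import numpy as np

try:
    # Private API: maps CPU feature names to whether the runtime
    # dispatcher detected *and* enabled them. Features disabled via
    # NPY_DISABLE_CPU_FEATURES should report False here.
    from numpy.core._multiarray_umath import __cpu_features__
    for feat in ("AVX2", "FMA3", "AVX512F", "AVX512_SKX"):
        print(feat, __cpu_features__.get(feat, False))
except Exception:
    print("CPU feature introspection not available in this NumPy build")
```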

@mattip
Member

mattip commented Nov 16, 2021

@hodgestar the project does not make MKL builds, that would have to come from conda-forge or a local build.

@hodgestar
Author

@hodgestar the project does not make MKL builds, that would have to come from conda-forge or a local build.

Yes, but I still need to know which release to try out.

@seberg
Member

seberg commented Nov 16, 2021

@seiko2plus is gh-19529 missing a backport then? I had not thought that those non-inline things could mess up AVX registers in ways that break later code.

@seiko2plus
Member

seiko2plus commented Nov 16, 2021

@hodgestar,

Yes, but I still need to know which release to try out.

I'm afraid the issue may not be related to a certain version of NumPy but to the compiler and flags that were used during the build.
I'm not familiar with conda, but try to build NumPy from source and reproduce the error using the same version of NumPy, making sure to:

  • Figure out the exact version of GCC
  • Pass the same flags they used via the env variable CFLAGS; I think they enabled the AVX flag for all NumPy sources via -mavx or -march=ivybridge, since some build environments assume that all CPUs nowadays have at least the AVX extension
  • Get the same MKL version

@seberg, after checking the tags and rethinking, the bug may still exist even after #19529. Any non-inline function that holds AVX instructions should end with VZEROUPPER or VZEROALL to avoid performance issues related to mixing with non-VEX/SSE, a kind of HW/kernel (winnt) workaround. Almost all compilers handle this routine for us, but it's still better to specify it manually (just in case), which is why we have a universal intrinsic called npyv_cleanup():

#define npyv_cleanup _mm256_zeroall

My guess was based on an old OpenCV issue, discovered within opencv/gh-12964, and if my guess is correct then we need to add _mm256_zeroall() to the ufunc inner functions that are defined in loops_exponent_log.dispatch.c.src.

@seberg
Member

seberg commented Nov 16, 2021

and if my guess is correct then we need to add _mm256_zeroall() to AVX functions that are defined in loops_exponent_log.dispatch.c.src.

woooow, OK... That could easily explain all of this and probably some more. I assume if the compiler does correct cleanup, it will end up as a no-op anyway. Sounds like we have to prioritize fixing that!

@seiko2plus
Member

seiko2plus commented Nov 16, 2021

@seberg, it's hard to go in blind, especially since the reason for using VZEROUPPER or VZEROALL in the first place is purely performance: if you execute SSE/non-VEX-encoded instructions after AVX/VEX without calling VZEROUPPER or VZEROALL, the SSE instructions slow down. opencv/gh-12964 mentioned that VZEROUPPER "unpredictably spoils local variables even when there's no other intrinsics used in the code", so I'm assuming this is related, but I could be wrong. However, according to @hodgestar, the error disappears when AVX2 and AVX512 are disabled, and since MKL doesn't rely on our CPU runtime detection mechanism, somehow this issue belongs to us. I will try to dig into it to find out what's going on rather than keep guessing.

@seberg
Member

seberg commented Nov 16, 2021

then for somehow this issue belongs to us

Well, unless MKL should be doing some additional cleanup and only gets away with skipping it as long as nobody else uses the registers? In that case, adding npyv_cleanup() could actually hide the MKL bug?

One thing to remember is that on the other issue valgrind reported an invalid write inside MKL. Now, I suppose that could be a false positive (depending on where the write is). But if it is not, I am not sure how uncleared registers would play in.

@seiko2plus
Member

seiko2plus commented Nov 16, 2021

I tried adding npyv_cleanup() but it didn't fix the issue. Also, after checking conda's flags, I found that they haven't specified any flags that enable AVX for all our sources. However, it's definitely a GCC bug and it belongs to us.

To reproduce this bug, you need to build NumPy with gcc7 and with flags export CFLAGS="-march=nocona -O3", similar to what conda does. I wasn't able to reproduce the bug on gcc9 or without the flags -march=nocona -O3.

And the following patch fixes it:

--- a/numpy/core/src/umath/loops_exponent_log.dispatch.c.src
+++ b/numpy/core/src/umath/loops_exponent_log.dispatch.c.src
@@ -833,11 +833,11 @@ AVX512F_exp_DOUBLE(npy_double * op,
         op += num_lanes;
         num_remaining_elements -= num_lanes;
     }
-    if (overflow_mask) {
+    if (npyv_tobits_b64(overflow_mask)) {
         npy_set_floatstatus_overflow();
     }
 
-    if (underflow_mask) {
+    if (npyv_tobits_b64(underflow_mask)) {
         npy_set_floatstatus_underflow();
     }
 }

I have to analyze the generated instructions to understand what's going on, but it seems to me that gcc7 badly generates the KMOVB r32, k instruction when the flag -mavx512dq is passed to the compiler after -march=nocona -O3. The above change makes sure _cvtmask8_u32 / KMOVB r32, k is used correctly when AVX512DQ is available. Note: I tried replacing npy_set_floatstatus_* with printf and nop but the error still remains.
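At the Python level, the two patched branches control whether the FP overflow/underflow status collected by the vectorized loop is reported; np.errstate makes this observable (a sketch illustrating the mechanism, not part of the original patch):

```python
import numpy as np

# np.exp on a large input sets the FP overflow status; the patched
# `if (npyv_tobits_b64(overflow_mask))` branch is what triggers it.
# Under errstate(over="raise"), the status surfaces as an exception.
with np.errstate(over="raise"):
    try:
        np.exp(np.array([1000.0]))
        print("no overflow flagged")
    except FloatingPointError:
        print("overflow flagged")
```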

@hodgestar
Author

@seiko2plus Wow! Nice digging! 🙇

@mattip
Member

mattip commented Nov 18, 2021

@seiko2plus impressive. I didn't notice a PR to do this, did you make one?

@seiko2plus
Member

seiko2plus commented Nov 18, 2021

@mattip,

impressive. I didn't notice a PR to do this, did you make one?

no, I didn't. I wanted to trace the FP env vars through gdb before analyzing the generated instructions, but I had an issue logging into my amazon-aws account (I don't have a local CPU with AVX512), and Intel SDE is not friendly with gdb even with the -debug flag.

however, here is my gdb trace (check registers $ftag and $st0):

gdb --args python test_hodgestar_bug.py
(gdb) break zgetrf_
Function "zgetrf_" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (zgetrf_) pending.
(gdb) run
Starting program: /home/ubuntu/miniconda3/envs/py39/bin/python test_hodgestar_bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, 0x00007ffff6f16840 in zgetrf_ () from /home/ubuntu/miniconda3/envs/py39/lib/libmkl_rt.so.1
(gdb) info register $ftag # expect to be 0xFFFF
ftag           0xa016              40982
(gdb) info register $st0 # expect to be 1.490116119384765625e-08
st0            <invalid float value> (raw 0xffff0000000000000000)
(gdb) 
(gdb) c
Continuing.

Breakpoint 1, 0x00007fffed77d3e0 in zgetrf_ () from /home/ubuntu/miniconda3/envs/py39/lib/libmkl_intel_lp64.so.1
(gdb) c
Continuing.
/home/ubuntu/.local/lib/python3.9/site-packages/numpy-1.22.0.dev0+1862.g444a721ce-py3.9-linux-x86_64.egg/numpy/linalg/linalg.py:2146: RuntimeWarning: invalid value encountered in det
  r = _umath_linalg.det(a, signature=signature)
[Inferior 1 (process 1497) exited normally]
(gdb) # now lets fix the FP env vars before call zgetrf_()
(gdb) run 
Starting program: /home/ubuntu/miniconda3/envs/py39/bin/python test_hodgestar_bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, 0x00007ffff6f16840 in zgetrf_ () from /home/ubuntu/miniconda3/envs/py39/lib/libmkl_rt.so.1
(gdb) set $ftag = 0xffff
(gdb) set $st0 = 1.490116119384765625e-08
(gdb) c
Continuing.

Breakpoint 1, 0x00007fffed77d3e0 in zgetrf_ ()
   from /home/ubuntu/miniconda3/envs/py39/lib/libmkl_intel_lp64.so.1
(gdb) c
Continuing.
[Inferior 1 (process 1510) exited normally]

Final conclusion: the instructions generated (per #20356 (comment)) within AVX512F_exp_DOUBLE() cause FP stack overflows, which lead to putting NaN into certain HW/FP calculations.

seiko2plus added a commit to seiko2plus/numpy that referenced this issue Nov 18, 2021
Don't count on the compiler for casts between mask and int registers.
  On gcc7, flags `-march>=nocona -O3` can cause FP stack overflow,
  which may lead to putting NaN into certain HW/FP calculations.

  For more details, please check the comments in:
    - numpy#20356
@seiko2plus
Member

seiko2plus commented Nov 18, 2021

@hodgestar, thank you for your persistence, which allowed us to discover such a tricky bug. conda should get rid of gcc7 as soon as possible; checking the GCC bug tracker, I found a lot of bugs related to Skylake. #20405 should fix this bug.

@seiko2plus
Member

seiko2plus commented Nov 19, 2021

oh, I forgot to mention: here is a simple reproducer that doesn't require Intel MKL:

# I'm using intel SDE to emulate AVX512 registers/instructions
/opt/sde/sde64 -- python
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> # will cause FP stack overflow, which leads to deinitializing the x87 FPU
>>> # note also that the printed format is extended to fp80
>>> np.exp(0)
1e+00
>>> # the sqrt instruction needs to fetch its delta constant from register
>>> # $st0, which is NaN as I mentioned in my gdb trace above
>>> np.sqrt(np.longdouble(1))
<stdin>:1: RuntimeWarning: invalid value encountered in sqrt
1.0
>>> 
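The same sequence can be turned into a self-checking script by promoting the warning to an error (a sketch: on an affected build the sqrt line raises a RuntimeWarning, on a fixed build the assertion passes):

```python
import warnings
import numpy as np

np.exp(0)  # on affected builds this leaves the x87 FP stack corrupted
with warnings.catch_warnings():
    # Turn "invalid value encountered in sqrt" into a hard error
    warnings.simplefilter("error", RuntimeWarning)
    r = np.sqrt(np.longdouble(1))
assert r == 1.0
```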

@mattip
Member

mattip commented Nov 19, 2021

conda should get rid of gcc7 as soon as possible, by checking the GCC bug tracker I found a lot of bugs related to skylake

I seem to remember a discussion on conda-forge about gcc versions, but cannot find it right now. My typical conda-forge install uses gcc 9.3.0; I wonder how @hodgestar is getting a gcc-7-based conda environment. Is that from anaconda (the default channel) or conda-forge?

@seiko2plus
Member

@mattip, sorry, I meant anaconda; it seems there are many condas out there. I got their installer from latest/miniconda and then followed @r-devulap's #20356 (comment).

@mattip
Member

mattip commented Nov 19, 2021

Indeed, that reports

Python 3.9.5 (default, Jun  4 2021, 12:28:51) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

I opened an issue with anaconda.

charris pushed a commit to charris/numpy that referenced this issue Nov 19, 2021
Don't count on the compiler for casts between mask and int registers.
  On gcc7, flags `-march>=nocona -O3` can cause FP stack overflow,
  which may lead to putting NaN into certain HW/FP calculations.

  For more details, please check the comments in:
    - numpy#20356
@hodgestar
Author

Thank you @seiko2plus & @mattip!

@r-devulap
Member

r-devulap commented Jan 28, 2022

@seiko2plus Not sure if you already know this, but this sort of error happens with gcc-10 as well. I encountered a very similar error while trying to fix #20891 (the fix is here: r-devulap@76f8584). The bug manifests when we use a lot of __mmask8 kmask variables. Since there are only 8 kmask registers (technically just 7: k0 is not usable at will), gcc needs to save and restore __mmask8 variables onto the stack, and sometimes it uses MMX registers to do so (which is bizarre):

kmovb  %k3, %ecx
movd   %ecx,%mm0
movd   %mm0,-0x38(%rbp)

The write to the mm0 register is what sets the $ftag register to an invalid state. This is obviously a GCC bug and exists even in gcc-10. I am trying to write a much simpler reproducer to report the bug to gcc. I couldn't reproduce the bug in gcc-11 though; perhaps it's already fixed? It's still worth reporting anyway, I think.

@seiko2plus
Copy link
Member

seiko2plus commented Jan 28, 2022

@r-devulap,

In general, GCC isn't correctly optimizing mask operations on mask8 when the SKX flags -mavx512bw -mavx512dq have been provided, and this sometimes breaks the stack-pointer save sequence in large kernels that involve a lot of kmask variables, as you mentioned. In my case here, the issue was related to mask8-to-int conversion, and it was fixed by taking the lead and using kmovb via _cvtmask8_u32 when it's available, instead of letting the compiler do unexpectedly aggressive optimization. Note: during my tests I wasn't able to reproduce this bug on gcc-9.

My guess is your situation here is similar, but in optimizing bitwise operations on mask8 (mask8 to mask16) instead of mask8-to-int conversion (my case). In other words, try replacing the intrinsics _mm512_k[and, or, xor, not] with npyv_[and, or, xor, not]_b64 to force using k[and, or, xor, not]b in order to fix this issue.

#ifdef NPY_HAVE_AVX512DQ_MASK
#define npyv_and_b64 _kand_mask8
#define npyv_or_b64 _kor_mask8
#define npyv_xor_b64 _kxor_mask8
#define npyv_not_b64 _knot_mask8
#else
NPY_FINLINE npyv_b64 npyv_and_b64(npyv_b64 a, npyv_b64 b)
{ return (npyv_b64)_mm512_kand((npyv_b32)a, (npyv_b32)b); }
NPY_FINLINE npyv_b64 npyv_or_b64(npyv_b64 a, npyv_b64 b)
{ return (npyv_b64)_mm512_kor((npyv_b32)a, (npyv_b32)b); }
NPY_FINLINE npyv_b64 npyv_xor_b64(npyv_b64 a, npyv_b64 b)
{ return (npyv_b64)_mm512_kxor((npyv_b32)a, (npyv_b32)b); }
NPY_FINLINE npyv_b64 npyv_not_b64(npyv_b64 a)
{ return (npyv_b64)_mm512_knot((npyv_b32)a); }
#endif

I couldn't reproduce the bug in gcc-11 though, perhaps its already fixed? Its still worth reporting anyways, I think.

Not sure, but the last time I checked, the bug tracker already had many bugs related to SKX that properly got fixed.
