
BUG: <input>:1: RuntimeWarning: invalid value encountered in matmul #24067

Closed
hbombDTU opened this issue Jun 28, 2023 · 43 comments · Fixed by #24199

@hbombDTU

Describe the issue:

Getting a RuntimeWarning with numpy version 1.25.0 when doing matrix multiplication with the '@' operator.

Once an array surpasses a certain size limit, multiplying the two matrices produces a RuntimeWarning.

However, the warning disappears when downgrading numpy to 1.24.4. It also goes away when making a copy of the array and multiplying the original matrix with the copy (X.T @ X_copy).

file1.zip
file2.zip
file3.zip

Reproduce the code example:

import numpy as np 

# Loading the first file
new = np.load('file1.zip')['file1']

# concatenating the other files 
for file in ['file2', 'file3']:
    x = np.load(f'{file}.zip')[file]
    new = np.concatenate([new, x])

# performing matrix multiplication
new.T @ new

Error message:

<input>:1: RuntimeWarning: invalid value encountered in matmul

Runtime information:

import sys, numpy; print(numpy.__version__); print(sys.version)
1.25.0
3.9.0 (tags/v3.9.0:9cf6752, Oct 5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]

print(numpy.show_runtime())
[{'numpy_version': '1.25.0',
'python': '3.9.0 (tags/v3.9.0:9cf6752, Oct 5 2020, 15:34:40) [MSC v.1927 64 '
'bit (AMD64)]',
'uname': uname_result(system='Windows', node='PC04218', release='10', version='10.0.22621', machine='AMD64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'Haswell',
'filepath': 'C:\Users\hjk\dcalgo\dcalgo-deidid_q1to16_nurser_vwap\.venv\Lib\site-packages\numpy\.libs\libopenblas64__v0.3.23-gcc_10_3_0.dll',
'internal_api': 'openblas',
'num_threads': 24,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23'},
{'architecture': 'Haswell',
'filepath': 'C:\Users\hjk\dcalgo\dcalgo-deidid_q1to16_nurser_vwap\.venv\Lib\site-packages\scipy.libs\libopenblas_v0.3.20-571-g3dec11c6-gcc_10_3_0-c2315440d6b6cef5037bad648efc8c59.dll',
'internal_api': 'openblas',
'num_threads': 24,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.21.dev'},
{'filepath': 'C:\Users\hjk\dcalgo\dcalgo-deidid_q1to16_nurser_vwap\.venv\Lib\site-packages\sklearn\.libs\vcomp140.dll',
'internal_api': 'openmp',
'num_threads': 24,
'prefix': 'vcomp',
'user_api': 'openmp',
'version': None}]
None

Context for the issue:

No response

@charris (Member) commented Jun 28, 2023

We see that occasionally on Windows; it doesn't seem repeatable. Are you using a 32 bit version of NumPy?

@hbombDTU (Author)

Hi Charris,
I believe we are using the 64-bit version of NumPy.

@maciejskorski

I can't reproduce it either; it runs fine on Colab. See/try this notebook.

@StefRe commented Jun 29, 2023

@maciejskorski the Colab notebook gives the same warning (<ipython-input-4-fa27106da462>:10: RuntimeWarning: invalid value encountered in matmul) for new.T @ new when run with numpy version 1.25.0.

@maciejskorski commented Jun 29, 2023

@StefRe thanks for testing the breaking version.
I also tested that the matrix content is irrelevant; the warning is triggered by this:

M = np.random.random((30000, 270)) # warns on Colab under numpy==1.25
M.T @ M

So we have our minimal not-working example (MNWE)! 👍
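For scripted testing of the MNWE, the warning can be captured programmatically instead of letting it print (a minimal sketch; the (30000, 270) shape is the size reported above to cross the threshold on affected setups):

```python
import warnings
import numpy as np

# Reproduce the MNWE while recording any RuntimeWarning instead of printing it.
M = np.random.random((30000, 270))
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = M.T @ M

# On an affected NumPy 1.25.0 install `caught` holds the matmul RuntimeWarning;
# on unaffected versions it stays empty. The result itself is well-formed.
print(result.shape)
print([str(w.message) for w in caught if issubclass(w.category, RuntimeWarning)])
```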

@seberg (Member) commented Jun 29, 2023

Thanks, this must come from somewhere in the OpenBLAS internals. Probably while deciding things like how to chunk up the job (in theory it could be uninitialized values).

The annoying thing: it would help to recompile OpenBLAS with debugging info (and maybe not super high optimization). That is unfortunately only the first step; the second is to write a small C program and compile it with trapping math (or enable it), or to run NumPy but then enable the trapping math after importing NumPy (which is possible, by hacking it in).

Is anyone interested in diving into that? I can share some hack for the second part.

(I have done this type of thing once, but without recompiling OpenBLAS, so it was a bit hard to figure out where exactly the issue happens.)

@seberg (Member) commented Jun 29, 2023

Actually, maybe enabling trapping math (while possible) isn't that helpful... We know approximately where this happens, so stepping through until a NaN pops up somewhere may well be good enough.

@maciejskorski commented Jun 29, 2023

Why has this not been detected by the matmul tests? Don't they check for warnings 🤔
The matrix size that triggers the problem probably depends on the architecture, memory and so on? Playing with my notebook, I can see that (29000, 270) works fine but (30000, 270) triggers the warning. @seberg is this what you referred to as "chunking up the job"?

@seberg (Member) commented Jun 29, 2023

We do see it randomly, but only "reliably" (still randomly) on the win32 CI (if this is the same thing). So CI might well hit it, but maybe only on your machine or so.
It is not completely impossible that this comes from uninitialized values even, or the compiler doing unsafe optimizations...

Or, our test suite just doesn't run the magic size that triggers it reliably here... In either case, we need to find out where it happens, ideally in a debugger where you can step through the OpenBLAS code to figure it out precisely. Once that is known, the OpenBLAS fixup is probably easy.

chunking up the job

The BLAS implementation always has a lot of calculations to figure out how to distribute the kernel to its worker threads (and maybe just finding the size of scratch space needed). The most probable issue is that one of those calculations creates a NaN where it just doesn't matter, for example by taking the root of a negative number, or by using uninitialized memory that happens to be an infinity, or...

EDIT: That said, it could of course also be the kernel itself that uses uninitialized values due to vectorization.

@maciejskorski

We do see it randomly, but only "reliably" (still randomly) on the win32 CI (if this is the same thing). So CI might well hit it, but maybe only on your machine or so. It is not completely impossible that this comes from uninitialized values even, or the compiler doing unsafe optimizations...

The example I shared is reproducible under Colab's container, so this happens on Linux too.

Or, our test suite just doesn't run the magic size that triggers it reliably here... In either case, we need to find out where it happens, ideally in a debugger where you can step through the OpenBLAS code to figure it out precisely. Once that is known, the OpenBLAS fixup is probably easy.

chunking up the job

The BLAS implementation always has a lot of calculations to figure out how to distribute the kernel to its worker threads (and maybe just finding the size of scratch space needed). The most probable issue is that one of those calculations creates a NaN where it just doesn't matter, for example by taking the root of a negative number, or by using uninitialized memory that happens to be an infinity, or...

In this specific case, matmul, it is likely chunking the matrix to make it fit into the L2 cache or so?

Would it make sense to test the problematic matmul under the "native" OpenBLAS, following their docs?

EDIT: That said, it could of course also be the kernel itself that uses uninitialized values due to vectorization.

@mattip (Member) commented Jun 29, 2023

Why has this not been detected by the matmul tests? Don't they check for warnings

The easy answer is "it hits a code path we do not hit in tests". The harder question is "what codepath is triggered". I would imagine it is something like #16744, which turned out to be caused by registers not being properly restored. Maybe there is some race condition when chunking the large array.

@mattip (Member) commented Jun 29, 2023

Can you try limiting OpenBLAS to one thread with threadpoolctl?

from threadpoolctl import threadpool_limits
import numpy as np
with threadpool_limits(limits=1, user_api='blas'):
    M = np.random.random((30000, 270))
    M.T @ M
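If threadpoolctl is not installed, an alternative sketch is limiting OpenBLAS via its environment variable; unlike the context manager above, this must happen before NumPy is imported and cannot be toggled per call:

```python
import os

# OPENBLAS_NUM_THREADS is read by OpenBLAS when the library loads, so set it
# before importing NumPy for the first time in the process.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

M = np.random.random((30000, 270))
result = M.T @ M
print(result.shape)  # (270, 270)
```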

@mattip (Member) commented Jun 29, 2023

... answering my own question:

  • this reproduces on Ubuntu22.04 on an AMD64 machine with NumPy 1.25.0
  • it reproduces even when limiting OpenBLAS to one thread

@mattip (Member) commented Jun 29, 2023

>>> np.show_runtime()
[{'numpy_version': '1.25.0',
  'python': '3.11.4 (main, Jun  7 2023, 12:45:48) [GCC 11.3.0]',
  'uname': uname_result(system='Linux', node='matti-ryzen', release='5.15.0-75-generic', version='#82-Ubuntu SMP Tue Jun 6 23:10:23 UTC 2023', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2'],
                      'not_found': ['AVX512F',
                                    'AVX512CD',
                                    'AVX512_KNL',
                                    'AVX512_KNM',
                                    'AVX512_SKX',
                                    'AVX512_CLX',
                                    'AVX512_CNL',
                                    'AVX512_ICL']}},
 {'architecture': 'Zen',
  'filepath': '/tmp/venv3/lib/python3.11/site-packages/numpy.libs/libopenblas64_p-r0-7a851222.3.23.so',
  'internal_api': 'openblas',
  'num_threads': 24,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.23'}]
>>> 

@maciejskorski

Can you try limiting OpenBLAS to one thread with threadpoolctl

Not helping :-(

@seberg (Member) commented Jun 29, 2023

Is it specific to a kernel? I once had a similar thing on the Mac actually, and tried to narrow things down, but never got further than "it's in the syrk routines where some sizes are estimated". It would be nice if it's there, because the alternative seems even harder to figure out.

@maciejskorski

Here is the Colab runtime spec:

[{'numpy_version': '1.25.0',
  'python': '3.10.12 (main, Jun  7 2023, 12:45:35) [GCC 9.4.0]',
  'uname': uname_result(system='Linux', node='ea606d5c044b', release='5.15.107+', version='#1 SMP Sat Apr 29 09:15:28 UTC 2023', machine='x86_64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2'],
                      'not_found': ['AVX512F',
                                    'AVX512CD',
                                    'AVX512_KNL',
                                    'AVX512_KNM',
                                    'AVX512_SKX',
                                    'AVX512_CLX',
                                    'AVX512_CNL',
                                    'AVX512_ICL']}},
 {'architecture': 'Haswell',
  'filepath': '/usr/local/lib/python3.10/dist-packages/numpy.libs/libopenblas64_p-r0-7a851222.3.23.so',
  'internal_api': 'openblas',
  'num_threads': 2,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.23'}]

@seberg (Member) commented Jun 29, 2023

Haswell and Zen... seems unlikely that it is the core type/kernel then (although they might use the same one here, I guess).

@mattip (Member) commented Jun 29, 2023

Strangely, the result contains neither a NaN nor an inf:

>>> X = M.T @ M
<stdin>:1: RuntimeWarning: invalid value encountered in matmul
>>> np.any(np.isnan(X))
False
>>> np.any(np.isinf(X))
False

@maciejskorski

Haswell and Zen... seems unlikely that it is the core type/kernel then (although might use the same one here I guess).

I can only make educated guesses, but to me the experiment ("matrix size matters for the warning") plus your earlier remarks on chunking the job suggest it has something to do with doing matmul in cache-sized batches?

@maciejskorski commented Jun 29, 2023

Strangely, the result does not have a Nan nor an inf:

>>> X = M.T @ M
<stdin>:1: RuntimeWarning: invalid value encountered in matmul
>>> np.any(np.isnan(X))
False
>>> np.any(np.isinf(X))
False

It does not. When the same question was asked on SO, people blamed the OP for feeding ML models NaNs 😄

@hbombDTU (Author)

My colleagues have also reproduced the warning, so it is not a kernel issue, and we have also checked for NaNs and infs. Will there be further investigation into this? The calculations seem to work fine, but what potential problems could this warning indicate?
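Since the results were verified above to be NaN- and inf-free, a narrowly scoped interim workaround is to silence only this warning category around the affected call (a sketch using np.errstate; this hides the symptom rather than fixing the cause):

```python
import numpy as np

M = np.random.random((30000, 270))

# Suppress only "invalid value" floating-point warnings for this one
# operation; other categories (overflow, divide) still surface normally.
with np.errstate(invalid="ignore"):
    X = M.T @ M

# Sanity-check that nothing actually went wrong numerically.
print(bool(np.any(np.isnan(X)) or np.any(np.isinf(X))))  # False
```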

@seberg (Member) commented Jun 29, 2023

See if this helps (warning: it doesn't work in ipython for me, because ipython itself causes floating point exceptions -- although, if running in a debugger, you can probably continue past those):

import ctypes
import numpy as np
dll = ctypes.CDLL(np.core._multiarray_umath.__file__)
dll.feclearexcept(61)
dll.feenableexcept(61)

The above 61 is the hardcoded mask value that I have on my linux laptop. If you do that and then run the code, we may be able to narrow down where it's raised first... After that maybe we can go deeper.

(Yes, this is a hack, and will only work on Linux.) EDIT2: of course this is only useful if you run inside a debugger, such as with gdb --args python and then r.
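For reference, the 61 is consistent with the glibc x86-64 <fenv.h> exception flags (an assumption about that particular build; other libcs and architectures define different values): it is the OR of all five standard flags, i.e. FE_ALL_EXCEPT:

```python
# glibc x86-64 <fenv.h> floating-point exception flag values
# (platform-specific; other libcs/architectures differ).
FE_INVALID   = 0x01
FE_DIVBYZERO = 0x04
FE_OVERFLOW  = 0x08
FE_UNDERFLOW = 0x10
FE_INEXACT   = 0x20

FE_ALL_EXCEPT = (FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW
                 | FE_UNDERFLOW | FE_INEXACT)
print(FE_ALL_EXCEPT)  # 61
```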

@mattip (Member) commented Jun 29, 2023

...
>>> dll.feenableexcept(61)
0
>>> M = np.random.random((30000, 270))
>>> M.T @ M

Thread 1 "python" received signal SIGFPE, Arithmetic exception.
0x00007ffff55417e5 in cblas_dsyrk () from /usr/local/lib/libopenblas.so.0
(gdb) 

@mattip (Member) commented Jun 29, 2023

(gdb) bt
#0  0x00007ffff55417e5 in cblas_dsyrk () from /usr/local/lib/libopenblas.so.0
#1  0x00007ffff73a3153 in DOUBLE_matmul_matrixmatrix (ip1=0x7fffb6b3b010, is1_m=8, is1_n=2160, ip2=0x7fffb6b3b010, is2_n=2160, 
    is2_p=8, op=0x7fffb6aac010, os_m=2160, os_p=8, m=270, n=30000, p=270) at ../numpy/core/src/umath/matmul.c.src:174
#2  0x00007ffff73a671e in DOUBLE_matmul (args=0x7ffff7bbb980, dimensions=0x7fffffffbc88, steps=0xf510b8, __NPY_UNUSED_TAGGEDfunc=0x0)
    at ../numpy/core/src/umath/matmul.c.src:478
...

@martin-frbg any thoughts? Quick recap: this happens when calling syrk with a large input matrix. It did not happen with OpenBLAS 0.3.20, but happens with 0.3.23.

@martin-frbg

There are a few recent changes, mostly related to where the switch from single- to multithreading occurs. One such change made chunking-related variables available to "dynamic arch" builds that previously used one set for all cpus; maybe this is where an uninitialized variable could come into play. I'll try to reproduce and bisect later today.

@maciejskorski

To test another OpenBLAS version, one has to compile from source, is that right?
Is this issue something that can be reduced to a few lines of native OpenBLAS and delegated to their dev team?

@mattip (Member) commented Jun 29, 2023

delegated to their dev team?

Martin is the OpenBLAS dev team :)

@maciejskorski

delegated to their dev team?

Martin is the OpenBLAS dev team :)

A faux pas, I am very sorry :-) I hope we will keep this discussion as verbose and educational as possible; the internals of these computations sound fascinating.

@martin-frbg

Silly bug introduced by me three months ago when I reworked the multithreading thresholds for SYMM and SYRK - the calculated workload size could trivially overflow the puny int I had provided for it. (Which also explains why the result of the call was still correct, unless one trapped the exception.) Trivial fix in OpenMathLib/OpenBLAS#4116.
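As an illustration of how such an overflow lines up with the sizes observed earlier in the thread (the exact OpenBLAS workload formula is not reproduced here; rows * cols * cols is a hypothetical stand-in), a 32-bit signed int overflows exactly between the working (29000, 270) and failing (30000, 270) shapes:

```python
# Hypothetical workload estimate for M.T @ M with M of shape (rows, cols).
# The real OpenBLAS expression may differ; this only shows that a product of
# this magnitude no longer fits a 32-bit signed int.
INT32_MAX = 2**31 - 1  # 2147483647

def workload(rows, cols):
    return rows * cols * cols

print(workload(29000, 270) <= INT32_MAX)  # True  -> no warning observed
print(workload(30000, 270) <= INT32_MAX)  # False -> warning observed
```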

@martin-frbg

So not much to learn from it except don't trust my judgement :/

@charris (Member) commented Jun 29, 2023

Darn, I was hoping for a fix for the long-standing occasional warnings on Windows :) But now I wonder if they are related to Windows having 32-bit default integers?

@charris (Member) commented Jun 29, 2023

Just for reference, the Windows problem manifested after the @ operator was directly implemented instead of using einsum.

@maciejskorski commented Jun 29, 2023

Just for reference, the Windows problem manifested after the @ operator was directly implemented instead of using einsum.

Here we were able to locate the problem by reproducing it and narrowing the scope in Colab, which others could access independently and comment on. How do we debug and reproduce on Windows?

@charris (Member) commented Jun 29, 2023

@maciejskorski The problem is that the warning is only occasional, but always in the same test. There is another occasional warning involving complex numbers. Both of those only occur on Windows and don't seem to affect the correctness of the results; they just warn of invalid values.

@maciejskorski commented Jun 29, 2023

@maciejskorski The problem is that the warning is only occasional, but always in the same test. There is another occasional warning involving complex numbers. Both of those only occur on Windows and don't seem to affect the correctness of the results, they just warn of invalid values.

Is this a numpy test on CI/CD? Does it occur occasionally because the data is randomized, or because the algorithm is not fully deterministic (for various technical reasons, like parallelization)? Could we track and eventually (once spotted) share the data triggering the error?

@martin-frbg

Is it reasonable to assume the other warning is related to something in OpenBLAS as well?

@charris (Member) commented Jun 29, 2023

@martin-frbg I don't know. I've been shrugging them off as probably a platform/compiler problem. Going forward, I am going to start tracking them in an issue. They only occur a couple of times a month, and seem less frequent than they used to be, but that is very subjective.

@mattip (Member) commented Jun 30, 2023

Is the OpenBLAS develop branch stable enough to update openblas-libs to fix this issue?

@martin-frbg

I think so. I'm currently trying to coordinate with xianyi after he made one of his ninja commits, but I expect any immediate action will be limited to risc-v.

@hbombDTU (Author)

Could I get the current status of this bug?

@mattip (Member) commented Jul 11, 2023

I tried building the OpenBLAS version in MacPython/openblas-libs#101 but the macos arm64 build failed to recognize the new SVE intrinsics.

@mattip (Member) commented Jul 17, 2023

#24199 updates OpenBLAS to a version with a fix for this.
