BUG: numpy.linalg.inv returning different values on consecutive calls with some nan elements #20233

leobiec · 2021-10-29T14:31:10Z

Describe the issue:

I have a calculus of an inverse matrix in the middle of my code (some module developed by me) and debugging I found that numpy could not provide and inverse matrix. Also, only a few elements where reported as nan (don´t know if this is as expected or the complete matrix should be nan).

Anyway, if I stop the calculus at this point, and call again the inverse, it does provide an inverse matrix. Seems that something is changed in the numpy module in the first call, and not restored, so the next calls I have a different output.

The code I provide as an example does not work stand alone. I tried executing at the beginning of my complete code and it does provide the same result. So, something in my code previously has an effect on how numpy gets configured to the place where it does fail.

Sorry I cant provide the complete code and libraries. Don’t know how can I provide some stand alone code to reproduce this failure. Can I help the debugging of this issue in an other way?

Regards

Leonardo

Reproduce the code example:

a = np.array([[ 1.06902137e-08+1.03397577e-25j, -8.29800043e-09+1.46316136e-09j,
         1.06153905e-08+7.03229683e-10j, -8.26255952e-09+1.45691217e-09j],
       [-8.29800043e-09-1.46316136e-09j,  1.06902137e-08+0.00000000e+00j,
        -8.26255952e-09-1.45691217e-09j,  1.06153905e-08-7.03229683e-10j],
       [ 1.06153905e-08-7.03229683e-10j, -8.26255952e-09+1.45691217e-09j,
         1.06902137e-08+7.23783036e-25j, -8.29800043e-09+1.46316136e-09j],
       [-8.26255952e-09-1.45691217e-09j,  1.06153905e-08+7.03229683e-10j,
        -8.29800043e-09-1.46316136e-09j,  1.06902137e-08-7.23783036e-25j]])
print(np.linalg.inv(a))
print(np.linalg.inv(a))

[[            nan           +nanj -3.75115133e+11-5.40693281e+10j
  -3.79340139e+11-1.19099169e+11j  3.72964747e+11-6.57637077e+10j]
 [            nan           +nanj  3.98070598e+11-1.64825002e-01j
   3.72964747e+11+6.57637077e+10j -3.79340139e+11+1.19099169e+11j]
 [            nan           +nanj  3.72964747e+11-6.57637077e+10j
   3.98070588e+11-1.88123676e-01j -3.34000150e+11+1.79105405e+11j]
 [            nan+6.57637077e+10j -3.79340139e+11-1.19099169e+11j
  -3.34000150e+11-1.79105405e+11j  3.98070588e+11-1.64492045e-01j]]
[[ 3.98070598e+11-1.88316777e-01j -3.75115133e+11-5.40693281e+10j
  -3.79340139e+11-1.19099169e+11j  3.72964747e+11-6.57637077e+10j]
 [-3.75115133e+11+5.40693281e+10j  3.98070598e+11-1.64825002e-01j
   3.72964747e+11+6.57637077e+10j -3.79340139e+11+1.19099169e+11j]
 [-3.79340139e+11+1.19099169e+11j  3.72964747e+11-6.57637077e+10j
   3.98070588e+11-1.88123676e-01j -3.34000150e+11+1.79105405e+11j]
 [ 3.72964747e+11+6.57637077e+10j -3.79340139e+11-1.19099169e+11j
  -3.34000150e+11-1.79105405e+11j  3.98070588e+11-1.64492045e-01j]]

Error message:

No response

NumPy/Python version information:

numpy version 1.20.1

leobiec · 2021-10-29T15:01:26Z

I just found that calling for the second time not allways solve the problem. I added the next code for the moment, and i found that at least one time it was calculated twice to get non nan result.

RpSubInv = np.linalg.inv(RpSub)
if np.isnan(RpSubInv).any():
    RpSubInv = np.linalg.inv(RpSub)
if np.isnan(RpSubInv).any():
    RpSubInv = np.linalg.inv(RpSub)
if np.isnan(RpSubInv).any():
    print('No way...')

seberg · 2021-10-29T15:10:12Z

@leobiec can you say how NumPy was installed, what OS you are using? giving the output of np.show_config() may also help. If there is an issue it is with the linear algebra implementation probably.

E.g. is this using openblas, MKL, threaded? We then should try things like running with OMP_NUM_THREADS=1, or try running/pasting this script (if it gives any reasonable output): https://gist.githubusercontent.com/seberg/ce4563f3cb00e33997e4f80675b80953/raw/535ffd09dfbd6d41860ecf1effd8edfcea698810/guess_numpy_blas.py

The matrix does have a very large condition-number and the singular values are all rather small, but I do get pretty much the same result, so I doubt that is the issue?

leobiec · 2021-10-29T16:01:53Z

Hi @seberg,

Besides the matrix condition, i think the different results in succesive callings should not expected.

np.show_config()
blas_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/ProgramData/Anaconda3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/ProgramData/Anaconda3\\Library\\include']
blas_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/ProgramData/Anaconda3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/ProgramData/Anaconda3\\Library\\include']
lapack_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/ProgramData/Anaconda3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/ProgramData/Anaconda3\\Library\\include']
lapack_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/ProgramData/Anaconda3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/ProgramData/Anaconda3\\Library\\include']

The script returned:

Probing Multiarray
------------------
Unable to guess BLAS implementation, it is not one of: dict_values(['openblas', 'atlas', 'blis', 'MKL', 'accelerate'])
    or additional symbols are not loaded?!

and how can I do this:

We then should try things like running with OMP_NUM_THREADS=1

Don't have any problem in making a google meet to show you what is happening if it helps and to figure out what can we log

seberg · 2021-10-29T16:09:13Z

Well, it looks like you are using anaconda with MKL, so if there is an issue it is almost certainly with MKL. I am not an MKL specialist although someone else here may have an idea. Especially if you are threading yourself, trying to set the number of threads to 1 in mkl (e.g. using threadpoolctl may be worth a shot.

bashtage · 2021-10-30T12:25:55Z

I get the same with MKL + Ryzen. You might have a hardware problem. You might try running a memory test.

seberg · 2021-10-30T14:19:12Z

Running a memory test could be interesting. My list of things to try would be:

Try to reduce the number of threads to 1 (disabling threading, especially if your app is threaded).
See if MKL can select a different "kernel" (optimization). Maybe not, and if you are on AMD newer MKL auto-selects crippled kernels anyway (but it may be that anaconda ships older MKLs for that exact reason)
Use OpenBLAS: It probably just means uninstalling some numpy+mkl conda package or so, but I forget since I do not use anaconda regularly.
To help see if there is something else going on: Figure out what is special about your app (multi-threading, ..., ?)

Switching to openblas is probably the easiest, promising, "hot fix" candidate, plus if problems persist we really know something else is very wrong.

leobiec · 2021-11-01T17:39:52Z

Hi everyone.

@bashtage, thanks for your collaboration. I do get the same result if a run the example code I gave in the issue. As I mentioned, the code is not self-contained. But when ran in the main script (in debug or not debug mode) I get the different results.
I don't think it is a memory problem since it fails when running from Jupyter notebook, exported py and also debug mode (don't think the memory usage would be the same).

I think something inside numpy is changing in the middle of my code and again there in the first call.

Is there a way to save complete actual workspace status? Or an other way I can show you the failure since the code is not self-cointaned?

I am not doing any multi-threading in the code. If the modules or the conda environment is doing it I do not know. @seberg still you think it would be useful to switch to openblas? I don´t know how to do this and it will take a bit of time to figure out.

bashtage · 2021-11-01T18:24:09Z

I don't think it is a memory problem since it fails when running from Jupyter notebook, exported py and also debug mode (don't think the memory usage would be the same).

IME this is precisely what I would expect if it was something deep like a hardware problem. Things like physical memory are all over the place and would affect python in any environment.

Do you see the same if you install all packages using pip (this will use OpenBLAS)? Basically

conda create -n use-pip python=3.8
conda activate use-pip
pip install numpy ipython

then run your script.

I think something inside numpy is changing in the middle of my code and again there in the first call.

IMO the only way something internal is changing is if memory is corrupt so that it is changing effectively at random.

leobiec · 2021-11-04T13:38:30Z

So, I run memory test and nothing wrong was found.

I created two environments both with conda create (don't know if it is ok or if I must have created a pip virtual env) but in one of them I installed with pip and in the other with conda.

name: basic_conda_env
channels:
  - defaults
dependencies:
  - argon2-cffi=20.1.0=py38h2bbff1b_1
  - async_generator=1.10=pyhd3eb1b0_0
  - attrs=21.2.0=pyhd3eb1b0_0
  - backcall=0.2.0=pyhd3eb1b0_0
  - blas=1.0=mkl
  - bleach=4.0.0=pyhd3eb1b0_0
  - brotli=1.0.9=ha925a31_2
  - ca-certificates=2021.10.26=haa95532_2
  - certifi=2021.10.8=py38haa95532_0
  - cffi=1.14.6=py38h2bbff1b_0
  - colorama=0.4.4=pyhd3eb1b0_0
  - cycler=0.10.0=py38_0
  - debugpy=1.4.1=py38hd77b12b_0
  - decorator=5.1.0=pyhd3eb1b0_0
  - defusedxml=0.7.1=pyhd3eb1b0_0
  - entrypoints=0.3=py38_0
  - fonttools=4.25.0=pyhd3eb1b0_0
  - freetype=2.11.0=ha860e81_0
  - icc_rt=2019.0.0=h0cc432a_1
  - icu=58.2=ha925a31_3
  - importlib-metadata=4.8.1=py38haa95532_0
  - importlib_metadata=4.8.1=hd3eb1b0_0
  - intel-openmp=2021.4.0=haa95532_3556
  - ipykernel=6.4.1=py38haa95532_1
  - ipython=7.29.0=py38hd4e2768_0
  - ipython_genutils=0.2.0=pyhd3eb1b0_1
  - ipywidgets=7.6.5=pyhd3eb1b0_1
  - jedi=0.18.0=py38haa95532_1
  - jinja2=3.0.2=pyhd3eb1b0_0
  - jpeg=9d=h2bbff1b_0
  - jsonschema=3.2.0=pyhd3eb1b0_2
  - jupyter=1.0.0=py38_7
  - jupyter_client=7.0.1=pyhd3eb1b0_0
  - jupyter_console=6.4.0=pyhd3eb1b0_0
  - jupyter_core=4.8.1=py38haa95532_0
  - jupyterlab_pygments=0.1.2=py_0
  - jupyterlab_widgets=1.0.0=pyhd3eb1b0_1
  - kiwisolver=1.3.1=py38hd77b12b_0
  - libpng=1.6.37=h2a8f88b_0
  - libtiff=4.2.0=hd0e1b90_0
  - libwebp=1.2.0=h2bbff1b_0
  - lz4-c=1.9.3=h2bbff1b_1
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - markupsafe=2.0.1=py38h2bbff1b_0
  - matplotlib=3.4.3=py38haa95532_0
  - matplotlib-base=3.4.3=py38h49ac443_0
  - matplotlib-inline=0.1.2=pyhd3eb1b0_2
  - mistune=0.8.4=py38he774522_1000
  - mkl=2021.4.0=haa95532_640
  - mkl-service=2.4.0=py38h2bbff1b_0
  - mkl_fft=1.3.1=py38h277e83a_0
  - mkl_random=1.2.2=py38hf11a4ad_0
  - msys2-conda-epoch=20160418=1
  - munkres=1.1.4=py_0
  - nbclient=0.5.3=pyhd3eb1b0_0
  - nbconvert=6.1.0=py38haa95532_0
  - nbformat=5.1.3=pyhd3eb1b0_0
  - nest-asyncio=1.5.1=pyhd3eb1b0_0
  - notebook=6.4.5=py38haa95532_0
  - numpy=1.21.2=py38hfca59bb_0
  - numpy-base=1.21.2=py38h0829f74_0
  - olefile=0.46=pyhd3eb1b0_0
  - openssl=1.1.1l=h2bbff1b_0
  - packaging=21.0=pyhd3eb1b0_0
  - pandocfilters=1.4.3=py38haa95532_1
  - parso=0.8.2=pyhd3eb1b0_0
  - pickleshare=0.7.5=pyhd3eb1b0_1003
  - pillow=8.4.0=py38hd45dc43_0
  - pip=21.0.1=py38haa95532_0
  - prometheus_client=0.11.0=pyhd3eb1b0_0
  - prompt-toolkit=3.0.20=pyhd3eb1b0_0
  - prompt_toolkit=3.0.20=hd3eb1b0_0
  - pycparser=2.20=py_2
  - pygments=2.10.0=pyhd3eb1b0_0
  - pyparsing=3.0.4=pyhd3eb1b0_0
  - pyqt=5.9.2=py38ha925a31_4
  - pyrsistent=0.17.3=py38he774522_0
  - python=3.8.12=h6244533_0
  - python-dateutil=2.8.2=pyhd3eb1b0_0
  - pywin32=228=py38hbaba5e8_1
  - pywinpty=0.5.7=py38_0
  - pyzmq=22.2.1=py38hd77b12b_1
  - qt=5.9.7=vc14h73c81de_0
  - qtconsole=5.1.1=pyhd3eb1b0_0
  - qtpy=1.10.0=pyhd3eb1b0_0
  - scipy=1.7.1=py38hbe87c03_2
  - send2trash=1.8.0=pyhd3eb1b0_1
  - setuptools=58.0.4=py38haa95532_0
  - sip=4.19.13=py38ha925a31_0
  - six=1.16.0=pyhd3eb1b0_0
  - sqlite=3.36.0=h2bbff1b_0
  - terminado=0.9.4=py38haa95532_0
  - testpath=0.5.0=pyhd3eb1b0_0
  - tk=8.6.11=h2bbff1b_0
  - tornado=6.1=py38h2bbff1b_0
  - traitlets=5.1.0=pyhd3eb1b0_0
  - vc=14.2=h21ff451_1
  - vs2015_runtime=14.27.29016=h5e58377_2
  - wcwidth=0.2.5=pyhd3eb1b0_0
  - webencodings=0.5.1=py38_1
  - wheel=0.37.0=pyhd3eb1b0_1
  - widgetsnbextension=3.5.1=py38_0
  - wincertstore=0.2=py38haa95532_2
  - winpty=0.4.3=4
  - xz=5.2.5=h62dcd97_0
  - zipp=3.6.0=pyhd3eb1b0_0
  - zlib=1.2.11=h62dcd97_4
  - zstd=1.4.9=h19a0ad4_0

and

name: basic_pip_env
channels:
  - defaults
dependencies:
  - ca-certificates=2021.10.26=haa95532_2
  - certifi=2021.10.8=py38haa95532_0
  - openssl=1.1.1l=h2bbff1b_0
  - pip=21.0.1=py38haa95532_0
  - python=3.8.12=h6244533_0
  - setuptools=58.0.4=py38haa95532_0
  - sqlite=3.36.0=h2bbff1b_0
  - vc=14.2=h21ff451_1
  - vs2015_runtime=14.27.29016=h5e58377_2
  - wheel=0.37.0=pyhd3eb1b0_1
  - wincertstore=0.2=py38haa95532_2
  - pip:
    - argon2-cffi==21.1.0
    - attrs==21.2.0
    - backcall==0.2.0
    - bleach==4.1.0
    - cffi==1.15.0
    - colorama==0.4.4
    - cycler==0.11.0
    - debugpy==1.5.1
    - decorator==5.1.0
    - defusedxml==0.7.1
    - entrypoints==0.3
    - ipykernel==6.5.0
    - ipython==7.29.0
    - ipython-genutils==0.2.0
    - ipywidgets==7.6.5
    - jedi==0.18.0
    - jinja2==3.0.2
    - jsonschema==4.1.2
    - jupyter==1.0.0
    - jupyter-client==7.0.6
    - jupyter-console==6.4.0
    - jupyter-core==4.9.1
    - jupyterlab-pygments==0.1.2
    - jupyterlab-widgets==1.0.2
    - kiwisolver==1.3.2
    - markupsafe==2.0.1
    - matplotlib==3.4.3
    - matplotlib-inline==0.1.3
    - mistune==0.8.4
    - nbclient==0.5.4
    - nbconvert==6.2.0
    - nbformat==5.1.3
    - nest-asyncio==1.5.1
    - notebook==6.4.5
    - numpy==1.21.2
    - packaging==21.2
    - pandocfilters==1.5.0
    - parso==0.8.2
    - pickleshare==0.7.5
    - pillow==8.4.0
    - prometheus-client==0.12.0
    - prompt-toolkit==3.0.21
    - pycparser==2.20
    - pygments==2.10.0
    - pyparsing==2.4.7
    - pyrsistent==0.18.0
    - python-dateutil==2.8.2
    - pywin32==302
    - pywinpty==1.1.5
    - pyzmq==22.3.0
    - qtconsole==5.1.1
    - qtpy==1.11.2
    - scipy==1.7.1
    - send2trash==1.8.0
    - six==1.16.0
    - terminado==0.12.1
    - testpath==0.5.0
    - tornado==6.1
    - traitlets==5.1.1
    - wcwidth==0.2.5
    - webencodings==0.5.1
    - widgetsnbextension==3.5.2

Now, if I run in basic_conda_env both in debug mode, running to terminal, or in Jupyter notewbook I see numpy delivering the wrong output (the matrix with some nan elements). Each of them solved with one iteration as I mentioned before. The number of times it fails is not the same if I run in jupyter notebook as in the other cases… But for each cases, the number of failures does not seems to be changing at all!

If I run with the basic_pip_env it does not fail at all in any case!!

In the middle of this, I have reinstalled VS Code, I updated my conda base environment and have to downgrade due to reported issues. I mean, I changed everything by with this new environments I am getting the same deterministic results every time.

So, what can I try? Is there a problem with MKL? Have no idea what this is.

Besides this annoying different unsuspected behaviour, is it ok that numpy returns a matrix with only some elements as nan when calling the inverse?
I don't think it would be correct, since if due to numeric reasons numpy cant provide an inverse, no element in the matrix should be numeric type. Otherwise, in all the scripts where we calculate an inverse we should need some evaluation as “(x==np.nan).any() or (x==np.nan+np.nan*1j).any()”. In case we calculate inverse and just pick some element (or part) of the inverse matrix we would be using an erroneous value. If we multiply to any other matrix or vector we get all nan output and this cases is covered but not the one I pointed out before.

leobiec · 2021-11-04T13:40:11Z

I used an other version of numpy also 1.21.2, the same in both environments

seberg · 2021-11-04T13:50:01Z

You could try creating an issue with Anaconda, maybe they at least know where to look. What your tries make almost fully certain is what we suspsected: The problem has to do with MKL (which is the Intel linear algebra library – or well a library which includes the linear algebra stuff). If you install with pip, you would get OpenBLAS (a different implementation). There should also be ways to remove MKL from your conda environment (or install OpenBLAS to choose that NumPy should use that instead).

bashtage · 2021-11-04T14:30:11Z

Edit: Looks like you did. Do you get identical results on the pip version as you do the conda version?

leobiec · 2021-11-04T15:25:23Z

Edit: Looks like you did. Do you get identical results on the pip version as you do the conda version?

I have tested the complete script output (easier for me) and I find a maximum relative error of 6.763964020728518e-15 (difference relative to the maximum absolute value of the final output matrix). I think (without any analysis) this might be in the order of the numeric type precision.

There is nothing random in the calculus and I double checked the difference does not change from one run to the next.

leobiec · 2021-11-04T15:27:04Z

@seberg, @bashtage

What can you tell me about the numpy output with only some nan elements in the matrix? is this ok or some validation must be added in numpy.linalg.inv?

bashtage · 2021-11-04T15:31:47Z

AFAICT you shouldn't have any NaN values, and their presence suggests a bug somewhere. If you don't get NaN in your pip install, then this is further evidence of a bug in MKL which provides the routines that are used to invert the array when you use the default conda forge install.

leobiec · 2021-11-04T15:35:59Z

You could try creating an issue with Anaconda, maybe they at least know where to look. What your tries make almost fully certain is what we suspsected: The problem has to do with MKL (which is the Intel linear algebra library – or well a library which includes the linear algebra stuff). If you install with pip, you would get OpenBLAS (a different implementation). There should also be ways to remove MKL from your conda environment (or install OpenBLAS to choose that NumPy should use that instead).

MKL unexpected behaviour in numpy.linalg.inv returning different values in successive calls with same input data #conda/conda#11023

bnavigator · 2021-11-04T20:29:18Z

I am seeing similar behavior on our GitHub Actions CI, which also uses conda/MKL and it drives me crazy. Sometimes the jobs succeed, sometimes they fail.

E.g. this test:

A = np.array([[-0.01+0.j,  0.  +0.j],
              [ 1.  +0.j,  0.  +0.j]])
B = np.array([0., 1.])
q = np.linalg.solve(A, B)

which should raise LinAlgError, but does not.

or this one:

A = np.array([[ 2.85933384+50.j, -0.02684555 +0.j,  0.23068221 +0.j],
              [-1.8278323  +0.j,  1.58857243+50.j,  0.87674723 +0.j],
              [-1.13524527 +0.j, -0.1945487  +0.j,  0.88113606+50.j]])
B = np.array([[-0.08275233],
              [-2.84057541],
              [-0.        ]])
q = np.linalg.solve(A, B)

should not return all nan, but does.

ilayn · 2021-11-04T20:29:38Z

Starting to smell like #16744 all over again.

seberg · 2021-11-04T20:47:13Z

Hmmm, if @bnavigator's issue is the same (and it looks very similar), then it is not windows specific, though! Do we know if the machines it fails on share e.g. certain CPU features/instruction sets? The end of np.show_config() may help a bit with guessing that.

@leobiec a bit of a long-shot, but can you reproduce this reliably if you repeat the operation in a loop (and it will eventually crash?)

bnavigator · 2021-11-04T21:06:54Z

? The end of np.show_config() may help a bit with guessing that.

Run source $CONDA/bin/activate test-environment
  source $CONDA/bin/activate test-environment
  # Use xvfb-run instead of pytest-xvfb to get proper mpl backend
  # Use coverage instead of pytest-cov to get .coverage file
  # See https://github.com/python-control/python-control/pull/504
  
  python3 -c 'import numpy; numpy.show_config()'
  
  xvfb-run --auto-servernum coverage run -m pytest control/tests
  shell: /usr/bin/bash -e {0}
  env:
    PYTHON_CONTROL_ARRAY_AND_MATRIX: 0
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/usr/share/miniconda/envs/test-environment/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/share/miniconda/envs/test-environment/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/usr/share/miniconda/envs/test-environment/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/share/miniconda/envs/test-environment/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/usr/share/miniconda/envs/test-environment/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/share/miniconda/envs/test-environment/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/usr/share/miniconda/envs/test-environment/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/share/miniconda/envs/test-environment/include']
============================= test session starts ==============================
platform linux -- Python 3.6.13, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/runner/work/python-control/python-control, configfile: setup.cfg
plugins: timeout-1.4.2
collected 2758 items

Now strangely enough. This particular job did NOT fail.

seberg · 2021-11-04T21:14:21Z

My bad, I was aiming for the CPU flags that NumPy use, but that information was added only recently to show_config. This is what I was going for:

from numpy.core._multiarray_umath import (
        __cpu_features__, __cpu_baseline__, __cpu_dispatch__)

features_found, features_not_found = [], []
for feature in __cpu_dispatch__:
    if __cpu_features__[feature]:
        features_found.append(feature)
    else:
        features_not_found.append(feature)

print("Supported SIMD extensions in this NumPy install:")
print("    baseline = %s" % (','.join(__cpu_baseline__)))
print("    found = %s" % (','.join(features_found)))
print("    not found = %s" % (','.join(features_not_found)))

but maybe someone else has a better approach... If we know that it reproduces reliably with very minimal code, we could also see if we can reproduce it in C. That would make it certain that the issue is in MKL and not some interplay.

bnavigator · 2021-11-04T21:29:05Z

Another Github Actions job:

  blas               pkgs/main/linux-64::blas-1.0-mkl
  mkl                pkgs/main/linux-64::mkl-2021.4.0-h06a4308_640
  mkl-service        pkgs/main/linux-64::mkl-service-2.4.0-py39h7f8727e_0
  mkl_fft            pkgs/main/linux-64::mkl_fft-1.3.1-py39hd3c417c_0
  mkl_random         pkgs/main/linux-64::mkl_random-1.2.2-py39h51133e4_0
  numpy              pkgs/main/linux-64::numpy-1.21.2-py39h20f2e39_0
  numpy-base         pkgs/main/linux-64::numpy-base-1.21.2-py39h79a1101_0
...

Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX
    not found = AVX512_KNL,AVX512_KNM,AVX512_CNL

This one returned

----------------------------- Captured stdout call -----------------------------
DEBUG np.linalg.solve(
,[[ 0.01+0.j  0.  +0.j]
 [-1.  +0.j  0.  +0.j]]
[[1.]
 [0.]] = 
[[nan+nanj]
 [nan+nanj]]
___________________

where it should have raised LinAlgError.

bnavigator · 2021-11-04T21:38:02Z

BTW trying to disable mkl in conda with conda install nomkl produces gems like

>       w = _umath_linalg.eigvals(a, signature=signature, extobj=extobj)
E       ValueError: On entry to DHSEQR parameter number 4 had an illegal value

/usr/share/miniconda/envs/test-environment/lib/python3.9/site-packages/numpy/linalg/linalg.py:1068: ValueError

seberg · 2021-11-04T21:41:43Z

@bnavigator uhoh, that sounds like something is going wrong earlier... This also happens randomly, I guess?

bnavigator · 2021-11-04T21:44:59Z

It happens at every job I tried so far: bnavigator/python-control#2

mattip · 2021-11-04T21:46:41Z

When you disable MKL, what do you get instead, and what version of that other thing?

mattip · 2021-11-04T21:50:42Z

Ahh, I see libopenblas-0.3.13. That is unfortunate, since the latest OpenBLAS release is 0.3.18, and some maybe relevant issues were fixed in the newer version

bnavigator · 2021-11-04T21:54:37Z

That's what conda is pulling in: https://anaconda.org/anaconda/openblas

  blas               pkgs/main/linux-64::blas-1.0-openblas
  libopenblas        pkgs/main/linux-64::libopenblas-0.3.13-h4367d64_0
  nomkl              pkgs/main/linux-64::nomkl-3.0-0
  numpy              pkgs/main/linux-64::numpy-1.21.2-py39hd8d4704_0
  numpy-base         pkgs/main/linux-64::numpy-base-1.21.2-py39h2b8c604_0
  scipy              pkgs/main/linux-64::scipy-1.7.1-py39hc65b3f8_2

a = array([[-4., -3.],
       [ 1.,  0.]])

    @array_function_dispatch(_unary_dispatcher)
    def eigvals(a):
        """
        Compute the eigenvalues of a general matrix.
        """
....

>       w = _umath_linalg.eigvals(a, signature=signature, extobj=extobj)
E       ValueError: On entry to DHSEQR parameter number 4 had an illegal value

/usr/share/miniconda/envs/test-environment/lib/python3.9/site-packages/numpy/linalg/linalg.py:1068: ValueError

bnavigator · 2021-11-04T21:54:59Z

Conda seems utterly broken to me.

bashtage · 2021-11-04T21:58:09Z

Can you try pinning MKL to something like mkl=2021.1.1 or even older, mkl=2020.2? Curious if this is a new MKL issue.

leobiec · 2021-11-04T22:04:29Z

Hmmm, if @bnavigator's issue is the same (and it looks very similar), then it is not windows specific, though! Do we know if the machines it fails on share e.g. certain CPU features/instruction sets? The end of np.show_config() may help a bit with guessing that.

@leobiec a bit of a long-shot, but can you reproduce this reliably if you repeat the operation in a loop (and it will eventually crash?)

Hi @seberg. About 5 times i have ran my complete script with each conda and pip environment. I don't see nothing random. It seems that my code is getting to the point of failure with the same input all the times and making it crash the first time.

Please do not forget that the inmediate call again to linalg.inv with the same input does return an other value.

bashtage · 2021-11-04T22:06:55Z

I can't reproduce locally on either Windows or Linux with MKL 2021.4 and NumPy 1.21.2. Probably need to report CPU version so see if it matters.

bnavigator · 2021-11-04T22:08:39Z

Conda mkl-2021.2.0: Np debug mkl2021 bnavigator/python-control#5 -- if forces slycot to pull in netlib LAPACK and later fails on matplotlib despite run with xvfb. -- But the job without Slycot succeeds! 🟢
Conda-forge openblas: Conda forge blas bnavigator/python-control#4 🟢
Conda-forge mkl: Conda forge mkl bnavigator/python-control#6 🟢

bashtage · 2021-11-04T22:18:39Z

That OpenBlas was 0.3.17, so quite a bit newer than earlier.

bnavigator · 2021-11-04T22:20:12Z

Yeah, they probably don't care about openblas in the main conda much because MKL is the default.

bashtage · 2021-11-04T22:21:41Z

@bnavigator Your conda-forge MKL build is actually using openblas for some reason.

bnavigator · 2021-11-04T22:40:32Z

Thanks for noticing @bashtage. Fixed now. I now remember why setting up the CI for Slycot was such a convoluted task.

leobiec added the 00 - Bug label Oct 29, 2021

bashtage mentioned this issue Nov 2, 2021

BUG: np.linalg.eigvalsh and np.linalg.eigh returns arbitrary values for nan input, but they make different mistakes #20280

Open

bnavigator mentioned this issue Nov 4, 2021

np.linalg.solve broken on CI python-control/python-control#664

Closed

bnavigator mentioned this issue Nov 4, 2021

MKL unexpected behaviour in numpy.linalg.inv returning different values in successive calls with same input data conda/conda#11023

Closed

4 tasks

bnavigator mentioned this issue Nov 4, 2021

Use conda-forge for numpy and co. python-control/python-control#667

Merged

seberg mentioned this issue Nov 12, 2021

BUG: Calling np.exp(0) causes many lapack calls to return NaNs on CPUs with AVX512 and MKL #20356

Closed

grgmiller mentioned this issue Dec 28, 2022

Fix consumed calc singularity-energy/open-grid-emissions#266

Merged

This was referenced Feb 24, 2023

Remove scipy dependency by replacing eigvals() function ritchie46/anaStruct#138

Draft

Why both SciPy and NumPy? (edit: help wanted to work around an upstream Numpy bug!) ritchie46/anaStruct#137

Open

ricardoV94 mentioned this issue Oct 12, 2023

Flaky tensor/test_blockwise.py::TestMatrixInverse::test_grad pymc-devs/pytensor#431

Closed

BUG: numpy.linalg.inv returning different values on consecutive calls with some nan elements #20233

BUG: numpy.linalg.inv returning different values on consecutive calls with some nan elements #20233

Comments

leobiec commented Oct 29, 2021

Describe the issue:

Reproduce the code example:

Error message:

NumPy/Python version information:

leobiec commented Oct 29, 2021

seberg commented Oct 29, 2021

leobiec commented Oct 29, 2021

seberg commented Oct 29, 2021

bashtage commented Oct 30, 2021

seberg commented Oct 30, 2021

leobiec commented Nov 1, 2021

bashtage commented Nov 1, 2021 • edited

leobiec commented Nov 4, 2021

leobiec commented Nov 4, 2021

seberg commented Nov 4, 2021

bashtage commented Nov 4, 2021 • edited

leobiec commented Nov 4, 2021

leobiec commented Nov 4, 2021

bashtage commented Nov 4, 2021 • edited

leobiec commented Nov 4, 2021

bnavigator commented Nov 4, 2021 • edited

ilayn commented Nov 4, 2021

seberg commented Nov 4, 2021

bnavigator commented Nov 4, 2021 • edited by seberg

seberg commented Nov 4, 2021

bnavigator commented Nov 4, 2021

bnavigator commented Nov 4, 2021

seberg commented Nov 4, 2021

bnavigator commented Nov 4, 2021

mattip commented Nov 4, 2021

mattip commented Nov 4, 2021

bnavigator commented Nov 4, 2021

bnavigator commented Nov 4, 2021

bashtage commented Nov 4, 2021

leobiec commented Nov 4, 2021

bashtage commented Nov 4, 2021

bnavigator commented Nov 4, 2021 • edited

bashtage commented Nov 4, 2021

bnavigator commented Nov 4, 2021

bashtage commented Nov 4, 2021

bnavigator commented Nov 4, 2021

bashtage commented Nov 1, 2021 •

edited

bashtage commented Nov 4, 2021 •

edited

bashtage commented Nov 4, 2021 •

edited

bnavigator commented Nov 4, 2021 •

edited

bnavigator commented Nov 4, 2021 •

edited by seberg

bnavigator commented Nov 4, 2021 •

edited