Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: numpy.linalg.inv returning different values on consecutive calls with some nan elements #20233

Open
leobiec opened this issue Oct 29, 2021 · 37 comments
Labels

Comments

@leobiec
Copy link

leobiec commented Oct 29, 2021

Describe the issue:

I have a calculus of an inverse matrix in the middle of my code (some module developed by me) and debugging I found that numpy could not provide and inverse matrix. Also, only a few elements where reported as nan (don´t know if this is as expected or the complete matrix should be nan).

Anyway, if I stop the calculus at this point, and call again the inverse, it does provide an inverse matrix. Seems that something is changed in the numpy module in the first call, and not restored, so the next calls I have a different output.

The code I provide as an example does not work stand alone. I tried executing at the beginning of my complete code and it does provide the same result. So, something in my code previously has an effect on how numpy gets configured to the place where it does fail.

Sorry I cant provide the complete code and libraries. Don’t know how can I provide some stand alone code to reproduce this failure. Can I help the debugging of this issue in an other way?

Regards

Leonardo

Reproduce the code example:

a = np.array([[ 1.06902137e-08+1.03397577e-25j, -8.29800043e-09+1.46316136e-09j,
         1.06153905e-08+7.03229683e-10j, -8.26255952e-09+1.45691217e-09j],
       [-8.29800043e-09-1.46316136e-09j,  1.06902137e-08+0.00000000e+00j,
        -8.26255952e-09-1.45691217e-09j,  1.06153905e-08-7.03229683e-10j],
       [ 1.06153905e-08-7.03229683e-10j, -8.26255952e-09+1.45691217e-09j,
         1.06902137e-08+7.23783036e-25j, -8.29800043e-09+1.46316136e-09j],
       [-8.26255952e-09-1.45691217e-09j,  1.06153905e-08+7.03229683e-10j,
        -8.29800043e-09-1.46316136e-09j,  1.06902137e-08-7.23783036e-25j]])
print(np.linalg.inv(a))
print(np.linalg.inv(a))

[[            nan           +nanj -3.75115133e+11-5.40693281e+10j
  -3.79340139e+11-1.19099169e+11j  3.72964747e+11-6.57637077e+10j]
 [            nan           +nanj  3.98070598e+11-1.64825002e-01j
   3.72964747e+11+6.57637077e+10j -3.79340139e+11+1.19099169e+11j]
 [            nan           +nanj  3.72964747e+11-6.57637077e+10j
   3.98070588e+11-1.88123676e-01j -3.34000150e+11+1.79105405e+11j]
 [            nan+6.57637077e+10j -3.79340139e+11-1.19099169e+11j
  -3.34000150e+11-1.79105405e+11j  3.98070588e+11-1.64492045e-01j]]
[[ 3.98070598e+11-1.88316777e-01j -3.75115133e+11-5.40693281e+10j
  -3.79340139e+11-1.19099169e+11j  3.72964747e+11-6.57637077e+10j]
 [-3.75115133e+11+5.40693281e+10j  3.98070598e+11-1.64825002e-01j
   3.72964747e+11+6.57637077e+10j -3.79340139e+11+1.19099169e+11j]
 [-3.79340139e+11+1.19099169e+11j  3.72964747e+11-6.57637077e+10j
   3.98070588e+11-1.88123676e-01j -3.34000150e+11+1.79105405e+11j]
 [ 3.72964747e+11+6.57637077e+10j -3.79340139e+11-1.19099169e+11j
  -3.34000150e+11-1.79105405e+11j  3.98070588e+11-1.64492045e-01j]]

Error message:

No response

NumPy/Python version information:

numpy version 1.20.1

@leobiec
Copy link
Author

leobiec commented Oct 29, 2021

I just found that calling for the second time not allways solve the problem. I added the next code for the moment, and i found that at least one time it was calculated twice to get non nan result.

RpSubInv = np.linalg.inv(RpSub)
if np.isnan(RpSubInv).any():
    RpSubInv = np.linalg.inv(RpSub)
if np.isnan(RpSubInv).any():
    RpSubInv = np.linalg.inv(RpSub)
if np.isnan(RpSubInv).any():
    print('No way...')

@seberg
Copy link
Member

seberg commented Oct 29, 2021

@leobiec can you say how NumPy was installed, what OS you are using? giving the output of np.show_config() may also help. If there is an issue it is with the linear algebra implementation probably.

E.g. is this using openblas, MKL, threaded? We then should try things like running with OMP_NUM_THREADS=1, or try running/pasting this script (if it gives any reasonable output): https://gist.githubusercontent.com/seberg/ce4563f3cb00e33997e4f80675b80953/raw/535ffd09dfbd6d41860ecf1effd8edfcea698810/guess_numpy_blas.py

The matrix does have a very large condition-number and the singular values are all rather small, but I do get pretty much the same result, so I doubt that is the issue?

@leobiec
Copy link
Author

leobiec commented Oct 29, 2021

Hi @seberg,

Besides the matrix condition, i think the different results in succesive callings should not expected.

np.show_config()
blas_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/ProgramData/Anaconda3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/ProgramData/Anaconda3\\Library\\include']
blas_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/ProgramData/Anaconda3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/ProgramData/Anaconda3\\Library\\include']
lapack_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/ProgramData/Anaconda3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/ProgramData/Anaconda3\\Library\\include']
lapack_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/ProgramData/Anaconda3\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/ProgramData/Anaconda3\\Library\\include']

The script returned:

Probing Multiarray
------------------
Unable to guess BLAS implementation, it is not one of: dict_values(['openblas', 'atlas', 'blis', 'MKL', 'accelerate'])
    or additional symbols are not loaded?!

and how can I do this:

We then should try things like running with OMP_NUM_THREADS=1

Don't have any problem in making a google meet to show you what is happening if it helps and to figure out what can we log

@seberg
Copy link
Member

seberg commented Oct 29, 2021

Well, it looks like you are using anaconda with MKL, so if there is an issue it is almost certainly with MKL. I am not an MKL specialist although someone else here may have an idea. Especially if you are threading yourself, trying to set the number of threads to 1 in mkl (e.g. using threadpoolctl may be worth a shot.

@bashtage
Copy link
Contributor

I get the same with MKL + Ryzen. You might have a hardware problem. You might try running a memory test.

@seberg
Copy link
Member

seberg commented Oct 30, 2021

Running a memory test could be interesting. My list of things to try would be:

  • Try to reduce the number of threads to 1 (disabling threading, especially if your app is threaded).
  • See if MKL can select a different "kernel" (optimization). Maybe not, and if you are on AMD newer MKL auto-selects crippled kernels anyway (but it may be that anaconda ships older MKLs for that exact reason)
  • Use OpenBLAS: It probably just means uninstalling some numpy+mkl conda package or so, but I forget since I do not use anaconda regularly.
  • To help see if there is something else going on: Figure out what is special about your app (multi-threading, ..., ?)

Switching to openblas is probably the easiest, promising, "hot fix" candidate, plus if problems persist we really know something else is very wrong.

@leobiec
Copy link
Author

leobiec commented Nov 1, 2021

Hi everyone.

@bashtage, thanks for your collaboration. I do get the same result if a run the example code I gave in the issue. As I mentioned, the code is not self-contained. But when ran in the main script (in debug or not debug mode) I get the different results.
I don't think it is a memory problem since it fails when running from Jupyter notebook, exported py and also debug mode (don't think the memory usage would be the same).

I think something inside numpy is changing in the middle of my code and again there in the first call.

Is there a way to save complete actual workspace status? Or an other way I can show you the failure since the code is not self-cointaned?

I am not doing any multi-threading in the code. If the modules or the conda environment is doing it I do not know. @seberg still you think it would be useful to switch to openblas? I don´t know how to do this and it will take a bit of time to figure out.

@bashtage
Copy link
Contributor

bashtage commented Nov 1, 2021

I don't think it is a memory problem since it fails when running from Jupyter notebook, exported py and also debug mode (don't think the memory usage would be the same).

IME this is precisely what I would expect if it was something deep like a hardware problem. Things like physical memory are all over the place and would affect python in any environment.

Do you see the same if you install all packages using pip (this will use OpenBLAS)? Basically

conda create -n use-pip python=3.8
conda activate use-pip
pip install numpy ipython

then run your script.

I think something inside numpy is changing in the middle of my code and again there in the first call.

IMO the only way something internal is changing is if memory is corrupt so that it is changing effectively at random.

@leobiec
Copy link
Author

leobiec commented Nov 4, 2021

So, I run memory test and nothing wrong was found.

I created two environments both with conda create (don't know if it is ok or if I must have created a pip virtual env) but in one of them I installed with pip and in the other with conda.

name: basic_conda_env
channels:
  - defaults
dependencies:
  - argon2-cffi=20.1.0=py38h2bbff1b_1
  - async_generator=1.10=pyhd3eb1b0_0
  - attrs=21.2.0=pyhd3eb1b0_0
  - backcall=0.2.0=pyhd3eb1b0_0
  - blas=1.0=mkl
  - bleach=4.0.0=pyhd3eb1b0_0
  - brotli=1.0.9=ha925a31_2
  - ca-certificates=2021.10.26=haa95532_2
  - certifi=2021.10.8=py38haa95532_0
  - cffi=1.14.6=py38h2bbff1b_0
  - colorama=0.4.4=pyhd3eb1b0_0
  - cycler=0.10.0=py38_0
  - debugpy=1.4.1=py38hd77b12b_0
  - decorator=5.1.0=pyhd3eb1b0_0
  - defusedxml=0.7.1=pyhd3eb1b0_0
  - entrypoints=0.3=py38_0
  - fonttools=4.25.0=pyhd3eb1b0_0
  - freetype=2.11.0=ha860e81_0
  - icc_rt=2019.0.0=h0cc432a_1
  - icu=58.2=ha925a31_3
  - importlib-metadata=4.8.1=py38haa95532_0
  - importlib_metadata=4.8.1=hd3eb1b0_0
  - intel-openmp=2021.4.0=haa95532_3556
  - ipykernel=6.4.1=py38haa95532_1
  - ipython=7.29.0=py38hd4e2768_0
  - ipython_genutils=0.2.0=pyhd3eb1b0_1
  - ipywidgets=7.6.5=pyhd3eb1b0_1
  - jedi=0.18.0=py38haa95532_1
  - jinja2=3.0.2=pyhd3eb1b0_0
  - jpeg=9d=h2bbff1b_0
  - jsonschema=3.2.0=pyhd3eb1b0_2
  - jupyter=1.0.0=py38_7
  - jupyter_client=7.0.1=pyhd3eb1b0_0
  - jupyter_console=6.4.0=pyhd3eb1b0_0
  - jupyter_core=4.8.1=py38haa95532_0
  - jupyterlab_pygments=0.1.2=py_0
  - jupyterlab_widgets=1.0.0=pyhd3eb1b0_1
  - kiwisolver=1.3.1=py38hd77b12b_0
  - libpng=1.6.37=h2a8f88b_0
  - libtiff=4.2.0=hd0e1b90_0
  - libwebp=1.2.0=h2bbff1b_0
  - lz4-c=1.9.3=h2bbff1b_1
  - m2w64-gcc-libgfortran=5.3.0=6
  - m2w64-gcc-libs=5.3.0=7
  - m2w64-gcc-libs-core=5.3.0=7
  - m2w64-gmp=6.1.0=2
  - m2w64-libwinpthread-git=5.0.0.4634.697f757=2
  - markupsafe=2.0.1=py38h2bbff1b_0
  - matplotlib=3.4.3=py38haa95532_0
  - matplotlib-base=3.4.3=py38h49ac443_0
  - matplotlib-inline=0.1.2=pyhd3eb1b0_2
  - mistune=0.8.4=py38he774522_1000
  - mkl=2021.4.0=haa95532_640
  - mkl-service=2.4.0=py38h2bbff1b_0
  - mkl_fft=1.3.1=py38h277e83a_0
  - mkl_random=1.2.2=py38hf11a4ad_0
  - msys2-conda-epoch=20160418=1
  - munkres=1.1.4=py_0
  - nbclient=0.5.3=pyhd3eb1b0_0
  - nbconvert=6.1.0=py38haa95532_0
  - nbformat=5.1.3=pyhd3eb1b0_0
  - nest-asyncio=1.5.1=pyhd3eb1b0_0
  - notebook=6.4.5=py38haa95532_0
  - numpy=1.21.2=py38hfca59bb_0
  - numpy-base=1.21.2=py38h0829f74_0
  - olefile=0.46=pyhd3eb1b0_0
  - openssl=1.1.1l=h2bbff1b_0
  - packaging=21.0=pyhd3eb1b0_0
  - pandocfilters=1.4.3=py38haa95532_1
  - parso=0.8.2=pyhd3eb1b0_0
  - pickleshare=0.7.5=pyhd3eb1b0_1003
  - pillow=8.4.0=py38hd45dc43_0
  - pip=21.0.1=py38haa95532_0
  - prometheus_client=0.11.0=pyhd3eb1b0_0
  - prompt-toolkit=3.0.20=pyhd3eb1b0_0
  - prompt_toolkit=3.0.20=hd3eb1b0_0
  - pycparser=2.20=py_2
  - pygments=2.10.0=pyhd3eb1b0_0
  - pyparsing=3.0.4=pyhd3eb1b0_0
  - pyqt=5.9.2=py38ha925a31_4
  - pyrsistent=0.17.3=py38he774522_0
  - python=3.8.12=h6244533_0
  - python-dateutil=2.8.2=pyhd3eb1b0_0
  - pywin32=228=py38hbaba5e8_1
  - pywinpty=0.5.7=py38_0
  - pyzmq=22.2.1=py38hd77b12b_1
  - qt=5.9.7=vc14h73c81de_0
  - qtconsole=5.1.1=pyhd3eb1b0_0
  - qtpy=1.10.0=pyhd3eb1b0_0
  - scipy=1.7.1=py38hbe87c03_2
  - send2trash=1.8.0=pyhd3eb1b0_1
  - setuptools=58.0.4=py38haa95532_0
  - sip=4.19.13=py38ha925a31_0
  - six=1.16.0=pyhd3eb1b0_0
  - sqlite=3.36.0=h2bbff1b_0
  - terminado=0.9.4=py38haa95532_0
  - testpath=0.5.0=pyhd3eb1b0_0
  - tk=8.6.11=h2bbff1b_0
  - tornado=6.1=py38h2bbff1b_0
  - traitlets=5.1.0=pyhd3eb1b0_0
  - vc=14.2=h21ff451_1
  - vs2015_runtime=14.27.29016=h5e58377_2
  - wcwidth=0.2.5=pyhd3eb1b0_0
  - webencodings=0.5.1=py38_1
  - wheel=0.37.0=pyhd3eb1b0_1
  - widgetsnbextension=3.5.1=py38_0
  - wincertstore=0.2=py38haa95532_2
  - winpty=0.4.3=4
  - xz=5.2.5=h62dcd97_0
  - zipp=3.6.0=pyhd3eb1b0_0
  - zlib=1.2.11=h62dcd97_4
  - zstd=1.4.9=h19a0ad4_0

and

name: basic_pip_env
channels:
  - defaults
dependencies:
  - ca-certificates=2021.10.26=haa95532_2
  - certifi=2021.10.8=py38haa95532_0
  - openssl=1.1.1l=h2bbff1b_0
  - pip=21.0.1=py38haa95532_0
  - python=3.8.12=h6244533_0
  - setuptools=58.0.4=py38haa95532_0
  - sqlite=3.36.0=h2bbff1b_0
  - vc=14.2=h21ff451_1
  - vs2015_runtime=14.27.29016=h5e58377_2
  - wheel=0.37.0=pyhd3eb1b0_1
  - wincertstore=0.2=py38haa95532_2
  - pip:
    - argon2-cffi==21.1.0
    - attrs==21.2.0
    - backcall==0.2.0
    - bleach==4.1.0
    - cffi==1.15.0
    - colorama==0.4.4
    - cycler==0.11.0
    - debugpy==1.5.1
    - decorator==5.1.0
    - defusedxml==0.7.1
    - entrypoints==0.3
    - ipykernel==6.5.0
    - ipython==7.29.0
    - ipython-genutils==0.2.0
    - ipywidgets==7.6.5
    - jedi==0.18.0
    - jinja2==3.0.2
    - jsonschema==4.1.2
    - jupyter==1.0.0
    - jupyter-client==7.0.6
    - jupyter-console==6.4.0
    - jupyter-core==4.9.1
    - jupyterlab-pygments==0.1.2
    - jupyterlab-widgets==1.0.2
    - kiwisolver==1.3.2
    - markupsafe==2.0.1
    - matplotlib==3.4.3
    - matplotlib-inline==0.1.3
    - mistune==0.8.4
    - nbclient==0.5.4
    - nbconvert==6.2.0
    - nbformat==5.1.3
    - nest-asyncio==1.5.1
    - notebook==6.4.5
    - numpy==1.21.2
    - packaging==21.2
    - pandocfilters==1.5.0
    - parso==0.8.2
    - pickleshare==0.7.5
    - pillow==8.4.0
    - prometheus-client==0.12.0
    - prompt-toolkit==3.0.21
    - pycparser==2.20
    - pygments==2.10.0
    - pyparsing==2.4.7
    - pyrsistent==0.18.0
    - python-dateutil==2.8.2
    - pywin32==302
    - pywinpty==1.1.5
    - pyzmq==22.3.0
    - qtconsole==5.1.1
    - qtpy==1.11.2
    - scipy==1.7.1
    - send2trash==1.8.0
    - six==1.16.0
    - terminado==0.12.1
    - testpath==0.5.0
    - tornado==6.1
    - traitlets==5.1.1
    - wcwidth==0.2.5
    - webencodings==0.5.1
    - widgetsnbextension==3.5.2

Now, if I run in basic_conda_env both in debug mode, running to terminal, or in Jupyter notewbook I see numpy delivering the wrong output (the matrix with some nan elements). Each of them solved with one iteration as I mentioned before. The number of times it fails is not the same if I run in jupyter notebook as in the other cases… But for each cases, the number of failures does not seems to be changing at all!

If I run with the basic_pip_env it does not fail at all in any case!!

In the middle of this, I have reinstalled VS Code, I updated my conda base environment and have to downgrade due to reported issues. I mean, I changed everything by with this new environments I am getting the same deterministic results every time.

So, what can I try? Is there a problem with MKL? Have no idea what this is.

Besides this annoying different unsuspected behaviour, is it ok that numpy returns a matrix with only some elements as nan when calling the inverse?
I don't think it would be correct, since if due to numeric reasons numpy cant provide an inverse, no element in the matrix should be numeric type. Otherwise, in all the scripts where we calculate an inverse we should need some evaluation as “(x==np.nan).any() or (x==np.nan+np.nan*1j).any()”. In case we calculate inverse and just pick some element (or part) of the inverse matrix we would be using an erroneous value. If we multiply to any other matrix or vector we get all nan output and this cases is covered but not the one I pointed out before.

@leobiec
Copy link
Author

leobiec commented Nov 4, 2021

I used an other version of numpy also 1.21.2, the same in both environments

@seberg
Copy link
Member

seberg commented Nov 4, 2021

You could try creating an issue with Anaconda, maybe they at least know where to look. What your tries make almost fully certain is what we suspsected: The problem has to do with MKL (which is the Intel linear algebra library – or well a library which includes the linear algebra stuff). If you install with pip, you would get OpenBLAS (a different implementation). There should also be ways to remove MKL from your conda environment (or install OpenBLAS to choose that NumPy should use that instead).

@bashtage
Copy link
Contributor

bashtage commented Nov 4, 2021

Edit: Looks like you did. Do you get identical results on the pip version as you do the conda version?

@leobiec
Copy link
Author

leobiec commented Nov 4, 2021

Edit: Looks like you did. Do you get identical results on the pip version as you do the conda version?

I have tested the complete script output (easier for me) and I find a maximum relative error of 6.763964020728518e-15 (difference relative to the maximum absolute value of the final output matrix). I think (without any analysis) this might be in the order of the numeric type precision.

There is nothing random in the calculus and I double checked the difference does not change from one run to the next.

@leobiec
Copy link
Author

leobiec commented Nov 4, 2021

@seberg, @bashtage

What can you tell me about the numpy output with only some nan elements in the matrix? is this ok or some validation must be added in numpy.linalg.inv?

@bashtage
Copy link
Contributor

bashtage commented Nov 4, 2021

AFAICT you shouldn't have any NaN values, and their presence suggests a bug somewhere. If you don't get NaN in your pip install, then this is further evidence of a bug in MKL which provides the routines that are used to invert the array when you use the default conda forge install.

@leobiec
Copy link
Author

leobiec commented Nov 4, 2021

You could try creating an issue with Anaconda, maybe they at least know where to look. What your tries make almost fully certain is what we suspsected: The problem has to do with MKL (which is the Intel linear algebra library – or well a library which includes the linear algebra stuff). If you install with pip, you would get OpenBLAS (a different implementation). There should also be ways to remove MKL from your conda environment (or install OpenBLAS to choose that NumPy should use that instead).

MKL unexpected behaviour in numpy.linalg.inv returning different values in successive calls with same input data #conda/conda#11023

@bnavigator
Copy link
Contributor

bnavigator commented Nov 4, 2021

I am seeing similar behavior on our GitHub Actions CI, which also uses conda/MKL and it drives me crazy. Sometimes the jobs succeed, sometimes they fail.

E.g. this test:

A = np.array([[-0.01+0.j,  0.  +0.j],
              [ 1.  +0.j,  0.  +0.j]])
B = np.array([0., 1.])
q = np.linalg.solve(A, B)

which should raise LinAlgError, but does not.

or this one:

A = np.array([[ 2.85933384+50.j, -0.02684555 +0.j,  0.23068221 +0.j],
              [-1.8278323  +0.j,  1.58857243+50.j,  0.87674723 +0.j],
              [-1.13524527 +0.j, -0.1945487  +0.j,  0.88113606+50.j]])
B = np.array([[-0.08275233],
              [-2.84057541],
              [-0.        ]])
q = np.linalg.solve(A, B)

should not return all nan, but does.

@ilayn
Copy link
Contributor

ilayn commented Nov 4, 2021

Starting to smell like #16744 all over again.

@seberg
Copy link
Member

seberg commented Nov 4, 2021

Hmmm, if @bnavigator's issue is the same (and it looks very similar), then it is not windows specific, though! Do we know if the machines it fails on share e.g. certain CPU features/instruction sets? The end of np.show_config() may help a bit with guessing that.

@leobiec a bit of a long-shot, but can you reproduce this reliably if you repeat the operation in a loop (and it will eventually crash?)

@bnavigator
Copy link
Contributor

bnavigator commented Nov 4, 2021

? The end of np.show_config() may help a bit with guessing that.

Run source $CONDA/bin/activate test-environment
  source $CONDA/bin/activate test-environment
  # Use xvfb-run instead of pytest-xvfb to get proper mpl backend
  # Use coverage instead of pytest-cov to get .coverage file
  # See https://github.com/python-control/python-control/pull/504
  
  python3 -c 'import numpy; numpy.show_config()'
  
  xvfb-run --auto-servernum coverage run -m pytest control/tests
  shell: /usr/bin/bash -e {0}
  env:
    PYTHON_CONTROL_ARRAY_AND_MATRIX: 0
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/usr/share/miniconda/envs/test-environment/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/share/miniconda/envs/test-environment/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/usr/share/miniconda/envs/test-environment/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/share/miniconda/envs/test-environment/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/usr/share/miniconda/envs/test-environment/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/share/miniconda/envs/test-environment/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/usr/share/miniconda/envs/test-environment/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/usr/share/miniconda/envs/test-environment/include']
============================= test session starts ==============================
platform linux -- Python 3.6.13, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/runner/work/python-control/python-control, configfile: setup.cfg
plugins: timeout-1.4.2
collected 2758 items

Now strangely enough. This particular job did NOT fail.

@seberg
Copy link
Member

seberg commented Nov 4, 2021

My bad, I was aiming for the CPU flags that NumPy use, but that information was added only recently to show_config. This is what I was going for:

from numpy.core._multiarray_umath import (
        __cpu_features__, __cpu_baseline__, __cpu_dispatch__)

features_found, features_not_found = [], []
for feature in __cpu_dispatch__:
    if __cpu_features__[feature]:
        features_found.append(feature)
    else:
        features_not_found.append(feature)

print("Supported SIMD extensions in this NumPy install:")
print("    baseline = %s" % (','.join(__cpu_baseline__)))
print("    found = %s" % (','.join(features_found)))
print("    not found = %s" % (','.join(features_not_found)))

but maybe someone else has a better approach... If we know that it reproduces reliably with very minimal code, we could also see if we can reproduce it in C. That would make it certain that the issue is in MKL and not some interplay.

@bnavigator
Copy link
Contributor

Another Github Actions job:

  blas               pkgs/main/linux-64::blas-1.0-mkl
  mkl                pkgs/main/linux-64::mkl-2021.4.0-h06a4308_640
  mkl-service        pkgs/main/linux-64::mkl-service-2.4.0-py39h7f8727e_0
  mkl_fft            pkgs/main/linux-64::mkl_fft-1.3.1-py39hd3c417c_0
  mkl_random         pkgs/main/linux-64::mkl_random-1.2.2-py39h51133e4_0
  numpy              pkgs/main/linux-64::numpy-1.21.2-py39h20f2e39_0
  numpy-base         pkgs/main/linux-64::numpy-base-1.21.2-py39h79a1101_0
...

Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX
    not found = AVX512_KNL,AVX512_KNM,AVX512_CNL

This one returned

----------------------------- Captured stdout call -----------------------------
DEBUG np.linalg.solve(
,[[ 0.01+0.j  0.  +0.j]
 [-1.  +0.j  0.  +0.j]]
[[1.]
 [0.]] = 
[[nan+nanj]
 [nan+nanj]]
___________________

where it should have raised LinAlgError.

@bnavigator
Copy link
Contributor

BTW trying to disable mkl in conda with conda install nomkl produces gems like

>       w = _umath_linalg.eigvals(a, signature=signature, extobj=extobj)
E       ValueError: On entry to DHSEQR parameter number 4 had an illegal value

/usr/share/miniconda/envs/test-environment/lib/python3.9/site-packages/numpy/linalg/linalg.py:1068: ValueError

@seberg
Copy link
Member

seberg commented Nov 4, 2021

@bnavigator uhoh, that sounds like something is going wrong earlier... This also happens randomly, I guess?

@bnavigator
Copy link
Contributor

It happens at every job I tried so far: bnavigator/python-control#2

@mattip
Copy link
Member

mattip commented Nov 4, 2021

When you disable MKL, what do you get instead, and what version of that other thing?

@mattip
Copy link
Member

mattip commented Nov 4, 2021

Ahh, I see libopenblas-0.3.13. That is unfortunate, since the latest OpenBLAS release is 0.3.18, and some maybe relevant issues were fixed in the newer version

@bnavigator
Copy link
Contributor

That's what conda is pulling in: https://anaconda.org/anaconda/openblas

  blas               pkgs/main/linux-64::blas-1.0-openblas
  libopenblas        pkgs/main/linux-64::libopenblas-0.3.13-h4367d64_0
  nomkl              pkgs/main/linux-64::nomkl-3.0-0
  numpy              pkgs/main/linux-64::numpy-1.21.2-py39hd8d4704_0
  numpy-base         pkgs/main/linux-64::numpy-base-1.21.2-py39h2b8c604_0
  scipy              pkgs/main/linux-64::scipy-1.7.1-py39hc65b3f8_2
a = array([[-4., -3.],
       [ 1.,  0.]])

    @array_function_dispatch(_unary_dispatcher)
    def eigvals(a):
        """
        Compute the eigenvalues of a general matrix.
        """
....

>       w = _umath_linalg.eigvals(a, signature=signature, extobj=extobj)
E       ValueError: On entry to DHSEQR parameter number 4 had an illegal value

/usr/share/miniconda/envs/test-environment/lib/python3.9/site-packages/numpy/linalg/linalg.py:1068: ValueError

@bnavigator
Copy link
Contributor

Conda seems utterly broken to me.

@bashtage
Copy link
Contributor

bashtage commented Nov 4, 2021

Can you try pinning MKL to something like mkl=2021.1.1 or even older, mkl=2020.2? Curious if this is a new MKL issue.

@leobiec
Copy link
Author

leobiec commented Nov 4, 2021

Hmmm, if @bnavigator's issue is the same (and it looks very similar), then it is not windows specific, though! Do we know if the machines it fails on share e.g. certain CPU features/instruction sets? The end of np.show_config() may help a bit with guessing that.

@leobiec a bit of a long-shot, but can you reproduce this reliably if you repeat the operation in a loop (and it will eventually crash?)

Hi @seberg. About 5 times i have ran my complete script with each conda and pip environment. I don't see nothing random. It seems that my code is getting to the point of failure with the same input all the times and making it crash the first time.

Please do not forget that the inmediate call again to linalg.inv with the same input does return an other value.

@bashtage
Copy link
Contributor

bashtage commented Nov 4, 2021

I can't reproduce locally on either Windows or Linux with MKL 2021.4 and NumPy 1.21.2. Probably need to report CPU version so see if it matters.

@bnavigator
Copy link
Contributor

bnavigator commented Nov 4, 2021

@bashtage
Copy link
Contributor

bashtage commented Nov 4, 2021

That OpenBlas was 0.3.17, so quite a bit newer than earlier.

@bnavigator
Copy link
Contributor

Yeah, they probably don't care about openblas in the main conda much because MKL is the default.

@bashtage
Copy link
Contributor

bashtage commented Nov 4, 2021

@bnavigator Your conda-forge MKL build is actually using openblas for some reason.

@bnavigator
Copy link
Contributor

Thanks for noticing @bashtage. Fixed now. I now remember why setting up the CI for Slycot was such a convoluted task.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants