BUG: numpy.linalg.inv returning different values on consecutive calls with some nan elements #20233
Comments
I just found that calling it a second time does not always solve the problem. I added the following code for the moment, and I found that at least once it had to be calculated twice to get a non-nan result.
|
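The workaround code from the comment above did not survive in this copy of the thread. A minimal sketch of what such a retry could look like follows; `inv_with_retry` and `max_tries` are hypothetical names chosen for illustration, not the poster's actual code, and this only masks the symptom rather than fixing the cause.

```python
import numpy as np


def inv_with_retry(a, max_tries=3):
    """Hypothetical helper: call np.linalg.inv again while the result contains NaN."""
    a = np.asarray(a, dtype=float)
    result = np.linalg.inv(a)
    tries = 1
    # On a healthy build the first result is NaN-free and this loop never runs.
    while np.isnan(result).any() and tries < max_tries:
        result = np.linalg.inv(a)
        tries += 1
    return result, tries


a = np.array([[4.0, 7.0], [2.0, 6.0]])
a_inv, tries = inv_with_retry(a)
print("calls needed:", tries)
print("a @ a_inv is the identity:", np.allclose(a @ a_inv, np.eye(2)))
```

Returning the number of calls alongside the result makes it possible to log how often the retry was actually needed.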
@leobiec can you say how NumPy was installed, what OS you are using? giving the output of E.g. is this using openblas, MKL, threaded? We then should try things like running with The matrix does have a very large condition-number and the singular values are all rather small, but I do get pretty much the same result, so I doubt that is the issue? |
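For readers who want to check the conditioning of their own matrix, here is a rough sketch using `np.linalg.cond` and `np.linalg.svd`. The matrix from the report is not available in this copy, so a 10x10 Hilbert matrix is used as a stand-in ill-conditioned example; that choice is an assumption made purely for illustration.

```python
import numpy as np

# Stand-in ill-conditioned matrix (Hilbert matrix), NOT the matrix from the report.
n = 10
i = np.arange(1, n + 1)
hilbert = 1.0 / (i[:, None] + i[None, :] - 1)

# Singular values are returned in descending order.
singular_values = np.linalg.svd(hilbert, compute_uv=False)
# Default 2-norm condition number: largest / smallest singular value.
cond = np.linalg.cond(hilbert)

print(f"largest singular value:  {singular_values[0]:.3e}")
print(f"smallest singular value: {singular_values[-1]:.3e}")
print(f"condition number:        {cond:.3e}")
```

A large condition number explains loss of accuracy in the inverse, but by itself it does not explain two calls with identical input returning different results.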
Hi @seberg, Besides the matrix condition, I think different results in successive calls should not be expected.
The script returned:
and how can I do this:
I don't have any problem setting up a Google Meet to show you what is happening, if it helps, and to figure out what we can log |
Well, it looks like you are using anaconda with MKL, so if there is an issue it is almost certainly with MKL. I am not an MKL specialist although someone else here may have an idea. Especially if you are threading yourself, trying to set the number of threads to 1 in mkl (e.g. using |
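The exact command suggested above is cut off in this copy. One common way to force single-threaded BLAS, which may or may not be what was meant, is to set the thread-count environment variables before NumPy is imported. The sketch below assumes the library reads these variables when it is first loaded.

```python
import os

# Assumption: these variables are read when the BLAS library is first loaded,
# so they must be set before the first `import numpy` in the process.
for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np

a = np.array([[2.0, 1.0], [1.0, 3.0]])
print(np.linalg.inv(a))
```

If the nan results disappear with a single thread, that points toward a threading-related problem in the BLAS/LAPACK backend rather than in the calling code.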
I get the same with MKL + Ryzen. You might have a hardware problem. You might try running a memory test. |
Running a memory test could be interesting. My list of things to try would be:
Switching to openblas is probably the easiest, promising, "hot fix" candidate, plus if problems persist we really know something else is very wrong. |
Hi everyone. @bashtage, thanks for your collaboration. I do get the same result if I run the example code I gave in the issue. As I mentioned, the code is not self-contained. But when it is run in the main script (in debug mode or not) I get the different results. I think something inside numpy is changing in the middle of my code, and changing again in the first call. Is there a way to save the complete current workspace state? Or is there another way I can show you the failure, since the code is not self-contained? I am not doing any multi-threading in the code. Whether the modules or the conda environment are doing it, I do not know. @seberg do you still think it would be useful to switch to openblas? I don't know how to do this and it will take a bit of time to figure out. |
IME this is precisely what I would expect if it was something deep like a hardware problem. Things like physical memory are all over the place and would affect python in any environment. Do you see the same if you install all packages using pip (this will use OpenBLAS)? Basically
then run your script.
IMO the only way something internal is changing is if memory is corrupt so that it is changing effectively at random. |
So, I ran a memory test and nothing wrong was found. I created two environments, both with conda create (I don't know if that is OK or if I should have created a pip virtual env), but in one of them I installed with pip and in the other with conda.
and
Now, if I run in basic_conda_env, whether in debug mode, running in the terminal, or in a Jupyter notebook, I see numpy delivering the wrong output (the matrix with some nan elements). Each of them is solved with one iteration, as I mentioned before. The number of times it fails is not the same in the Jupyter notebook as in the other cases. But for each case, the number of failures does not seem to change at all! If I run with basic_pip_env it does not fail at all in any case!! In the middle of this, I reinstalled VS Code, updated my conda base environment, and had to downgrade due to reported issues. I mean, I changed everything, but with these new environments I am getting the same deterministic results every time. So, what can I try? Is there a problem with MKL? I have no idea what this is. Besides this annoying, unexpected difference in behaviour, is it OK that numpy returns a matrix with only some elements as nan when calling the inverse? |
I also used another version of numpy, 1.21.2, the same in both environments |
You could try creating an issue with Anaconda, maybe they at least know where to look. What your attempts make almost certain is what we suspected: the problem has to do with MKL (which is the Intel linear algebra library, or rather a library which includes the linear algebra stuff). If you install with |
Edit: Looks like you did. Do you get identical results on the pip version as you do the conda version? |
I have tested the complete script output (easier for me) and I find a maximum relative error of 6.763964020728518e-15 (difference relative to the maximum absolute value of the final output matrix). I think (without any analysis) this might be on the order of the precision of the numeric type. There is nothing random in the calculation, and I double checked that the difference does not change from one run to the next. |
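To put that number in context, it can be compared with the float64 machine epsilon from `np.finfo`. The sketch below assumes the outputs are float64 arrays; `max_relative_difference` is a hypothetical helper that follows the definition given in the comment (largest absolute difference divided by the largest absolute value).

```python
import numpy as np


def max_relative_difference(x, y):
    """Largest absolute difference, relative to the largest absolute value in x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.max(np.abs(x - y)) / np.max(np.abs(x))


eps = np.finfo(np.float64).eps  # about 2.2e-16 for float64
reported = 6.763964020728518e-15  # value quoted in the comment above

print(f"float64 eps: {eps:.3e}")
print(f"reported difference in units of eps: {reported / eps:.1f}")
```

The reported difference works out to a few tens of machine epsilons, which is plausible accumulated rounding error between two different BLAS implementations.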
AFAICT you shouldn't have any |
MKL unexpected behaviour in numpy.linalg.inv returning different values in successive calls with same input data #conda/conda#11023 |
I am seeing similar behavior on our GitHub Actions CI, which also uses conda/MKL, and it drives me crazy. Sometimes the jobs succeed, sometimes they fail. E.g. this test:
A = np.array([[-0.01+0.j, 0. +0.j],
              [ 1. +0.j, 0. +0.j]])
B = np.array([0., 1.])
q = np.linalg.solve(A, B)
which should raise or this one:
A = np.array([[ 2.85933384+50.j, -0.02684555 +0.j, 0.23068221 +0.j],
              [-1.8278323 +0.j, 1.58857243+50.j, 0.87674723 +0.j],
              [-1.13524527 +0.j, -0.1945487 +0.j, 0.88113606+50.j]])
B = np.array([[-0.08275233],
              [-2.84057541],
              [-0. ]])
q = np.linalg.solve(A, B)
should not return all |
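The two expectations in this comment can be written as explicit checks. The name of the expected exception is missing from this copy; the sketch assumes it is `np.linalg.LinAlgError`, which is what `np.linalg.solve` is documented to raise for a singular matrix. The variable names `A2`, `B2`, and `raised` are introduced here for illustration.

```python
import numpy as np

# First case: the second column is all zeros, so A is exactly singular.
A = np.array([[-0.01 + 0.j, 0. + 0.j],
              [1. + 0.j, 0. + 0.j]])
B = np.array([0., 1.])
try:
    np.linalg.solve(A, B)
    raised = False
except np.linalg.LinAlgError:  # assumed to be the exception the comment meant
    raised = True
print("singular case raised LinAlgError:", raised)

# Second case: strongly diagonally dominant, so a finite solution is expected.
A2 = np.array([[2.85933384 + 50.j, -0.02684555 + 0.j, 0.23068221 + 0.j],
               [-1.8278323 + 0.j, 1.58857243 + 50.j, 0.87674723 + 0.j],
               [-1.13524527 + 0.j, -0.1945487 + 0.j, 0.88113606 + 50.j]])
B2 = np.array([[-0.08275233],
               [-2.84057541],
               [-0.]])
q = np.linalg.solve(A2, B2)
print("solution contains nan:", bool(np.isnan(q).any()))
print("residual is small:", np.allclose(A2 @ q, B2))
```

On a build that behaves correctly, the first case raises and the second returns a finite solution; the report above is that conda/MKL jobs intermittently do neither.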
Starting to smell like #16744 all over again. |
Hmmm, if @bnavigator's issue is the same (and it looks very similar), then it is not windows specific, though! Do we know if the machines it fails on share e.g. certain CPU features/instruction sets? The end of @leobiec a bit of a long-shot, but can you reproduce this reliably if you repeat the operation in a loop (and it will eventually crash?) |
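A rough sketch of the loop test suggested here: invert the same input many times and count how often the result disagrees with the first call. The random 5x5 matrix is a placeholder assumption, since the matrix from the report is not available; substitute the input that actually fails.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so every run uses the same input
a = rng.standard_normal((5, 5))  # placeholder input, not the reported matrix

reference = np.linalg.inv(a)
mismatches = 0
for _ in range(1000):
    result = np.linalg.inv(a)
    # equal_nan=True: a nan in the same position in both arrays counts as equal,
    # so only results that really differ from the first call are counted.
    if not np.allclose(result, reference, equal_nan=True):
        mismatches += 1

print("calls that differed from the first result:", mismatches)
print("first result contains nan:", bool(np.isnan(reference).any()))
```

A non-zero mismatch count with a fixed input would give a small, self-contained reproducer to attach to a bug report.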
Now strangely enough. This particular job did NOT fail. |
My bad, I was aiming for the CPU flags that NumPy uses, but that information was added only recently to
but maybe someone else has a better approach... If we know that it reproduces reliably with very minimal code, we could also see if we can reproduce it in C. That would make it certain that the issue is in MKL and not some interplay. |
This one returned
where it should have raised |
BTW trying to disable mkl in conda with
|
@bnavigator uhoh, that sounds like something is going wrong earlier... This also happens randomly, I guess? |
It happens at every job I tried so far: bnavigator/python-control#2 |
When you disable MKL, what do you get instead, and what version of that other thing? |
Ahh, I see libopenblas-0.3.13. That is unfortunate, since the latest OpenBLAS release is 0.3.18, and some possibly relevant issues were fixed in the newer versions |
That's what conda is pulling in: https://anaconda.org/anaconda/openblas
a = array([[-4., -3.],
[ 1., 0.]])
@array_function_dispatch(_unary_dispatcher)
def eigvals(a):
"""
Compute the eigenvalues of a general matrix.
"""
....
> w = _umath_linalg.eigvals(a, signature=signature, extobj=extobj)
E ValueError: On entry to DHSEQR parameter number 4 had an illegal value
/usr/share/miniconda/envs/test-environment/lib/python3.9/site-packages/numpy/linalg/linalg.py:1068: ValueError |
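For comparison, this is what the same call should produce on a working build. The matrix is copied from the traceback above; its characteristic polynomial is x**2 + 4*x + 3 = (x + 1)(x + 3), so the eigenvalues are -3 and -1.

```python
import numpy as np

# Matrix taken from the failing test in the traceback.
a = np.array([[-4., -3.],
              [1., 0.]])

# Sort so the comparison does not depend on the order LAPACK returns.
w = np.sort(np.linalg.eigvals(a))
print(w)
```

A `ValueError` about an illegal DHSEQR parameter for such a small, well-behaved input suggests a broken or mismatched LAPACK library rather than a problem with the matrix.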
Conda seems utterly broken to me. |
Can you try pinning MKL to something like |
Hi @seberg. I have run my complete script about 5 times with each of the conda and pip environments. I don't see anything random. It seems that my code gets to the point of failure with the same input every time, making it crash the first time. Please do not forget that an immediate second call to linalg.inv with the same input does return a different value. |
I can't reproduce locally on either Windows or Linux with MKL 2021.4 and NumPy 1.21.2. We probably need the CPU version reported to see if it matters. |
|
That OpenBLAS was 0.3.17, so quite a bit newer than the earlier one. |
Yeah, they probably don't care about openblas in the main conda much because MKL is the default. |
@bnavigator Your conda-forge MKL build is actually using openblas for some reason. |
Thanks for noticing @bashtage. Fixed now. I now remember why setting up the CI for Slycot was such a convoluted task. |
Describe the issue:
I have a matrix inverse calculation in the middle of my code (a module developed by me), and while debugging I found that numpy could not provide an inverse matrix. Also, only a few elements were reported as nan (I don't know if this is expected or if the complete matrix should be nan).
Anyway, if I stop the calculation at this point and call the inverse again, it does provide an inverse matrix. It seems that something is changed in the numpy module in the first call and not restored, so on the next calls I get a different output.
The code I provide as an example does not reproduce the problem standalone. I tried executing it at the beginning of my complete code and it does provide the same result. So, something earlier in my code has an effect on how numpy is configured by the time it reaches the place where it fails.
Sorry, I can't provide the complete code and libraries. I don't know how I can provide standalone code to reproduce this failure. Can I help with debugging this issue in another way?
Regards
Leonardo
Reproduce the code example:
Error message:
No response
NumPy/Python version information:
numpy version 1.20.1