polyfit and eig regression tests fail after Windows 10 update to 2004 #16744

Closed
Caiptain1 opened this issue Jul 3, 2020 · 182 comments
Labels
00 - Bug 32 - Installation Problems installing or compiling NumPy

Comments

@Caiptain1

Tests are failing:
FAILED ....\lib\tests\test_regression.py::TestRegression::test_polyfit_build - numpy.linalg.LinAlgError: SVD did not...
FAILED ....\linalg\tests\test_regression.py::TestRegression::test_eig_build - numpy.linalg.LinAlgError: Eigenvalues ...
FAILED ....\ma\tests\test_extras.py::TestPolynomial::test_polyfit - numpy.linalg.LinAlgError: SVD did not converge i...

with exceptions:

err = 'invalid value', flag = 8
    def _raise_linalgerror_lstsq(err, flag):
>       raise LinAlgError("SVD did not converge in Linear Least Squares")
E       numpy.linalg.LinAlgError: SVD did not converge in Linear Least Squares
err        = 'invalid value'
flag       = 8

and

err = 'invalid value', flag = 8
    def _raise_linalgerror_eigenvalues_nonconvergence(err, flag):
>       raise LinAlgError("Eigenvalues did not converge")
E       numpy.linalg.LinAlgError: Eigenvalues did not converge
err        = 'invalid value'
flag       = 8

Steps taken:

  • Create a VM
  • Install latest Windows 10 and update to the latest version 2004 (10.0.19041)
  • Install Python 3.8.3
  • pip install pytest
  • pip install numpy
  • pip install hypothesis
  • run tests in the package

The same issue happens when running the tests in the repository.

Version 1.19.0 of numpy

Am I missing any dependencies? Or is it just Windows going bonkers?

@charris charris added this to the 1.19.1 release milestone Jul 3, 2020
@bashtage
Contributor

bashtage commented Jul 3, 2020

Edit: You are obviously using pip. I have also had a strange result on Windows AMD64 in the recent past with linear algebra libraries and eigenvalue decompositions (in the context of running tests for statsmodels).

If you have time, try using 32 bit Python and pip and see if you get the same issues? I couldn't see them on 32-bit windows but they were repeatable on 64-bit windows.

If I use conda, which ships MKL, I don't have the errors.

Edit: I also see them when using NumPy 1.18.5 on Windows AMD64.

@charris
Member

charris commented Jul 3, 2020

tests fail after latest Windows 10 update

Were the tests failing before the update?

@speixoto

speixoto commented Jul 6, 2020

No @charris , before the update the test suite just passes.

@bashtage
Contributor

bashtage commented Jul 6, 2020

@speixoto Do you know which update it was specifically? I'd be interested to see if it solves my issue with pip-installed wheels.

@speixoto

speixoto commented Jul 6, 2020

Update 1809 (10.0.17763) was not causing any failed test @bashtage

@Caiptain1
Author

1909 wasn't causing anything as well. It only started happening with 2004.

@bashtage
Contributor

bashtage commented Jul 6, 2020

I'm not 100% convinced it is 2004 or something after. I think 2004 worked.

FWIW I can't reproduce these crashes on CI (Azure or appveyor), but I do see them locally on 2 machines that are both AMD64 and updated to 2004.

Have either of you tried to see if you get them on 32-bit Python?

@charris
Member

charris commented Jul 6, 2020

Seems there have been a number of problems reported against the 2004 update. Maybe this should be reported also?

@bashtage
Contributor

bashtage commented Jul 7, 2020

I just ran the following on fresh install of 1909 and 2004:

pip install numpy scipy pandas pytest cython
git clone https://github.com/statsmodels/statsmodels.git
cd statsmodels
pip install -e . --no-build-isolation
pytest statsmodels

On 1909 no failures. On 2004 30 failures all related to linear algebra functions.

When I run tests on 2004 in a debugger, I notice that the first call to a function often returns an incorrect result, but calling it again produces the correct result (which remains correct on repeated calls). Not sure if this is useful information for guessing a cause.

@charris
Member

charris commented Jul 7, 2020

Do earlier versions of NumPy also have problems? I assume you are running 1.19.0.

@charris charris changed the title polyfit and eig regretion tests fail after latest Windows 10 update polyfit and eig regression tests fail after latest Windows 10 update Jul 8, 2020
@bashtage
Contributor

bashtage commented Jul 8, 2020

Using pip + 1.18.4, and scipy 1.4.1, I get the same set of errors.

These are really common:

ERROR statsmodels/graphics/tests/test_gofplots.py::TestProbPlotLongely::test_ppplot - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/graphics/tests/test_gofplots.py::TestProbPlotLongely::test_qqplot_other_array - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/graphics/tests/test_gofplots.py::TestProbPlotLongely::test_probplot - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/graphics/tests/test_regressionplots.py::TestPlot::test_plot_leverage_resid2 - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/regression/tests/test_regression.py::TestOLS::test_params - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/regression/tests/test_regression.py::TestOLS::test_scale - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/regression/tests/test_regression.py::TestOLS::test_ess - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/regression/tests/test_regression.py::TestOLS::test_mse_total - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/regression/tests/test_regression.py::TestOLS::test_bic - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/regression/tests/test_regression.py::TestOLS::test_norm_resids - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/regression/tests/test_regression.py::TestOLS::test_HC2_errors - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/regression/tests/test_regression.py::TestOLS::test_missing - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/regression/tests/test_regression.py::TestOLS::test_norm_resid_zero_variance - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/tsa/tests/test_stattools.py::TestADFConstant::test_teststat - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/tsa/tests/test_stattools.py::TestADFConstant::test_pvalue - numpy.linalg.LinAlgError: SVD did not converge
ERROR statsmodels/tsa/tests/test_stattools.py::TestADFConstant::test_critvalues - numpy.linalg.LinAlgError: SVD did not converge

@bashtage
Contributor

bashtage commented Jul 8, 2020

When I run using 1.18.5 + MKL I get no errors. It is hard to say whether this is likely an OpenBLAS bug or a Windows bug. Probably the latter, but it will be really hard to get to and diagnosing is beyond my capabilities for low-level debugging.

On the same physical machine, when I run in Ubuntu using pip packages I don't see any errors.

@bashtage
Contributor

bashtage commented Jul 8, 2020

This is the strangest behavior. It fails on the first call and works on the second and subsequent calls. One other hard-to-understand behavior is that if I run the failing test in isolation, then I don't see the error.

[Image: svd]

@bashtage
Contributor

bashtage commented Jul 8, 2020

One final update: if I test using a source build of NumPy without optimized BLAS I still see errors although they are not an identical set.

@mattip
Member

mattip commented Jul 8, 2020

Maybe worth pinging the OpenBLAS devs. Does it happen with float32 as often as float64?

@bashtage
Contributor

bashtage commented Jul 8, 2020

When I run a full test of NumPy 1.19.0, python -c "import numpy;numpy.test('full')" I see the same errors as above:

FAILED Python/Python38/Lib/site-packages/numpy/lib/tests/test_regression.py::TestRegression::test_polyfit_build - numpy.linalg.LinAlgError: SVD did not conv...
FAILED Python/Python38/Lib/site-packages/numpy/linalg/tests/test_regression.py::TestRegression::test_eig_build - numpy.linalg.LinAlgError: Eigenvalues did n...
FAILED Python/Python38/Lib/site-packages/numpy/ma/tests/test_extras.py::TestPolynomial::test_polyfit - numpy.linalg.LinAlgError: SVD did not converge in Lin...

@Caiptain1
Author

If I remember correctly, running that test on its own passes, which makes the behavior even stranger.

@bashtage
Contributor

bashtage commented Jul 8, 2020

I have filed with Microsoft the only way I know how:

https://aka.ms/AA8xe7q

Posting in case others find this through search:

Windows users should use Conda/MKL if on 2004 until this is resolved

@bashtage
Contributor

bashtage commented Jul 8, 2020

Here is a small reproducing example:

import numpy as np
a = np.arange(13 * 13, dtype=np.float64)
a.shape = (13, 13)
a = a % 17
va, ve = np.linalg.eig(a)

Produces

 ** On entry to DGEBAL parameter number  3 had an illegal value
 ** On entry to DGEHRD  parameter number  2 had an illegal value
 ** On entry to DORGHR DORGQR parameter number  2 had an illegal value
 ** On entry to DHSEQR parameter number  4 had an illegal value
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-11-bad305f0dfc7> in <module>
      3 a.shape = (13, 13)
      4 a = a % 17
----> 5 va, ve = np.linalg.eig(a)

<__array_function__ internals> in eig(*args, **kwargs)

c:\anaconda\envs\py-pip\lib\site-packages\numpy\linalg\linalg.py in eig(a)
   1322         _raise_linalgerror_eigenvalues_nonconvergence)
   1323     signature = 'D->DD' if isComplexType(t) else 'd->DD'
-> 1324     w, vt = _umath_linalg.eig(a, signature=signature, extobj=extobj)
   1325
   1326     if not isComplexType(t) and all(w.imag == 0.0):

c:\anaconda\envs\py-pip\lib\site-packages\numpy\linalg\linalg.py in _raise_linalgerror_eigenvalues_nonconvergence(err, flag)
     92
     93 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag):
---> 94     raise LinAlgError("Eigenvalues did not converge")
     95
     96 def _raise_linalgerror_svd_nonconvergence(err, flag):

LinAlgError: Eigenvalues did not converge

Does LAPACK count from 0 or 1? All of the illegal values appear to be integers:
DGEBAL
DGEHRD
DORGHR
DHSEQR
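On the earlier question of whether float32 fails as often as float64, one way to check is to run the reproducer above once per dtype. This is only a sketch: `eig_status` is a hypothetical helper name, and nothing here predicts what an affected machine would print.

```python
import numpy as np


def eig_status(dtype):
    """Run the 13x13 reproducer for one dtype and report what happened."""
    a = np.arange(13 * 13, dtype=dtype).reshape(13, 13)
    a = a % 17  # same remainder step as in the reproducer
    try:
        np.linalg.eig(a)
    except np.linalg.LinAlgError as exc:
        return f"failed: {exc}"
    return "ok"


for dtype in (np.float32, np.float64):
    print(dtype.__name__, eig_status(dtype))
```

Each dtype should be tried in a fresh interpreter as well, since the thread reports that only the first call after the remainder operation fails.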

@bashtage
Contributor

bashtage commented Jul 8, 2020

It is looking more like an OpenBLAS issue (or something between 2004 and OpenBLAS). Here is my summary:

NumPy only python -c "import numpy;numpy.test('full')"

  • No optimized BLAS: Pass full
  • OpenBLAS: Fail full
  • MKL: Pass full

statsmodels testing pytest statsmodels

  • Pip NumPy and SciPy: Fail related to SVD and QR related code
  • MKL NumPy and SciPy: Pass
  • No optimized BLAS: Fail, but with fewer failures, all of which involve scipy.linalg routines, which use OpenBLAS.
  • No optimized BLAS, no SciPy linalg: Pass
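When comparing results across the configurations listed above, it helps to record which BLAS/LAPACK a given NumPy install was built against. A minimal sketch using `numpy.show_config()`, which prints the build configuration; `build_config_text` is a hypothetical wrapper that captures that output as a string so it can be attached to a report.

```python
import contextlib
import io

import numpy as np


def build_config_text():
    """Capture the text that numpy.show_config() prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


print(np.__version__)
print(build_config_text())
```

The exact layout of the output differs between NumPy versions, so it is better read by eye than parsed.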

@charris
Member

charris commented Jul 8, 2020

It would be nice to learn what changed in 2004. Maybe we need a different flag when compiling/linking the library?

@bashtage
Contributor

bashtage commented Jul 8, 2020

If it is an OpenBLAS bug, it is unlikely they will have seen it since none of the Windows-based CI are using build 19041 (aka Windows 10 2004) or later.

@charris
Member

charris commented Jul 8, 2020

Just to be clear, is it true that none of these reports involve WSL?

@bashtage
Contributor

bashtage commented Jul 8, 2020

No. All with either conda-provided python.exe or python.org provided python.exe

@carlkl
Member

carlkl commented Jul 8, 2020

Does the test fail if the environment variable OPENBLAS_CORETYPE=Haswell or OPENBLAS_CORETYPE=NEHALEM is explicitly set?

@bashtage
Contributor

bashtage commented Jul 8, 2020

I tried Atom, SandyBridge, Haswell, Prescott and Nehalem, all with identical results.
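For anyone repeating this experiment, the variable has to be in the environment before NumPy (and with it OpenBLAS) is loaded, so a fresh interpreter per core type is the safest approach. The sketch below assumes the reproducer from earlier in the thread; `run_with_coretype` is a hypothetical helper, and the core type names are simply the ones mentioned above.

```python
import os
import subprocess
import sys

# Reproducer from earlier in the thread, run in a child interpreter.
REPRODUCER = (
    "import numpy as np\n"
    "a = np.arange(13 * 13, dtype=np.float64).reshape(13, 13) % 17\n"
    "np.linalg.eig(a)\n"
)


def run_with_coretype(coretype, code=REPRODUCER):
    """Run `code` in a new interpreter with OPENBLAS_CORETYPE set; return its exit code."""
    env = dict(os.environ, OPENBLAS_CORETYPE=coretype)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode


if __name__ == "__main__":
    for coretype in ("Atom", "SandyBridge", "Haswell", "Prescott", "Nehalem"):
        # A non-zero exit code means the child raised (or crashed).
        print(coretype, run_with_coretype(coretype))
```

Setting `os.environ` inside a process that has already imported NumPy would not be a valid test, because the kernel selection has already happened by then.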

@bashtage
Contributor

bashtage commented Jul 8, 2020

The strangest thing is that if you run

import numpy as np
a = np.arange(13 * 13, dtype=np.float64)
a.shape = (13, 13)
a = a % 17
va, ve = np.linalg.eig(a)  # This will raise, so manually run the next line
va, ve = np.linalg.eig(a)

the second (and any further) calls to eig succeed.
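The same observation can be scripted so that it does not depend on re-running a line by hand. This is a sketch only; `eig_attempts` is a hypothetical helper that records the outcome of each call in order.

```python
import numpy as np


def eig_attempts(n_calls=3):
    """Call np.linalg.eig repeatedly on the reproducer matrix and record each outcome."""
    a = np.arange(13 * 13, dtype=np.float64).reshape(13, 13)
    a = a % 17  # executed once, before the first eig call
    outcomes = []
    for _ in range(n_calls):
        try:
            np.linalg.eig(a)
            outcomes.append("ok")
        except np.linalg.LinAlgError:
            outcomes.append("LinAlgError")
    return outcomes


print(eig_attempts())
```

Going by the description above, an affected system would be expected to show a failure only in the first position; a healthy one should report "ok" for every call.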

@carlkl
Member

carlkl commented Dec 7, 2020

There is a new comment from Steve Wishnousky, dated 5 Dec, on the report "fmod(), after an update to windows 2004, is causing a strange interaction with other code":

An Insider Preview Windows build with the fix is now available. Build 20270 is now in the Dev Channel: announcing-windows-10-insider-preview-build-20270. If you are affected by this issue and want to test against the fix that will come via Windows Update, you can use this build. This is an insider build with other changes included, of course, so it is not entirely representative of the version that will be serviced, but it can be used to test for specific cases where the bug used to reproduce. We are still on schedule for servicing Windows build 19041 (the current release version) at the end of January 2021.

@ahtik

ahtik commented Dec 10, 2020

Indeed, after updating to the latest Win 10 Dev Channel track (incl "Windows 10 Insider Preview 20270.1 (fe_release)"), numpy==1.19.4 no longer triggers the sanity check and works as expected.

Today this fix is available only in Dev Channel track, not yet in Beta nor Release Preview.

@bashtage
Contributor

Hopefully it will be downstream next month and this can be closed.

@h-vetinari
Contributor

I just updated from Windows 2004 to Windows 20H2 (OS Build: 19042.685), and unexpectedly, the import error under 1.19.4 (on the sanity check) that I had just encountered before the upgrade is gone now - I also double-checked the reproducing example manually, and it's not segfaulting.

>>> import numpy as np
>>> a = np.arange(13 * 13, dtype=np.float64)
>>> a.shape = (13, 13)
>>> a = a % 17
>>> va, ve = np.linalg.eig(a)
>>> va
array([ 1.03221168e+02 +0.j        , -1.91843603e+01 +0.j        ,
        1.82126812e+01 +0.j        , -6.04004526e-01+15.84422474j,
       -6.04004526e-01-15.84422474j, -1.13692929e+01 +0.j        ,
       -6.57612485e-01+10.41755503j, -6.57612485e-01-10.41755503j,
        1.06011014e+01 +0.j        ,  7.80732773e+00 +0.j        ,
       -7.65390898e-01 +0.j        ,  2.87796761e-15 +0.j        ,
       -1.26447204e-15 +0.j        ])

I then rechecked my C:\Windows\System32\ucrtbase.dll, and even more strangely, the explorer mouse-over tells me this is 10.0.19041.546 from 10/16/2020 (i.e. definitely older than some comments here, but then I have to admit I don't know what it was before the upgrade).

Not sure what's happening, but thought I'd let people know... :)

PS.

>>> import sys, numpy; print(numpy.__version__, sys.version)
1.19.4 3.8.6 | packaged by conda-forge | (default, Dec 22 2020, 09:52:49) [MSC v.1916 64 bit (AMD64)]

@zooba
Contributor

zooba commented Jan 4, 2021

The date on the binary is probably the first time it was built after it was last modified (Windows uses a lot of caching between builds to try and keep the total build time under 24 hours). So it would seem like the fix has made its way out now.

@bashtage
Contributor

bashtage commented Jan 4, 2021

It isn't out yet. Only in development builds. Using ucrtbase.dll 19041.546 from October 13, 2020 (fully patched 20H2)

(base) ➜  linearmodels git:(add-hdfe) ✗ conda activate empty
(empty) ➜  linearmodels git:(add-hdfe) ✗ ipython
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np
 ** On entry to DGEBAL parameter number  3 had an illegal value
 ** On entry to DGEHRD  parameter number  2 had an illegal value
 ** On entry to DORGHR DORGQR parameter number  2 had an illegal value
 ** On entry to DHSEQR parameter number  4 had an illegal value
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-0aa0b027fcb6> in <module>
----> 1 import numpy as np

c:\anaconda\envs\empty\lib\site-packages\numpy\__init__.py in <module>
    303
    304     if sys.platform == "win32" and sys.maxsize > 2**32:
--> 305         _win_os_check()
    306
    307     del _win_os_check

c:\anaconda\envs\empty\lib\site-packages\numpy\__init__.py in _win_os_check()
    300                    "See this issue for more information: "
    301                    "https://tinyurl.com/y3dm3h86")
--> 302             raise RuntimeError(msg.format(__file__)) from None
    303
    304     if sys.platform == "win32" and sys.maxsize > 2**32:

RuntimeError: The current Numpy installation ('c:\\anaconda\\envs\\empty\\lib\\site-packages\\numpy\\__init__.py') fails to pass a sanity check due to a bug in the windows runtime. See this issue for more information: https://tinyurl.com/y3dm3h86

@bashtage
Contributor

bashtage commented Jan 4, 2021

conda-forge uses a newer OpenBLAS, which is why @h-vetinari's passes. The standard pip install continues to fail because the fmod patch hasn't been distributed yet.

@NekoAlosama

According to the above mention, the NumPy 1.19.5 release uses a workaround for the Windows 2004 bug, instead of waiting for Microsoft to fix it.

NumPy 1.19.5 is a short bugfix release. Apart from fixing several bugs, the main improvement is the update to OpenBLAS 0.3.13 that works around the windows 2004 bug while not breaking execution on other platforms. This release supports Python 3.6-3.9 and is planned to be the last release in the 1.19.x cycle.

@bashtage
Contributor

The final fix is nearing release and is now in beta and release channels. Hopefully out next month.

Windows 10 build 19042.782

  • We fixed an issue that causes the 64-bit fmod() and remainder() functions to damage the Floating Point Unit (FPU) stack.
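To relate a machine's build string to the builds mentioned in this thread, a small classifier can be sketched. `fmod_fix_status` is a hypothetical helper. The 782 threshold is taken from the note above for build 19042, and applying the same revision to build 19041 is an assumption, so treat the result as a hint rather than a definitive check.

```python
# Revision quoted in the thread for build 19042. Using the same threshold
# for build 19041 is an assumption, not something the thread states.
FIXED_REVISION = 782
AFFECTED_BUILDS = (19041, 19042)  # Windows 10 2004 and 20H2


def fmod_fix_status(version):
    """Classify a version string such as '10.0.19042.685'."""
    parts = [int(p) for p in version.strip().split(".")]
    build = parts[2]
    revision = parts[3] if len(parts) > 3 else None
    if build < min(AFFECTED_BUILDS):
        return "unaffected"  # 1809 and 1909 were reported as passing
    if build in AFFECTED_BUILDS:
        if revision is None:
            return "unknown"  # cannot tell without the revision number
        return "fixed" if revision >= FIXED_REVISION else "affected"
    return "unknown"  # builds this rule does not cover


for version in ("10.0.17763.1", "10.0.19041", "10.0.19042.685", "10.0.19042.782"):
    print(version, fmod_fix_status(version))
```

The revision number matters here: a build string without it cannot distinguish a patched 2004 or 20H2 install from an unpatched one.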

@cgohlke
Contributor

cgohlke commented Feb 3, 2021

The fix is now available in KB4598291.

@mattip
Member

mattip commented Feb 3, 2021

Thanks @bashtage for the first concise reproducer for this, that made it relatively simple to pinpoint the problem. Closing.
