Error on Azure CI (Windows instance) with numpy 1.19.0 #16913

Closed
mrava87 opened this issue Jul 20, 2020 · 54 comments · Fixed by MacPython/openblas-libs#35 or #16940

Comments

@mrava87

mrava87 commented Jul 20, 2020

Hello,
I have recently started experiencing problems when running tests for my project on Azure Pipelines with a Windows instance (vmImage: 'windows-2019'). Digging a little deeper (see this conversation https://developercommunity.visualstudio.com/content/problem/1102472/azure-pipeline-error-with-windows-vm.html?childToView=1119179#comment-1119179) we realised that the problem appears when numpy 1.19.0 is installed instead of numpy 1.18.5. I can see that numpy 1.19.0 was put on PyPI on June 20, which is around the time our tests started to fail. Forcing the environment to install numpy 1.18.5, as in previously successful builds, seems to solve the problem.

I just wanted to report this as I assume this is something others may have started observing (but it is quite hard to pin-point that numpy is the issue... or at least looks like it is).

Looking forward to hearing from you,
and happy to do any change to my azure pipeline setup if that can help troubleshooting the problem.

Error message:

This build works fine with numpy 1.18.5: https://dev.azure.com/matteoravasi/PyLops/_build/results?buildId=46&view=logs&j=011e1ec8-6569-5e69-4f06-baf193d1351e
A build on the same commit with numpy 1.19.0 fails: https://dev.azure.com/matteoravasi/PyLops/_build/results?buildId=43&view=results

The error is very cryptic, what I explained above is more relevant I think. Here it is anyways:

2020-07-06T13:56:01.6879900Z Windows fatal exception: access violation
2020-07-06T13:56:01.6880280Z 
2020-07-06T13:56:01.6880589Z Current thread 0x00001798 (most recent call first):
2020-07-06T13:56:01.6880973Z   File "<__array_function__ internals>", line 6 in vdot
2020-07-06T13:56:05.3412520Z ##[debug]Exit code: -1073741819
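
For readers decoding the exit code: the CI log shows it as a signed integer, while Windows status codes are usually quoted as unsigned 32-bit hex values. A minimal sketch of the conversion (the helper name `to_ntstatus` is made up for illustration) shows that -1073741819 is the same value as the 0xC0000005 access-violation code reported in the debugger output further down this thread:

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status value."""
    # Masking with 0xFFFFFFFF maps negative values onto their two's-complement form.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073741819))  # 0xC0000005
```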
@mattip
Member

mattip commented Jul 20, 2020

Does it fail consistently or only once in a while? Do you have any windows developers who can try to build the project on a local machine?

@mrava87
Author

mrava87 commented Jul 20, 2020

Hi,
thanks!

It failed consistently many times.. at that point I thought about asking Azure developers (my initial guess was that perhaps something had changed in their VMs setup).

This link has the discussion I had with a Microsoft developer who spotted the problem could have been numpy: https://developercommunity.visualstudio.com/content/problem/1102472/azure-pipeline-error-with-windows-vm.html?childToView=1119179#comment-1119179

Unfortunately I do not have anyone that can try building the project on a local windows machine :(

@mattip
Member

mattip commented Jul 20, 2020

Then we will need a clear set of steps to reproduce

@mrava87
Author

mrava87 commented Jul 20, 2020

Would the azure-pipelines.yml work?

Here is what we use (https://github.com/equinor/pylops/blob/master/azure-pipelines.yml) commented out at the moment... you can see that it is a pretty standard setup, using Python 3.7, installing dependencies in requirements-dev.txt file (https://github.com/equinor/pylops/blob/master/requirements-dev.txt) and then running the tests.

As I mentioned already, if I comment this out and force numpy 1.18.5 everything runs, so it seems the new 1.19 is what breaks it.

@bashtage
Contributor

What are the major and minor Windows version numbers of the image running on Azure? i.e., what does systeminfo print for OS Version?

@mrava87
Author

mrava87 commented Jul 21, 2020

I could find the details of the Azure VMs used in Azure Pipelines here: https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/hosted?view=azure-devops&tabs=yaml and the link to installed software https://github.com/actions/virtual-environments/blob/master/images/win/Windows2019-Readme.md

I am not sure how to run systeminfo on an Azure pipeline, any suggestions?

@bashtage
Contributor

It runs from the command line and dumps the output to terminal, so you can add it to your run as a command.

@bashtage
Contributor

You could do this in a PR that runs on CI to see what it says. I am asking since there have been issues with the 19041 build of Windows and pip NumPy.

@bashtage
Contributor

The answer was in the second link:

OS Version: 10.0.17763 Build 1282

@bashtage
Contributor

So my idea bears no fruit.

@mrava87
Author

mrava87 commented Jul 21, 2020

You say you know there are some issues with the latest pip wheels for Windows. Could this be connected to that?

@bashtage
Contributor

bashtage commented Jul 21, 2020

It is actually (probably) a Windows bug introduced in 19041. But you are on a much older version so this is not the issue.

It doesn't affect Conda NumPy, only pip NumPy because it seems to be some issue with Windows and OpenBlas.
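
A rough sketch of the build-number comparison made above, for anyone who wants to check it from Python inside a CI step instead of reading systeminfo output. The helper names are hypothetical, and treating every build from 19041 upwards as affected is an assumption made only for illustration; the thread says no more than that the problem was introduced in 19041.

```python
import platform


def windows_build(version: str) -> int:
    """Return the build number from a 'major.minor.build' version string."""
    return int(version.split(".")[2])


def is_affected_build(version: str, first_bad_build: int = 19041) -> bool:
    """True if the build is at or above first_bad_build (an assumed cutoff)."""
    return windows_build(version) >= first_bad_build


# The Azure image in this thread reports 10.0.17763, which is below 19041.
print(is_affected_build("10.0.17763"))  # False

# On a Windows agent, platform.version() is expected to give a string of this shape.
if platform.system() == "Windows":
    print(platform.version(), is_affected_build(platform.version()))
```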

@mrava87
Author

mrava87 commented Jul 22, 2020

I see :) I got an email that 1.19.1 has been released. I am going to try to retrigger the Azure pipeline, which should now install the latest version, and see if that works. Will let you know.

@mrava87
Author

mrava87 commented Jul 22, 2020

@bashtage
Contributor

bashtage commented Jul 22, 2020

A bug in OpenBlas.

Here is a reproducing example:

import numpy as np
nr = 12000
v = np.random.randn(nr) + 1j * np.random.randn(nr)
np.vdot(v, v)  # access violation
v @ v  # also an access violation

The no symbols debugging information is:

Exception thrown at 0x0000000068DBB8F0 (libopenblas.NOIJJG62EMASZI6NYURL6JBKM4EVBGM7.gfortran-win_amd64.dll)
in python.exe: 0xC0000005: Access violation reading location 0x0000000000000000.

Note that the array has to be pretty big (10k passes, 12k does not) to trigger the bug.
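
Because the crash kills the interpreter, one way to look for the size threshold is to run each attempt in a child process and inspect its exit status. The following is only a sketch under that assumption; `REPRO`, `run_isolated` and `probe` are hypothetical names, and the child needs NumPy installed.

```python
import subprocess
import sys

# The reproducer from above, parameterised by array length.
REPRO = (
    "import numpy as np\n"
    "v = np.random.randn({n}) + 1j * np.random.randn({n})\n"
    "np.vdot(v, v)\n"
)


def run_isolated(code: str) -> int:
    """Run `code` in a fresh interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", code]).returncode


def probe(n: int) -> bool:
    """True if the reproducer exits cleanly for arrays of length n."""
    return run_isolated(REPRO.format(n=n)) == 0
```

Calling `probe(10_000)` and `probe(12_000)` in turn would then report which sizes survive; a crashing child shows up as a nonzero exit status while the parent keeps running.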

@bashtage
Contributor

bashtage commented Jul 22, 2020

Quick check:

$env:OPENBLAS_VERBOSE=2
$env:OPENBLAS_CORETYPE=Prescott

passes, but the default kernel (Zen), as well as Haswell and Sandybridge, all hit access violations.
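
The same check can be scripted from Python by passing the variables to a child interpreter, so that they are already set when OpenBLAS is loaded. This is a sketch, not a recommended fix: `openblas_env` and `run_with_coretype` are hypothetical helpers, and it assumes the variables are read at library load time, which is why they are set for a fresh process instead of in the running one.

```python
import os
import subprocess
import sys


def openblas_env(coretype=None, verbose=2):
    """Copy the current environment and add the OpenBLAS overrides."""
    env = dict(os.environ)
    env["OPENBLAS_VERBOSE"] = str(verbose)
    if coretype is not None:
        env["OPENBLAS_CORETYPE"] = coretype
    return env


def run_with_coretype(code, coretype=None):
    """Run `code` in a child interpreter; return (exit status, stderr text)."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=openblas_env(coretype),
        stderr=subprocess.PIPE,
        text=True,
    )
    return result.returncode, result.stderr
```

For example, `run_with_coretype(repro_source, "Prescott")` versus `run_with_coretype(repro_source, "Haswell")`, where `repro_source` is the reproducer as a string, would compare the kernels one process at a time; the captured stderr should contain whatever OpenBLAS prints at verbosity 2.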

@mattip
Member

mattip commented Jul 22, 2020

Maybe worth checking that numpy HEAD, which uses a newer OpenBLAS 0.3.10, also fails. Or maybe you already did?

@mrava87
Author

mrava87 commented Jul 22, 2020

@mattip no, I had not tried this yet. You mean installing numpy directly from master with pip install git+https://github.com/numpy/numpy? I can give it a try :)

@mrava87
Author

mrava87 commented Jul 22, 2020

And to your question @bashtage (Do the failing tests use numba at all? numba 0.50 has a bug on some versions of windows where it incorrectly makes use of an unavailable intrinsic. This caused crashes for me in another project.), which I got via email but can't seem to see in this thread... the test that crashes uses both numpy and pyfftw operations. As it crashes with this sudden message, it is hard to tell at which line it really crashes. But I don't think pyfftw uses numba at all; at least it's not one of their dependencies.

@mrava87
Author

mrava87 commented Jul 22, 2020

I just tried installing the HEAD of NumPy directly from the GitHub repository and the Windows build runs to completion, with no sudden crash: https://dev.azure.com/matteoravasi/PyLops/_build/results?buildId=54&view=logs&j=011e1ec8-6569-5e69-4f06-baf193d1351e&t=bf6cf4cf-6432-59cf-d384-6b3bcf32ede2

Interestingly some libraries that have NumPy as dependency don’t seem to install properly (not sure why) and some tests fail for all OS, but at least it’s not a complete crash as before...

@bashtage
Contributor

No error using nightly:

pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy

@bashtage
Contributor

I just tried installing the HEAD of NumPy directly from the GitHub repository

This doesn't have OpenBLAS unless you explicitly build it in. By default you get a slow, generic BLAS with a pip install git+https://github.com/numpy/numpy.git.
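
To see which BLAS a given install was built against, `numpy.show_config()` prints the build configuration. A small sketch that captures that output as text (the wrapper `blas_report` is a hypothetical name, and the exact wording of the output varies between NumPy versions):

```python
import contextlib
import io


def blas_report() -> str:
    """Return the text printed by numpy.show_config(), or a note if NumPy is absent."""
    try:
        import numpy as np
    except ImportError:
        return "numpy is not installed"
    buffer = io.StringIO()
    # show_config() prints to stdout, so redirect stdout into a string buffer.
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


print(blas_report())
```

A wheel built against OpenBLAS would be expected to mention it somewhere in this report, whereas a plain source install as described above would not.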

@charris
Member

charris commented Jul 22, 2020

Looks like we may want to upgrade OpenBLAS for 1.19.2, so marking this.

@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Jul 22, 2020
@charris charris added this to the 1.19.2 release milestone Jul 22, 2020
@larsoner
Contributor

I think I might be experiencing the same issue on latest --pre build (numpy-1.20.0.dev0+a0028bc) on Azure:

Current thread 0x000003d0 (most recent call first):
  File "<__array_function__ internals>", line 5 in dot
  File "D:\a\1\s\mne\minimum_norm\inverse.py", line 732 in _assemble_kernel

The line in question is just:

K = np.dot(eigen_leads, trans)

If it helps, I could try saving the arrays to disk and getting them out via an Azure artifact.

@bashtage
Contributor

That looks like it. You are using the same pre that I had working correctly.

You might want to add

$env:OPENBLAS_VERBOSE=2

or

set OPENBLAS_VERBOSE=2

to your template to know which kernel is being used.

@bashtage
Contributor

If it helps, I could try saving the arrays to disk and getting them out via an Azure artifact.

It would probably be enough to know the dtypes and dimensions.

@larsoner
Contributor

larsoner commented Jul 23, 2020

Okay, reproduced on a single run of just the failing test with just numpy+scipy+matplotlib+pytest (and deps) that writes the matrices being multiplied and then uploads the artifacts, here is the artifacts tab:

https://dev.azure.com/mne-tools/mne-python/_build/results?buildId=8330&view=artifacts&type=publishedArtifacts

The last .npz should be the failing one (27 MB). Locally on Linux it dots just fine:

>>> import numpy as np
>>> data = np.load('1595525222.9485037.npz')
>>> np.dot(data['a'], data['b']).shape
(23784, 305)
>>> data['a'].shape, data['a'].dtype, data['b'].shape, data['b'].dtype
((23784, 305), dtype('>f4'), (305, 305), dtype('float64'))
>>> data['a'].flags, data['b'].flags
(  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
,   C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
)

Working on getting the OPENBLAS_VERBOSE working but it seems like every time I use pytest -s to not capture the output it actually passes. This might just be happenstance, though, we'll see...
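
If the uploaded .npz is not at hand, a rough stand-in with the same shapes, dtypes and memory layout can be built from random data. This is only a sketch: `make_operands` is a hypothetical helper, the values are random and not the real matrices, and unlike the original `a` the result owns its data, so it may or may not trigger the same crash.

```python
def make_operands(rows=23784, cols=305, seed=0):
    """Build (a, b) matching the shapes, dtypes and layouts reported above."""
    import numpy as np  # imported lazily so the helper can be defined without NumPy

    rng = np.random.default_rng(seed)
    # a: big-endian float32, Fortran-ordered, shape (rows, cols)
    a = np.asfortranarray(rng.standard_normal((rows, cols)).astype(">f4"))
    # b: native float64, C-ordered, shape (cols, cols)
    b = np.ascontiguousarray(rng.standard_normal((cols, cols)))
    return a, b


# Usage on an affected machine (assumes NumPy is installed):
#   import numpy as np
#   a, b = make_operands()
#   np.dot(a, b)
```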

@larsoner
Contributor

I reported the error in OpenMathLib/OpenBLAS#2732 and they suggested it might be fixed in master, see OpenMathLib/OpenBLAS#2728 . No idea the best way to test this, though.

@bashtage
Contributor

bashtage commented Jul 24, 2020

@mattip Do we know this is closed by MacPython/openblas-libs#35 ? Don't we need to wait until the next weekly is out?

@bashtage
Contributor

@charris I think this issue is still open, and a backport will likely be needed.

@mattip
Member

mattip commented Jul 24, 2020

Could someone with a reproducer try to build numpy with this commit to get the latest OpenBLAS binaries? So something like (maybe with typos)

git remote add mattip https://github.com/mattip/numpy.git
git fetch mattip issue-16913
git checkout issue-16913
python tools/openblas_support.py
# copy the output openblas.a to a local directory and make sure numpy uses it
mkdir openblas
copy /path/to/openblas.a openblas
set OPENBLAS=openblas
python -c "from tools import openblas_support; openblas_support.make_init('numpy')"
pip install --no-build-isolation --no-use-pep517 .

You should install gfortran with choco install -y mingw if you haven't already

@mattip
Member

mattip commented Jul 24, 2020

... this is for windows

@larsoner
Contributor

You should install gfortran with choco install -y mingw if you haven't already

Is this only required for 32-bit?

https://github.com/numpy/numpy/blob/master/azure-steps-windows.yml#L29-L31

I'll try what you suggest above with a choco install -y mingw once I figure out what the /path/to/openblas.a is -- presumably from running tools/openblas_support.py (?).

@mattip
Member

mattip commented Jul 24, 2020

Yes, python tools/openblas_support.py prints out where to find openblas.a

You need gfortran. The azure machines have mingw 64-bit installed. If you are 32-bits, the invocation is a bit different. You also need to set -m32 (but only for 32-bit).

@larsoner
Contributor

I just verbatim copied most of https://github.com/numpy/numpy/blob/master/azure-steps-windows.yml using master branch of NumPy to first reproduce the error, and was successful in having it segfault.

I then switched to mattip/issue-16913 and it fails with a URL download error for:

https://anaconda.org/multibuild-wheels-staging/openblas-libs/v0.3.9-452-g349b722d/download/openblas-v0.3.9-452-g349b722d-win_amd64-gcc_7_1_0.zip

@larsoner
Contributor

... looks like there is no 32-bit OpenBLAS for 64-bit Windows in:

https://anaconda.org/multibuild-wheels-staging/openblas-libs/files

I guess I could add the tag to get it to use 64-bit OpenBLAS?

@bashtage
Contributor

2 are there and 1 is still being built. Should be up within the hour.

@larsoner
Contributor

In the meantime I added:

        NPY_USE_BLAS_ILP64: '1'
        OPENBLAS_SUFFIX: '64_'

And it built just fine. No longer segfaults! I'll re-run it a few times just to be sure. Feel free to ping me when the 32-bit OpenBLAS Win64 libs are up and I can easily remove these lines and re-test.

@bashtage
Contributor

Any chance you could run the full test suite? :-)

python -c "import numpy; numpy.test('full')"

@larsoner
Contributor

Looks like the 32 bit ones are up, and that also works.

I'll give the full test suite a run now

@larsoner
Contributor

@bashtage
Contributor

You shouldn't waste any more time on this other issue - I can wait until next week and test the weekly which will hopefully have the BLAS.

@charris
Member

charris commented Jul 24, 2020

Note that we can run the nightly builds at any time by pushing a commit to the master branch.

@bashtage
Contributor

Ok, I'll wait until I see a new one to see if the issue with Windows 10 2004 is fixed.

@charris charris removed this from the 1.19.2 release milestone Sep 8, 2020
@charris
Member

charris commented Sep 8, 2020

@bashtage Any update on this?

@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Sep 8, 2020
@bashtage
Contributor

bashtage commented Sep 8, 2020

OpenBLAS is still broken on the most recent release of Windows. It is nontrivial to even get good debugging information because of the mixed toolchain, at least for me.

@larsoner
Contributor

FYI with OpenBLAS 0.3.16 it seems like the OPENBLAS_CORETYPE=prescott workaround I had in place on Azure ended up being problematic, so if anyone is using that workaround and sees problems with the latest NumPy pip-pre wheels you might need to remove the workaround!
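
One way to keep such a workaround from outliving its usefulness is to apply it only for the releases it was meant for. A minimal sketch, with loud caveats: `apply_coretype_workaround` is a hypothetical helper, and the `AFFECTED_NUMPY` set below is an assumption for illustration, since this thread does not establish exactly which releases need the override.

```python
import os

# Assumption for illustration only: adjust to the releases you have seen crash.
AFFECTED_NUMPY = {"1.19.0"}


def apply_coretype_workaround(numpy_version, environ=os.environ, affected=AFFECTED_NUMPY):
    """Set OPENBLAS_CORETYPE only for releases listed as affected.

    Returns True if the override was applied, False otherwise.
    """
    if numpy_version in affected:
        # setdefault leaves any value the user has already chosen untouched.
        environ.setdefault("OPENBLAS_CORETYPE", "Prescott")
        return True
    return False
```

The installed version can be looked up without importing NumPy via `importlib.metadata.version("numpy")` (Python 3.8+), so the helper can run before OpenBLAS is loaded.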
