BUG: linalg.eigh segfault on Windows with OpenBLAS 0.3.16 #19469

Closed
larsoner opened this issue Jul 13, 2021 · 14 comments

@larsoner
Contributor

Over in MNE we test against the nightly NumPy builds, and as of the last few hours we're hitting an eigh segfault on Windows on Azure with 1.22.0.dev0+442.g89c80ba60, on code that worked with other builds (e.g., 1.22.0.dev0+405.g8eaceff8a):

https://dev.azure.com/mne-tools/mne-python/_build/results?buildId=14294&view=logs&j=a017e066-62ca-5289-ad0b-8f57c84a089f&t=de70cabd-1dad-599c-0751-4f1f50c17e0f

Current thread 0x00000ed0 (most recent call first):
  File "c:\hostedtoolcache\windows\python\3.8.10\x64\lib\site-packages\numpy\linalg\linalg.py", line 1466 in eigh
  File "<__array_function__ internals>", line 5 in eigh
  File "D:\a\1\s\mne\beamformer\_compute_beamformer.py", line 194 in _compute_beamformer
  File "D:\a\1\s\mne\beamformer\_dics.py", line 226 in make_dics
...

I assume it's due to #19462 / OpenBLAS 0.3.16. Tomorrow I can try to reproduce locally on my Windows machine and dump the offending array to make a minimal example assuming it reproduces there. If I can't do that, I'll figure out a way to dump it on Azure and get it as a binary blob. But in the meantime I figured I'd open this in case others hit the same issue...

We also have the env var OPENBLAS_CORETYPE=Prescott in that build (from #16913); I'll first try removing that to see whether it makes things work.
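For reference, here is a minimal sketch of the kind of standalone reproduction I have in mind (the file name bad_csd.npy is a hypothetical placeholder for whatever array ends up getting dumped, not actual MNE data):

    # Set the core type before importing NumPy so OpenBLAS picks it up
    # during its one-time kernel selection.
    import os
    os.environ["OPENBLAS_CORETYPE"] = "Prescott"  # drop this line to test the default kernels

    import numpy as np

    # Hypothetical dump of the offending matrix, saved from
    # _compute_beamformer with np.save() right before the failing call.
    data = np.load("bad_csd.npy")
    w, v = np.linalg.eigh(data)  # segfaults here on the affected build
    print(w[:5])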

@larsoner
Contributor Author

Okay, it looks like the first build with the OPENBLAS_CORETYPE env var removed at least did not die:

mne-tools/mne-python#9567

I'll go ahead and close this and comment over in #16913 that the fix might be problematic now, but feel free to reopen if there actually seems to be something useful to do on the NumPy end!

@charris
Member

charris commented Jul 13, 2021

We are using ILP64 BLAS with the latest pre-wheels, so that might also lead to some issues.
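For anyone checking locally, np.show_config() prints which BLAS/LAPACK the wheel was built against; on an ILP64 build the OpenBLAS entries typically carry the 64_ symbol suffix (the exact section names vary by NumPy version, so treat this only as a hint):

    import numpy as np
    # Prints the BLAS/LAPACK build info; an ILP64 wheel typically reports
    # symbol-suffixed (64_) OpenBLAS sections rather than the LP64 ones.
    np.show_config()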

@mattip
Member

mattip commented Jul 14, 2021

Thanks for testing against the nightly builds. Using OPENBLAS_CORETYPE=Prescott should work since that is meant to use the "lowest common denominator" kernels. @martin-frbg thoughts?

@martin-frbg

There have been reports of LAPACK testsuite segfaults on x86_64 with some operating systems (namely OSX), which may be linked to PR 3250 (adding a shortcut in SGEMV/DGEMV for small cases that "should not" need buffer allocation) and AVX512 targets (which is what Azure runs on AFAIK).
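A quick way to check whether a given runner actually exposes AVX512 is NumPy's own CPU-feature detection (this reflects what NumPy detected at import time, not which kernel OpenBLAS selected, so it is only a hint, and the attribute location is an internal detail that may differ across versions):

    import numpy as np
    # Dict of CPU features NumPy detected at import time, e.g. 'AVX512F': True.
    feats = np.core._multiarray_umath.__cpu_features__
    print({k: v for k, v in feats.items() if k.startswith("AVX512")})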

@charris
Member

charris commented Jul 14, 2021

This is troubling for 1.21.2: I'd like to backport this for the arm64 fixes, but I'd also like it to work for Prescott.

EDIT: ARM64 wheels don't build without 0.3.16, so that forces my hand.

@martin-frbg

0.3.17 released now with the fixes

@larsoner
Contributor Author

Let me know if NumPy rolls out a wheel with 0.3.17 and I'm happy to put OPENBLAS_CORETYPE=prescott back in my Azure build and restart it a few times to make sure it works!

@charris
Member

charris commented Jul 15, 2021

> 0.3.17 released now with the fixes

@martin-frbg Great. I note that 64-bit OpenBLAS on arm64 hangs when testing the dot product.

linalg/tests/test_linalg.py::test_blas64_dot config.sh: line 49:   381 Killed                  $PYTHON_EXE -c "$(get_test_cmd)"

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received

See https://travis-ci.com/github/MacPython/numpy-wheels/jobs/523962484

@martin-frbg

Hm. I don't think anybody hurt dot in recent releases... anything else that test is exercising?

@seberg
Member

seberg commented Jul 15, 2021

Could it be that the test just bails out because it is extremely slow due to swapping? I am not sure how reliable our free_bytes=16e9 is.

@mattip
Member

mattip commented Jul 15, 2021

It may be a segfault and not a hang

@martin-frbg

martin-frbg commented Jul 15, 2021

Hm. Passes all the simple tests, including xianyi's BLAS-Tester (ATLAS testsuite), on the Mac mini in the gcc compile farm. I do not think I want to try building Python there though - can you just restart the Travis job to see if it could have been some unrelated fault?
EDIT: passed the LAPACK testsuite as well

@charris
Member

charris commented Jul 15, 2021

@martin It happened consistently: three tests/push, many pushes.

> Could it be that the test just bails out because it is extremely slow due to swapping?

Hmm, could be; the Travis machine may incorrectly report memory. I don't expect any of the test machines to actually run that test on account of too little memory. In fact, @requires_memory(free_bytes=16e9) looks too small for that test; that is the size of just one of the input vectors.
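As a rough back-of-the-envelope check (assuming the test allocates a float64 vector just past the 32-bit index limit, which is what a 64-bit-indexing dot test would need; the actual dtype and shape in the test may differ):

    # Rough size of one operand for a 64-bit-indexing dot test,
    # assuming float64 and a length just past 2**31 (both assumptions).
    n = 2**31 + 1
    print(n * 8 / 1e9)  # ~17.2 GB, already above the 16e9-byte threshold on its own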

@charris
Member

charris commented Jul 16, 2021

@martin Looks like a test problem; the testing process is OOM-killed. Default Travis CI memory ranges from 2 to 4 GB, so the test should not normally run.
