Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Regression in DGESDD / DLASCL on macOS Vortex M1 #3654

Closed
larsoner opened this issue Jun 16, 2022 · 18 comments · Fixed by #3675
Closed

BUG: Regression in DGESDD / DLASCL on macOS Vortex M1 #3654

larsoner opened this issue Jun 16, 2022 · 18 comments · Fixed by #3675

Comments

@larsoner
Copy link
Contributor

On 8c20ca3 using OpenBLAS to build and compile SciPy I get problems when I try to take use DGESDD (or DGESVD) to take the SVD of a matrix of ones of shape (41, 50):

OpenBLAS larsoner$ make clean; git reset --hard; git clean -xdf; MACOSX_DEPLOYMENT_TARGET=11.0 make && make install
scipy larsoner$ git clean -xdf && ln -s ../site.cfg && python setup.py develop && python -c "import numpy as np, scipy.linalg; scipy.linalg.svd(np.ones((41, 50)))"
 ** On entry to DLASCL parameter number  4 had an illegal value
 ** On entry to DLASCL parameter number  4 had an illegal value
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/larsoner/python/scipy/scipy/linalg/_decomp_svd.py", line 131, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge

On the previous commit 8e4c209, things seem to be okay (SVD decomp matches that on x86_64 Linux). Same thing with being on latest develop / 7060ca5 and doing git revert 8c20ca3 to revert just one of the two commits (8c20ca3) from #3399 things are okay.

@martin-frbg looks like #3399 this was your PR, can you replicate?

cross-ref numpy/numpy#21756

@martin-frbg
Copy link
Collaborator

Sadly I cannot test on macOS/M1 with the gccfarm machine unavailable. I see no obvious error in the parameters, and all usual BLAS and LAPACK tests pass on the Linux/M1 machine there. (Getting numpy installed there may be a challenge).
Right now my assumption is either a miscompilation by xcode, or some undetected fault in the ThunderX2 kernels.
The "invalid parameter 4" from DLASCL - i.e. something creating NaNs - reminds me of #2709 where the system fmod turned out to be the culprit (though on Windows)

@martin-frbg
Copy link
Collaborator

Adapted Fortran test code from #348 also runs fine on Linux/M1

@larsoner
Copy link
Contributor Author

@martin-frbg could you paste the file/commands here so a Fortran newbie (me) could compile + run a test locally?

@larsoner
Copy link
Contributor Author

(Also, this might only be a bug with the 32-bit interface, if that matters -- TBD)

@martin-frbg
Copy link
Collaborator

gesdd.f.txt
gfortran -ffree-form gesdd.f libopenblas.a -lpthread -o gesdd
./gesdd
(my tests were done with the 32bit interface, though if you happened to mix a 64bit-integer numpy with a 32bit-integer OpenBLAS or vice versa that would go a long way towards explaining illegal parameter value errors from LAPACK calls)

@larsoner
Copy link
Contributor Author

@martin-frbg indeed I have no problem running that code snippet on develop. I'm not sure why git revert 8c20ca345aad43c2f74a72b356afdbc1ec368e31 would fix things for NumPy. Feel free to close this if this Fortran code emitting no messages for you or me convinces you that the problem has to be at the NumPy end!

@martin-frbg
Copy link
Collaborator

Oh I'm easily upset, especially by someone with an actual math degree. Just in this case there is not much to go on and no way to reproduce this right now. (Guess I could try running the Fortran testcase on the neoverse in Travis CI though)

@larsoner
Copy link
Contributor Author

Oh I'm easily upset, especially by someone with an actual math degree.

That was a long time ago, I doubt it's helping here :)

One other interesting point is that there is no problem on M1 when using TARGET=ARMV8. Is it possible that this is a more appropriate target on Apple M1 silicon than VORTEX? At least according to this, VORTEX would not seem to be the correct target:

https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_processors

This idea was brought up by @hoechenberger in numpy/numpy#21756 (comment)

@martin-frbg
Copy link
Collaborator

VORTEX is the name we picked for the Apple-developed arm64 cpus, but there are no actual M1-specific BLAS kernels in OpenBLAS so far anyway, that TARGET primarily adds the compiler flags for using the 8.3 instruction set.

ARMV8 on the other hand is the most basic arm64 target,its assembly kernels do not require more than the 8.0 instruction set (or what e.g. a Cortex A57 from ten years ago has to offer). And before #3399, the kernel selection for VORTEX was an exact
copy of this. My PR basically replaced a few of its BLAS kernels with the slightly more advanced ones that were contributed for Cavium ThunderX2 (which is still only the 8.1 instruction set). And as it was only a first stab at speeding up M1, it did not even replace something heavyweight like GEMM, only operations like ASUM, DOT, NRM2 or SWAP. (Given the experience from #2709, it might be worthwile to roll back the NRM2KERNEL changes now though...)

@brada4
Copy link
Contributor

brada4 commented Jun 19, 2022

@larsoner it would help if you troubleshoot these ways:

  • change kernel/arm64/KERNEL.VORTEX to include KERNEL.ARMV8 instead, i.e what you said is working but compiled with more functional ISA-s enabled - this checks for miscompilation if any.

  • If you see picture here at netlib there are other call chains than direct to DLASCL , so it would be worth attaching debugger breakpoint to XERBLA function and seeing which call chain is anomalous generating NaN-s. That would reduce search space below.

  • copy KERNEL.NEOVERSEN1 to KERNEL.VORTEX and bisect , i.e. drill down commenting half of D????? function overrides until it starts working

Not having hardware or even QEMU equivalent here - all hope on you ;-)

@martin-frbg
Copy link
Collaborator

@brada4 doing git revert 8c20ca3 was exactly that. And no need to copy entire KERNEL fliles, just adding

SNRM2KERNEL = ../arm/nrm2.c
DNRM2KERNEL = ../arm/nrm2.c

after the include line should be sufficient for a test. (Unfortunately the gccfarm M1 is still unavailable, and even if it was up I am unsure if I could install numpy there as it had limited disk space)

@larsoner
Copy link
Contributor Author

there are other call chains than direct to DLASCL , so it would be worth attaching debugger breakpoint to XERBLA function and seeing which call chain is anomalous generating NaN-s. That would reduce search space below.

If you can point me to a tutorial or step-by-step for how to do this I'm happy to. But I'll start with trying to follow @martin-frbg 's suggestion since I think it covers what you suggest as well ...

change kernel/arm64/KERNEL.VORTEX to include KERNEL.ARMV8 instead...

No need to copy entire KERNEL fliles, just adding...

Okay so digging into kernels/arm64/VORTEX with what is in develop, and just to be complete -- using the following:

FAILS:

include $(KERNELDIR)/KERNEL.NEOVERSEN1

SUCCEEDS (n.b. the same as git revert 8c20ca3):

include $(KERNELDIR)/KERNEL.ARMV8

SUCCEEDS:

include $(KERNELDIR)/KERNEL.NEOVERSEN1
SNRM2KERNEL = ../arm/nrm2.c
DNRM2KERNEL = ../arm/nrm2.c

FAILS (makes sense because this is a double computation)

include $(KERNELDIR)/KERNEL.NEOVERSEN1
SNRM2KERNEL = ../arm/nrm2.c

SUCCEEDS:

include $(KERNELDIR)/KERNEL.NEOVERSEN1
DNRM2KERNEL = ../arm/nrm2.c

So it seems to be the dznrm2_thunderx2t99.c (included by KERNEL.NEOVERSEN1) that is problematic!

I'm happy to debug further if someone can give me some instructions -- I'm barely a C programmer nowadays so I'm hopeless in assembly :(

If you see picture here at netlib there are other call chains than direct to DLASCL

So another way to go here is to make it clear what code path NumPy is using that hits the bug. But before that, @martin-frbg I took another look at the pure-fortran example and noticed what I think is an error -- the second call was to SGESDD rather than DGESDD. When I change it to use DGESDD, I can replicate the error:

$ gfortran -ffree-form gesdd.f /opt/OpenBLAS/lib/libopenblas.a -lpthread -o gesdd
$ ./gesdd 
 ** On entry to DLASCL parameter number  4 had an illegal value
 ** On entry to DLASCL parameter number  4 had an illegal value
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_OVERFLOW_FLAG IEEE_UNDERFLOW_FLAG
STOP Error while computing the SVD!

@martin-frbg maybe you can replicate locally now, too? Here is the modified file (but the diff is just the one line changed):

gesdd.f.txt

@martin-frbg
Copy link
Collaborator

Oh sh.... sorry for that. Reproducible now on M1/Linux. Slightly improved solution is to set the DNRM2 (and SNRM2) kernel to "nrm2.S" (as used by most arm64 targets - and corresponding znrm2.S for CNRM2/ZNRM2 ). ThunderX2 kernel must be taking a shortcut or missing an initilalization somewhere that does not show up in the regular BLAS/LAPACK tests. Maybe even something introduced by its most recent fix to handle Inf in input.

@larsoner
Copy link
Contributor Author

So @martin-frbg for now do you suggest that we put in this little workaround? It looks like you're already working on something in #3657 so happy to let you move forward however you think is best...

@martin-frbg
Copy link
Collaborator

Right now #3657 is simply to test the various DNRM2 kernels on NeoverseN1 (I do not have access to any ThunderX2 server hardware but from old issue #1854 the exact cpu probably does not matter as long as it is V8.1 or higher).

@martin-frbg
Copy link
Collaborator

Using the nrm2.S kernels will definitely fix the bug and should be more performant than the plain-C "generic" kernels. (Though a fixed thunderx2t99-style kernel should give best performance - but I am still horribly out of my depth with ARM assembly)

@larsoner
Copy link
Contributor Author

but I am still horribly out of my depth with ARM assembly)

Okay so maybe until someone can fix the thunderx2t99 code to work on M1 / VORTEX we should just switch. +1 for putting this change into #3657, especially if your test case in that PR does turn things red on CIs so we can see it switch to green!

@larsoner
Copy link
Contributor Author

I commented this over in #3657 (comment) , but just to have all the testing in one place, this KERNEL.VORTEX also appears to be okay, as suggested by @martin-frbg :

include $(KERNELDIR)/KERNEL.NEOVERSEN1
DNRM2KERNEL = ../arm64/nrm2.S
SNRM2KERNEL = ../arm64/nrm2.S

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants