lapack_sgeev routine not going to all routines shown in the call graph while debugging. #1013

Open
AjaySingh40 opened this issue Apr 25, 2024 · 17 comments

Comments

@AjaySingh40

Hello all,
While debugging the sgeev or ssyev functions of LAPACK with gdb, execution does not enter all the functions shown in the call graph.
It only enters some LAPACK functions, such as lapack_dge_trans.c, lapacke_nancheck.c, lapacke_sgeev.c, lapacke_sge_nancheck.c, lapacke_sgeev_work.c and ../kernel/arm64/../generic/lsame.c.
It does not enter functions like sscal.c, sgemv.c, scopy, strmv etc.
I am not able to figure out why it does not enter these routines. It should reach some routines where the basic algebraic operations are performed.
How can I get the details of those routines? I am using gdb for debugging.
Thank You.

@martin-frbg
Collaborator

The call graph contains everything that may be called depending on the properties of your input, not necessarily what will be called on every invocation. Also the call graphs you are using are for the reference implementations of both LAPACK and BLAS, while the "kernel/arm64" path that you mentioned strongly suggests that you are using OpenBLAS, which re-implements some functions in a different way.

@AjaySingh40
Author

Yes, I am using OpenBLAS. When debugging the LAPACKE_sgesv routine, it calls functions under "kernel/arm64" such as sgemv, sscal etc., where the basic operations are performed. But for other routines like ssyev or sgeev, I don't see any functions under "kernel/arm64" being called. There must be routines that do the algebraic operations, and they should be called (correct me if I am wrong). I tried different data sizes, but those routines are never called.
Thank You.

@langou
Contributor

langou commented Apr 25, 2024

For a run of ssyev or sgeev, which routines do you see? For SYEV, I would expect to see SYMV, SYR2, SYR2K for the symmetric tridiagonal reduction. Then it gets complicated, but yes, you should see some GEMM, TRMM, GEMV, TRMV.

You can see some of the call graph at:
https://netlib.org/lapack/explore-html-3.6.1/d2/d8a/group__double_s_yeigen_ga442c43fca5493590f8f26cf42fed4044.html

As Martin said, yes, (1) during a specific run, not all functions in the call tree are used, and (2) we can only speak for what is done in reference LAPACK. There is nothing wrong with using OpenBLAS. This is great actually, but then I am not sure whether they changed the LAPACK algorithm or not.

All this being said, I agree that it is weird that you do not see more routines. We can explain it, but I find it surprising. Maybe some libraries are not compiled with the correct debug flag, so that GDB cannot "see" the routines in these libraries and only shows the higher-level drivers? (Not exactly sure what I am writing here!) Maybe you can tell us all the routines that are called in a given run, and we start from there.

@martin-frbg
Collaborator

GESV in OpenBLAS is reimplemented (parallelized) as GETRF/GETRS, but you should eventually find some GEMM/GEMV on your backtrace. There are no dirty tricks in GEEV, so it should follow the call graph of the reference implementation. I trust you (re)built all of OpenBLAS with DEBUG=1 or the equivalent "-g" compiler flag?

@AjaySingh40
Author

Rebuilt the library and debugged ssyev. The following routines are seen while debugging:

  1. ssyev.c (program name)
  2. lapacke_ssyev.c
  3. lapacke_nancheck.c
  4. lapacke_str_nancheck.c
  5. lapacke_lsame.c
  6. ../kernel/arm64/../generic/lsame.c
  7. lapacke_ssyev_work.c
  8. ssyev.f
  9. ilaenv.f
  10. ../INSTALL/sroundup_lwork.f
  11. lapacke_ssy_trans.c
  12. ../INSTALL/slamch.f
  13. slansy.f
  14. slaisnan.f
  15. sisnan.f
  16. ssytrd.f
  17. ssytd2.f
  18. slarfg.f
  19. sorgtr.f
  20. sorgql.f
  21. sorg2l.f
  22. slarf.f
  23. scal.c
  24. sorg2l.f
  25. ssteqr.f
  26. slanst.f
  27. slaev2.f
  28. slasr.f
  29. swap.c
  30. ../kernel/arm64/swap_thunderx2t99.S

In OpenBLAS-0.3.26, I changed some of the routines under kernel/arm64 from .S to my own SVE .c routines, e.g. gemv_n.S to gemv_n.c, plus swap.c, scal.c, copy.c etc. When I run the ssyev routine on this SVE-based BLAS, it runs fine with matrix size 64 and gives the error "failed to calculate the eigenvalues"; with a further increase in size, like 600 or 1000, it gives "segmentation fault". I am not able to figure out why it gives a segmentation fault or fails to calculate the eigenvalues. Any help is appreciated.
Thank You.

@martin-frbg
Collaborator

If you are already debugging with gdb, it should be able to tell you where in the code the segmentation fault occurs (and if it is in any of the functions you wrote, or a pre-existing problem in OpenBLAS or LAPACK). At the gdb prompt, enter "handle 11 nopass" so that the program does not terminate on the segfault, and use "bt" to see the call stack when the segfault occurs.

@AjaySingh40
Author

AjaySingh40 commented May 3, 2024

While debugging with size 200, it gives the following error in OpenBLAS-0.3.26/lapack-netlib/SRC/ilaenv.f:
ilaenv (ispec=1, name=<error reading variable: Cannot access memory at address 0xffffbe790000>,
opts=<error reading variable: Cannot access memory at address 0x1000000000000>, n1=200, n2=-1, n3=-1, n4=-1, _name=6,
_opts=1) at ilaenv.f:271

ssyev (jobz=..., uplo=..., n=200, a=..., lda=200, w=..., work=..., lwork=6800, info=0, _jobz=1, _uplo=1) at ssyev.f:184
ilaenv (ispec=1, name=<error reading variable: Cannot access memory at address 0xffffbe790000>,
opts=<error reading variable: Cannot access memory at address 0x1000000000000>, n1=200, n2=-1, n3=-1, n4=-1, _name=6,
_opts=1) at ilaenv.f:749

@martin-frbg
Collaborator

looks like part of the stack got overwritten, try going "up" until you reach the last call that had meaningful arguments

@AjaySingh40
Author

AjaySingh40 commented May 3, 2024

ilaenv (ispec=1, name=<error reading variable: Cannot access memory at address 0xffffbe790000>,
opts=<error reading variable: Cannot access memory at address 0x1000000000000>, n1=200, n2=-1, n3=-1, n4=-1, _name=6,
_opts=1) at ilaenv.f:749
749 END
(gdb)
189 $ 130, 140, 150, 160, 160, 160, 160, 160, 160)ISPEC
(gdb) up
#1 0x0000ffffbdd95c54 in ssyev (jobz=..., uplo=..., n=200, a=..., lda=200, w=..., work=..., lwork=6800, info=0, _jobz=1,
_uplo=1) at ssyev.f:191
191 NB = ILAENV( 1, 'SSYTRD', UPLO, N, -1, -1, -1 )
(gdb) up
#2 0x0000ffffbe56c028 in LAPACKE_ssyev_work (matrix_layout=101, jobz=86 'V', uplo=85 'U', n=200, a=0x4350b0, lda=200,
w=0xffffffffc060, work=0x45c5d0, lwork=6800) at lapacke_ssyev_work.c:69
69 LAPACK_ssyev( &jobz, &uplo, &n, a_t, &lda_t, w, work, &lwork, &info );
(gdb) up
#3 0x0000ffffbe56bdc4 in LAPACKE_ssyev (matrix_layout=101, jobz=86 'V', uplo=85 'U', n=200, a=0x4350b0, lda=200,
w=0xffffffffc060) at lapacke_ssyev.c:68
68 info = LAPACKE_ssyev_work( matrix_layout, jobz, uplo, n, a, lda, w, work,
(gdb) up
#4 0x0000000000400b90 in main () at ssyev.c:49
49 info = LAPACKE_ssyev( LAPACK_ROW_MAJOR, 'V', 'U', n, a, lda, w );

@martin-frbg
Collaborator

strange, this does not even look as if one of your newly written BLAS kernels got called

@martin-frbg
Collaborator

did you rebuild everything (make clean; make) after making your code changes ?

@AjaySingh40
Author

yes.

@AjaySingh40
Author

AjaySingh40 commented May 6, 2024

info = LAPACKE_ssyev( LAPACK_ROW_MAJOR, 'V', 'U', n, a, lda, w );
While calling this routine, it gives info>0 (failed to calculate the eigenvalues) for all matrix sizes greater than 32. Can this error also be due to the ilaenv.f file only, or can there be some other problem as well? The same problem occurs with LAPACK_sgeev and LAPACK_sgesvd.
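
For reference, the value returned by LAPACKE_ssyev follows the usual LAPACK convention: 0 means success, a negative value -i means the i-th argument was illegal, and a positive value i means the eigenvalue iteration did not converge (i off-diagonal elements of an intermediate tridiagonal form did not converge to zero). The sketch below only classifies that value. The names `ssyev_status`, `classify_ssyev_info` and `report_ssyev_info` are hypothetical helpers, not part of LAPACKE, and the two memory-error codes are copied from lapacke.h so that the sketch compiles without the library.

```c
#include <assert.h>
#include <stdio.h>

/* Memory-error codes as defined in lapacke.h, repeated here as an assumption
   so the sketch does not need the header. */
#define SKETCH_WORK_MEMORY_ERROR      (-1010)
#define SKETCH_TRANSPOSE_MEMORY_ERROR (-1011)

typedef enum {
    SSYEV_OK,
    SSYEV_BAD_ARGUMENT,
    SSYEV_NO_CONVERGENCE,
    SSYEV_OUT_OF_MEMORY
} ssyev_status;

/* Hypothetical helper: map the integer returned by LAPACKE_ssyev to a category. */
static ssyev_status classify_ssyev_info(int info)
{
    if (info == 0)
        return SSYEV_OK;
    if (info == SKETCH_WORK_MEMORY_ERROR || info == SKETCH_TRANSPOSE_MEMORY_ERROR)
        return SSYEV_OUT_OF_MEMORY;
    if (info < 0)
        return SSYEV_BAD_ARGUMENT;
    return SSYEV_NO_CONVERGENCE;
}

/* Hypothetical helper: print a one-line explanation of the return value. */
static void report_ssyev_info(int info, FILE *out)
{
    assert(out != NULL);
    switch (classify_ssyev_info(info)) {
    case SSYEV_OK:
        fprintf(out, "ssyev: success\n");
        break;
    case SSYEV_BAD_ARGUMENT:
        fprintf(out, "ssyev: argument %d had an illegal value\n", -info);
        break;
    case SSYEV_OUT_OF_MEMORY:
        fprintf(out, "ssyev: LAPACKE could not allocate memory\n");
        break;
    case SSYEV_NO_CONVERGENCE:
        fprintf(out, "ssyev: no convergence, %d off-diagonal elements did not reach zero\n", info);
        break;
    }
}
```

Since ILAENV only returns tuning parameters such as block sizes, a positive info after replacing BLAS kernels seems more likely to come from wrong numerical results in one of those kernels, though that is an inference and not something confirmed in this thread.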

@AjaySingh40
Author

AjaySingh40 commented May 7, 2024

RELATIVE MACHINE PRECISION IS TAKEN TO BE 1.2E-07

cblas_sgemv PASSED THE TESTS OF ERROR-EXITS

******* FATAL ERROR - PARAMETER NUMBER 7 WAS CHANGED INCORRECTLY *******
******* cblas_sgemv FAILED ON CALL NUMBER:
10: cblas_sgemv ( CblasNoTrans, 2, 1, 0.7, A, 3, X, 1, 0.0, Y, 1) .
******* cblas_sgemv FAILED ON CALL NUMBER:
4: cblas_sgemv ( CblasNoTrans, 2, 1, 0.0, A, 3, X, 1, 0.0, Y, 1) .

******* FATAL ERROR - TESTS ABANDONED *******
OPENBLAS_NUM_THREADS=2 ./xdcblat2 < din2
TESTS OF THE DOUBLE PRECISION LEVEL 2 BLAS

THE FOLLOWING PARAMETER VALUES WILL BE USED:
FOR N 0 1 2 3 5 9 63
FOR K 0 1 2 4
FOR INCX AND INCY 1 2 -1 -2
FOR ALPHA 0.0 1.0 0.7
FOR BETA 0.0 1.0 0.9

ROUTINES PASS COMPUTATIONAL TESTS IF TEST RATIO IS LESS THAN 16.00

COLUMN-MAJOR AND ROW-MAJOR DATA LAYOUTS ARE TESTED

RELATIVE MACHINE PRECISION IS TAKEN TO BE 2.2D-16

cblas_dgemv PASSED THE TESTS OF ERROR-EXITS

******* FATAL ERROR - PARAMETER NUMBER 7 WAS CHANGED INCORRECTLY *******
******* cblas_dgemv FAILED ON CALL NUMBER:
10: cblas_dgemv ( CblasNoTrans, 2, 1, 0.7, A, 3, X, 1, 0.0, Y, 1) .
******* cblas_dgemv FAILED ON CALL NUMBER:
4: cblas_dgemv ( CblasNoTrans, 2, 1, 0.0, A, 3, X, 1, 0.0, Y, 1) .

******* FATAL ERROR - TESTS ABANDONED *******
OPENBLAS_NUM_THREADS=2 ./xccblat2 < cin2
TESTS OF THE COMPLEX LEVEL 2 BLAS

THE FOLLOWING PARAMETER VALUES WILL BE USED:
FOR N 0 1 2 3 5 9 63
FOR K 0 1 2 4
FOR INCX AND INCY 1 2 -1 -2
FOR ALPHA ( 0.0, 0.0) ( 1.0, 0.0) ( 0.7,-0.9)
FOR BETA ( 0.0, 0.0) ( 1.0, 0.0) ( 1.3,-1.1)

ROUTINES PASS COMPUTATIONAL TESTS IF TEST RATIO IS LESS THAN 16.00

COLUMN-MAJOR AND ROW-MAJOR DATA LAYOUTS ARE TESTED

RELATIVE MACHINE PRECISION IS TAKEN TO BE 1.2E-07

While building the library, it fails this test case. Could you please explain the meaning of incx and incy being negative, and whether K is used for the number of rows? Why does it show "FATAL ERROR - PARAMETER NUMBER 7 WAS CHANGED INCORRECTLY", and what is this parameter number 7? If I look at the cblas_sgemv( order, transa, m, n, alpha, a, lda, x, incx, beta, y, incy ); routine in BLAS, it has the 1st parameter "order", which is missing here, and there parameter 7 is lda.
Thank You.

@martin-frbg
Collaborator

Negative increments mean stepping over the array elements backwards. There is no K in GEMV, so this parameter is most likely unused in this particular test (cin2 contains inputs for a number of different level2 BLAS functions that xccblat2 checks one after the other). "order" is relevant for CBLAS only (BLAS assumes default Fortran matrix order, CBLAS offers you a choice of row-major and column-major and transforms the input accordingly for the actual BLAS call). Indeed the error message suggests that one of the input-only arguments was overwritten, which should not happen.
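
To make the two points above concrete, here is a minimal plain-C sketch (not OpenBLAS code) of a column-major, non-transposed GEMV that follows the reference BLAS convention for negative increments: when the increment is negative, the logically first element of the vector is stored last. The second function is a hypothetical check in the spirit of the test driver, verifying that the input vector x is left untouched. As an assumption about the test: if it numbers the arguments in Fortran order (TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY), parameter 7 would be X, not lda.

```c
#include <assert.h>
#include <string.h>

/* Illustrative y := alpha*A*x + beta*y for a column-major m-by-n matrix A
   whose columns are lda elements apart. */
static void ref_sgemv_n(int m, int n, float alpha, const float *a, int lda,
                        const float *x, int incx, float beta, float *y, int incy)
{
    assert(lda >= (m > 1 ? m : 1));
    /* With a negative increment, start at the far end and walk backwards. */
    int kx = incx < 0 ? -(n - 1) * incx : 0;
    int ky = incy < 0 ? -(m - 1) * incy : 0;

    /* beta == 0 means y is overwritten, so its old contents are not read. */
    for (int i = 0; i < m; i++)
        y[ky + i * incy] = (beta == 0.0f) ? 0.0f : beta * y[ky + i * incy];

    for (int j = 0; j < n; j++) {
        float t = alpha * x[kx + j * incx];   /* x is only ever read */
        for (int i = 0; i < m; i++)
            y[ky + i * incy] += t * a[i + j * lda];
    }
}

/* Hypothetical check: returns 1 if x is bit-for-bit unchanged by the call. */
static int x_unchanged_after_gemv(int m, int n, const float *a, int lda,
                                  float *x, int incx, float *y, int incy)
{
    float saved[64];
    int lenx = 1 + (n - 1) * (incx < 0 ? -incx : incx);
    assert(n >= 1 && lenx <= 64);
    memcpy(saved, x, (size_t)lenx * sizeof *x);
    ref_sgemv_n(m, n, 1.0f, a, lda, x, incx, 0.0f, y, incy);
    return memcmp(saved, x, (size_t)lenx * sizeof *x) == 0;
}
```

Swapping a kernel under test in place of `ref_sgemv_n` inside the check is one way to find out whether that kernel writes into x.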

@AjaySingh40
Author

THE FOLLOWING PARAMETER VALUES WILL BE USED:
FOR N 0 1 2 3 5 9 63
FOR K 0 1 2 4
FOR INCX AND INCY 1 2 -1 -2
FOR ALPHA 0.0 1.0 0.7
FOR BETA 0.0 1.0 0.9

ROUTINES PASS COMPUTATIONAL TESTS IF TEST RATIO IS LESS THAN 16.00

COLUMN-MAJOR AND ROW-MAJOR DATA LAYOUTS ARE TESTED

RELATIVE MACHINE PRECISION IS TAKEN TO BE 1.2E-07

cblas_sgemv PASSED THE TESTS OF ERROR-EXITS

******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
EXPECTED RESULT COMPUTED RESULT
1 -0.238480 0.119503
2 0.198470 0.219528
3 -0.489595E-01 -0.238480
4 0.352719E-01 -0.489595E-01
******* cblas_sgemv FAILED ON CALL NUMBER:
440: cblas_sgemv ( CblasNoTrans, 4, 2, 1.0, A, 5, X, 1, 0.0, Y, 1) .
******* cblas_sgemv FAILED ON CALL NUMBER:
4: cblas_sgemv ( CblasNoTrans, 2, 1, 0.0, A, 3, X, 1, 0.0, Y, 1).
Now it's not able to calculate the values accurately. In these test cases the values of lda used are 5 and 3, right? Could you please explain this? I tried using lda = max(1, row_size), but in this test case it takes a value of lda greater than row_size. What is the significance of lda in the gemv routine?
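
On the lda question: in column-major storage, lda is the distance in elements between the starts of two consecutive columns. It must be at least max(1, m), but it may be larger, for example when A is a submatrix of a bigger array, which is consistent with the test calls above using lda = 3 for m = 2 and lda = 5 for m = 4. A minimal sketch of the addressing, with hypothetical helper names `colmajor_at` and `sum_submatrix`:

```c
#include <assert.h>
#include <stddef.h>

/* Element (i, j) of a column-major matrix whose columns are lda floats apart. */
static float colmajor_at(const float *a, int lda, int i, int j)
{
    assert(lda >= 1 && i >= 0 && i < lda && j >= 0);
    return a[(size_t)i + (size_t)j * (size_t)lda];
}

/* Sum of the leading m-by-n submatrix. Rows m..lda-1 of each column are
   padding and are skipped, not read as data. */
static float sum_submatrix(const float *a, int lda, int m, int n)
{
    float s = 0.0f;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            s += colmajor_at(a, lda, i, j);
    return s;
}
```

A kernel that steps from one column to the next by m instead of lda reads the padding as if it were matrix data, which gives wrong results whenever lda > m.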

@AjaySingh40
Author

How should row-major and column-major matrices be handled while doing an operation? If I write a program for row major, will it work for both column major and row major? As you mentioned, BLAS assumes the default Fortran matrix order, which means it is column major. I tried with column major, but it still fails the test cases while building the library. So the order should not be the problem?
cblas_sgemv PASSED THE TESTS OF ERROR-EXITS

******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
EXPECTED RESULT COMPUTED RESULT
1 0.263061 0.263061
2 -0.330376 0.00000
******* cblas_sgemv FAILED ON CALL NUMBER:
7: cblas_sgemv ( CblasNoTrans, 2, 1, 1.0, A, 3, X, 1, 0.0, Y, 1) .
******* cblas_sgemv FAILED ON CALL NUMBER:
4: cblas_sgemv ( CblasNoTrans, 2, 1, 0.0, A, 3, X, 1, 0.0, Y, 1) .

******* FATAL ERROR - TESTS ABANDONED *******
OPENBLAS_NUM_THREADS=2 ./xdcblat2 < din2

It's able to calculate one value correctly but fails in the other.
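
On row-major versus column-major: a row-major m-by-n matrix with row stride lda occupies memory exactly like the column-major n-by-m matrix A^T with the same lda. A row-major "no transpose" GEMV can therefore be expressed as a column-major "transpose" GEMV with m and n swapped, so a kernel only ever needs to handle column-major data. The sketch below is plain C written for illustration, not OpenBLAS code, and the function names are hypothetical.

```c
#include <assert.h>

/* Column-major y := A*x, A is m-by-n, columns lda apart. */
static void colmajor_gemv_n(int m, int n, const float *a, int lda,
                            const float *x, float *y)
{
    assert(lda >= m);
    for (int i = 0; i < m; i++)
        y[i] = 0.0f;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            y[i] += a[i + j * lda] * x[j];
}

/* Column-major y := A^T*x, A is m-by-n, so x has m elements and y has n. */
static void colmajor_gemv_t(int m, int n, const float *a, int lda,
                            const float *x, float *y)
{
    assert(lda >= m);
    for (int j = 0; j < n; j++) {
        float s = 0.0f;
        for (int i = 0; i < m; i++)
            s += a[i + j * lda] * x[i];
        y[j] = s;
    }
}

/* Row-major y := A*x, A is m-by-n with rows lda apart. The same memory read
   as column-major is the n-by-m matrix A^T, so call the transposed routine
   with the dimensions swapped. */
static void rowmajor_gemv_n(int m, int n, const float *a, int lda,
                            const float *x, float *y)
{
    colmajor_gemv_t(n, m, a, lda, x, y);
}
```

One consequence, assuming the CBLAS layer performs this kind of swap: since the test output says both layouts are tested, the run exercises both the non-transposed and the transposed kernel even when every call shown uses CblasNoTrans.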

3 participants