mingw(clang32) github runner is broken #780

Open
DrTimothyAldenDavis opened this issue Mar 22, 2024 · 11 comments

@DrTimothyAldenDavis
Owner

Github broke its mingw(clang32) runner.

The stable branch CI worked fine on March 2, 2024. The same CI fails on March 21, 2024. Nothing changed in the meantime: SuiteSparse and its .github/workflows files were unchanged. What did change was the GitHub runner. GitHub switched the mingw(clang32) runner, and changed clang and openmp from 17.* to 18.*. Something broke, and it's not SuiteSparse.

See this update, which disables the mingw(clang32) tests:
b1bd9cc

The latest dev2 code breaks in the same way as the stable branch now breaks, with 4 test failures in LAGraph. One is an "OMP: out of heap memory" error, which is very strange since the problems being solved are very small.

Once the GitHub runner is fixed, the above change to the SuiteSparse/.github/workflows/build.yaml file can be restored to its original state.

@DrTimothyAldenDavis DrTimothyAldenDavis added the external bug porting issue, or problem with external library or system label Mar 22, 2024
@DrTimothyAldenDavis DrTimothyAldenDavis self-assigned this Mar 22, 2024
@mmuetzel
Contributor

Correct. LLVM was updated from version 17 to version 18 recently in MSYS2.

If I understand correctly some of the tests for LAGraph are failing since that update. Is that correct? Or is there another issue?
To be honest, I don't understand what is done in those tests. Is there a commonality to the failing tests?

Maybe the LLVM update broke their compiler. Ideally, we could report that upstream with some context on how to reproduce the error.

@mmuetzel
Contributor

Some background: MSYS2 is in the process of dropping support for 32-bit platforms:
https://www.msys2.org/news/#2023-12-13-starting-to-drop-some-32-bit-packages

But iiuc, they didn't plan on dropping support for the compiler yet.
Distributing a broken compiler is worse than distributing nothing though imho...

@DrTimothyAldenDavis
Owner Author

Yes, there are 4 tests that fail in LAGraph. I thought at first it was because of some of my changes in GraphBLAS (9.0.3 to 9.1.0). But then I tried the stable branch and it failed in the identical manner.

The errors are strange but are repeatable. One method fails with "OMP: out of heap memory" which makes no sense. I'm guessing it's a bug in the update to OpenMP. Perhaps CLANG32 with no OpenMP would work.

The code in the stable branch passed the CI about a week ago. It also passed here:
https://github.com/DrTimothyAldenDavis/SuiteSparse/actions/runs/8124994169
which is the same code in the current stable branch ( d4dad6c ).

When the CLANG32 CI failed on dev2, I tried running it manually on the same d4dad6c version in the stable branch, but it failed:
https://github.com/DrTimothyAldenDavis/SuiteSparse/actions/runs/8379970870 .

Between these 2 CI runs of the stable branch, on d4dad6c, none of my code changed. The only thing that changed was the GitHub runner used. I diff'd the logs and saw that these 2 runs use different GitHub runners. I had to process the logs to strip the leading text on each line first.
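
For anyone wanting to repeat the comparison, the log processing could look roughly like the sketch below. It is not the script used in this thread. It assumes each raw log line starts with an ISO-8601 UTC timestamp (for example `2024-03-21T10:00:00.1234567Z `), and the function names `strip_timestamps` and `diff_logs` as well as the sample lines are hypothetical.

```python
import difflib
import re

# Assumption: each raw log line starts with an ISO-8601 UTC timestamp such as
# "2024-03-21T10:00:00.1234567Z " followed by the actual message.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z ?")


def strip_timestamps(lines):
    """Return the lines with the trailing newline and leading timestamp removed."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_logs(bad_lines, good_lines):
    """Unified diff (no context lines) of two logs after timestamp stripping."""
    return list(difflib.unified_diff(
        strip_timestamps(bad_lines),
        strip_timestamps(good_lines),
        fromfile="bad", tofile="good", lineterm="", n=0))


if __name__ == "__main__":
    # Hypothetical two-line excerpts; real logs would be read from files.
    bad = ["2024-03-21T10:00:00.0000001Z Current runner version: '2.314.1'\n",
           "2024-03-21T10:00:01.0000002Z Prepare workflow directory\n"]
    good = ["2024-03-02T09:00:00.0000001Z Current runner version: '2.313.0'\n",
            "2024-03-02T09:00:07.0000002Z Prepare workflow directory\n"]
    for line in diff_logs(bad, good):
        print(line)
```

Without the stripping step every line would differ because of its timestamp, which is why the raw logs cannot be diffed directly.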

Here is the good output from 3 weeks ago, with the time stamp stripped from each line:
good.txt

Here is the bad output from just yesterday:
bad.txt

and the diff:
diff_bad_good.txt

Here is a trimmed diff with just the pertinent problems:
summary.txt

In the summary.txt file, the 4 failed tests are the same that fail when using the latest GraphBLAS 9.1.0 with LAGraph 1.1.3, in the dev2 and now dev branches.

So it's not my code that's broken. Something broke in github.

@mmuetzel
Contributor

I agree that it is pretty unlikely that this is an error in the SuiteSparse sources that only shows up in that build configuration.
I don't think it is the GitHub runners that cause the issue here. The same runners still work correctly for the other build environments (e.g., MINGW32 which is GCC targeting Windows 32-bit).
It's more likely that it is the update to a newer LLVM in MSYS2 that caused the issue.
MSYS2 packages and distributes binaries for Windows (MinGW), similar to Homebrew for macOS. They also do rolling releases, with all their advantages and disadvantages.

Do the four failing tests have anything in common? Like do they use the same omp pragma or something similar? Or do they test the same functions in GraphBLAS or LAGraph that might get miscompiled?

LLVM 18.1.2 has been released recently. MSYS2 will probably update to that version soon. Maybe, they've already fixed it?

@DrTimothyAldenDavis
Owner Author

I haven't figured out why those 4 tests fail. They seem to have nothing in common. They likely do, I just don't know what it is. It's hard to track down since I'm not even sure which calls to GraphBLAS are failing, since each of these failed LAGraph methods makes lots of calls to GraphBLAS. It's probably a bug in OpenMP that is causing GraphBLAS to fail in some weird way.

Yes, when I say "the github runner is broken" I meant something in github or in the packages it uses is broken. I'm guessing it's either the 32-bit clang compiler or its openmp library that's broken.

The first few lines of summary.txt show the 2 GitHub runner versions:

1c1
< Current runner version: '2.314.1'
---
> Current runner version: '2.313.0'

and later on, you see the llvm and openmp differences:

145,147c174,176
<  mingw-w64-clang-i686-llvm-18.1.1-3-any downloading...
<  mingw-w64-clang-i686-clang-18.1.1-3-any downloading...
<  mingw-w64-clang-i686-llvm-libs-18.1.1-3-any downloading...
---
>  mingw-w64-clang-i686-llvm-17.0.6-7-any downloading...
>  mingw-w64-clang-i686-clang-17.0.6-7-any downloading...
>  mingw-w64-clang-i686-llvm-libs-17.0.6-7-any downloading...

and this one:

183d211
<  mingw-w64-clang-i686-openmp-18.1.1-1-any downloading...
184a213
>  mingw-w64-clang-i686-openmp-17.0.6-1-any downloading...

Those packages are the only things that differ between the two runs. My code is the same. The one with clang-18.* fails while clang-17.* works.
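
The package comparison above can also be automated. The following is only a sketch under assumptions: it expects log lines of the form `<name>-<version>-<release>-<arch> downloading...`, as in the excerpts quoted here, and the names `package_versions` and `changed_packages` are hypothetical helpers, not part of any existing tool.

```python
import re

# Assumed line layout, taken from the excerpts in this thread:
#   " mingw-w64-clang-i686-llvm-18.1.1-3-any downloading..."
# i.e. <name>-<version>-<release>-<arch> followed by " downloading...".
PACKAGE = re.compile(
    r"(?P<name>[\w.+-]+?)-(?P<version>\d[\w.+]*)-(?P<rel>\d+)-(?P<arch>\w+)"
    r" downloading\.\.\.")


def package_versions(log_lines):
    """Map each package name found in the log to its 'version-release' string."""
    versions = {}
    for line in log_lines:
        match = PACKAGE.search(line)
        if match:
            versions[match.group("name")] = (
                f'{match.group("version")}-{match.group("rel")}')
    return versions


def changed_packages(bad_lines, good_lines):
    """Return {name: (good_version, bad_version)} for packages that differ."""
    bad = package_versions(bad_lines)
    good = package_versions(good_lines)
    return {name: (good.get(name), bad.get(name))
            for name in sorted(set(bad) | set(good))
            if bad.get(name) != good.get(name)}


if __name__ == "__main__":
    bad = [" mingw-w64-clang-i686-llvm-18.1.1-3-any downloading...",
           " mingw-w64-clang-i686-openmp-18.1.1-1-any downloading..."]
    good = [" mingw-w64-clang-i686-llvm-17.0.6-7-any downloading...",
            " mingw-w64-clang-i686-openmp-17.0.6-1-any downloading..."]
    for name, (old, new) in changed_packages(bad, good).items():
        print(f"{name}: {old} -> {new}")
```

Packages that appear in only one of the two logs are reported with `None` on the missing side, so an added or dropped dependency would show up as well.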

I haven't tried CLANG32 with OpenMP disabled. If that works, then the bug is in mingw-w64-clang-i686-openmp-18.1.1-1-any.

The simplest thing for now is to just disable MINGW(CLANG32) entirely. I can re-enable it sometime in the future, once GitHub switches to a fixed MSYS2 distribution for this case.

@DrTimothyAldenDavis
Owner Author

To preserve this error, I will make a copy of the stable branch in SuiteSparse, and archive it:
https://github.com/DrTimothyAldenDavis/SuiteSparse/tree/github_CI_broke_this_branch

@mmuetzel
Contributor

mmuetzel commented Mar 22, 2024

I'm able to reproduce the errors locally in a CLANG32 build environment.
When I build without OpenMP, the number of failing tests reduces to 2:

$ ctest . --rerun-failed --output-on-failure
Test project D:/repo/SuiteSparse/SuiteSparse/build-clang32
    Start 70: LAGraphX_BF
1/2 Test #70: LAGraphX_BF ......................***Failed    0.08 sec
Test test_BF...                                 transpose     time: 0.001

==========input graph: nodes: 34 edges: 156 source node: 0
BF_full1      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a     time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full2      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full       time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full):      -nan(ind)

Matrix: karate.mtx
GrB_BOOL matrix: 34-by-34 entries: 156
    (0, 1)   1
    (0, 2)   1
    (0, 3)   1
    (0, 4)   1
    (0, 5)   1
    (0, 6)   1
    (0, 7)   1
    (0, 8)   1
    (0, 10)   1
    (0, 11)   1
    (0, 12)   1
    (0, 13)   1
    (0, 17)   1
    (0, 19)   1
    (0, 21)   1
    (0, 31)   1
    (1, 0)   1
    (1, 2)   1
    (1, 3)   1
    (1, 7)   1
    (1, 13)   1
    (1, 17)   1
    (1, 19)   1
    (1, 21)   1
    (1, 30)   1
    (2, 0)   1
    (2, 1)   1
    (2, 3)   1
    (2, 7)   1
    (2, 8)   1
    (2, 9)   1
    (2, 13)   1
    ...
nthreads 1
result: 0
nthreads 1
nthreads 1
nthreads 1
result 0
BF_basic      time: 1.000000e-03 (sec), rate: 0.156 (1e6 edges/sec)
speedup of BF_basic:       0
BF_pure_c_double  : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c:      -nan(ind)
BF_full_mxv   time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv:    -nan(ind)
BF_basic_mxv  time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv:   -nan(ind)
transpose     time: 0

==========input graph: nodes: 67 edges: 294 source node: 0
BF_full1      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a     time: 1.000000e-03 (sec), rate: 0.294 (1e6 edges/sec)
BF_full2      time: 1.000000e-03 (sec), rate: 0.294 (1e6 edges/sec)
BF_full       time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full):      -nan(ind)
pure_c integer:
[ FAILED ]
  Case karate.mtx:
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed

Matrix: west0067.mtx
GrB_FP64 matrix: 67-by-67 entries: 294
    (0, 7)   -0.834182
    (0, 12)   1.26582
    (0, 17)   -0.336156
    (1, 8)   -0.834182
    (1, 13)   1.01266
    (1, 17)   -0.29392
    (2, 9)   -0.834182
    (2, 14)   0.759494
    (2, 17)   -0.221481
    (3, 10)   -0.834182
    (3, 15)   0.506329
    (3, 17)   -0.118986
    (4, 0)   -0.278842
    (4, 1)   -0.8
    (4, 6)   0.134462
    (4, 7)   0.4
    (4, 12)   0.4
    (5, 0)   -0.268019
    (5, 2)   -0.8
    (5, 6)   0.117568
    (5, 8)   0.4
    (5, 13)   0.4
    (6, 0)   -0.232372
    (6, 3)   -0.8
    (6, 6)   0.0885926
    (6, 9)   0.4
    (6, 14)   0.4
    (7, 0)   -0.157508
    (7, 4)   -0.8
    (7, 6)   0.0475944
    (7, 10)   0.4
    (7, 15)   0.4
    ...
nthreads 1
result: 0
  Case west0067.mtx:
    test_BF.c:187: Check result == valid... failed
nthreads 1
nthreads 1
nthreads 1
result 1
BF_basic      time: 1.000000e-03 (sec), rate: 0.294 (1e6 edges/sec)
speedup of BF_basic:       0
BF_pure_c_double  : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c:      -nan(ind)
BF_full_mxv   time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv:    -nan(ind)
BF_basic_mxv  time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv:   -nan(ind)
BF_full1      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a     time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full2      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full       time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full):      -nan(ind)

-------------------------- A = abs (A)
nthreads 1
result: 0
nthreads 1
nthreads 1
nthreads 1
result 0
BF_basic      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic:       -nan(ind)
BF_pure_c_double  : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c:      -nan(ind)
BF_full_mxv   time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv:    -nan(ind)
BF_basic_mxv  time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv:   -nan(ind)
transpose     time: 0

==========input graph: nodes: 7 edges: 12 source node: 0
BF_full1      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a     time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full2      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full       time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full):      -nan(ind)
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed
    test_BF.c:399: Check di == d[i]... failed

Matrix: matrix_int8.mtx
GrB_INT8 matrix: 7-by-7 entries: 12
    (0, 1)   127
    (0, 3)   7
    (1, 4)   5
    (1, 6)   8
    (2, 5)   1
    (3, 0)   -128
    (3, 2)   0
    (4, 5)   7
    (5, 2)   5
    (6, 2)   9
    (6, 3)   1
    (6, 4)   1
nthreads 1
result: 0
  Case matrix_int8.mtx:
    test_BF.c:187: Check result == valid... failed
nthreads 1
nthreads 1
nthreads 1
result 1
BF_basic      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic:       -nan(ind)
BF_pure_c_double  : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c:      -nan(ind)
BF_full_mxv   time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv:    -nan(ind)
BF_basic_mxv  time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv:   -nan(ind)
BF_full1      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full1a     time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full2      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
BF_full       time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
t(BF_full1) / t(BF_full):      -nan(ind)
pure_c integer:

-------------------------- A = abs (A)
nthreads 1
result: 0
nthreads 1
nthreads 1
nthreads 1
result 0
BF_basic      time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic:       -nan(ind)
BF_pure_c_double  : 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_pure_c:      -nan(ind)
BF_full_mxv   time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_full_mxv:    -nan(ind)
BF_basic_mxv  time: 0.000000e+00 (sec), rate: inf (1e6 edges/sec)
speedup of BF_basic_mxv:   -nan(ind)
pure_c integer:
    test_BF.c:399: Check di == d[i]... failed
FAILED: 1 of 1 unit tests has failed.

    Start 86: LAGraphX_msf
2/2 Test #86: LAGraphX_msf .....................***Failed    0.19 sec
Test msf...
================================== A.mtx:
result: 0

msf (known result):
GrB_UINT64 matrix: 7-by-7 entries: 6
    (1, 0)   1
    (2, 0)   1
    (3, 1)   1
    (4, 1)   1
    (5, 1)   1
    (6, 0)   1
[ FAILED ]
  Case A.mtx:
    test_msf.c:115: Check ok... failed

msf:
GrB_UINT64 matrix: 7-by-7 entries: 0

================================== jagmesh7.mtx:
result: 0

msf:
GrB_UINT64 matrix: 1138-by-1138 entries: 5
    (551, 552)   1
    (670, 671)   1
    (712, 722)   1
    (733, 743)   1
    (817, 816)   1

================================== west0067.mtx:
result: 0

msf:
GrB_UINT64 matrix: 67-by-67 entries: 0

================================== bcsstk13.mtx:
result: 0

msf:
GrB_UINT64 matrix: 2003-by-2003 entries: 6
    (1554, 1559)   0
    (1556, 1561)   0
    (1742, 1747)   0
    (1744, 1748)   0
    (1831, 1833)   0
    (1932, 1934)   0

================================== karate.mtx:
result: 0

msf:
GrB_UINT64 matrix: 34-by-34 entries: 1
    (23, 27)   1

================================== ldbc-cdlp-undirected-example.mtx:
result: 0

msf:
GrB_UINT64 matrix: 8-by-8 entries: 0

================================== ldbc-undirected-example-bool.mtx:
result: 0

msf:
GrB_UINT64 matrix: 9-by-9 entries: 0

================================== ldbc-undirected-example-unweighted.mtx:
result: 0

msf:
GrB_UINT64 matrix: 9-by-9 entries: 0

================================== ldbc-undirected-example.mtx:
result: 0

msf:
GrB_UINT64 matrix: 9-by-9 entries: 0

================================== ldbc-wcc-example.mtx:
result: 0

msf:
GrB_UINT64 matrix: 10-by-10 entries: 0
Test msf_errors...                              [ OK ]
FAILED: 1 of 2 unit tests has failed.


0% tests passed, 2 tests failed out of 2

Total Test time (real) =   0.29 sec

The following tests FAILED:
         70 - LAGraphX_BF (Failed)
         86 - LAGraphX_msf (Failed)
Errors while running CTest

@mmuetzel
Contributor

I'm struggling to read the output of the failing tests.
Do they show what the expected result is and what the actual result is instead?

@DrTimothyAldenDavis
Owner Author

No, they just show that the test failed. I would need to add more printf's to do that.

I did add some to the test_BF. It showed that the expected values for some d were finite, like 1 or 2, while the computed result was +infinity, which in this case means it was missing in the result (the result vector d was supposed to be full but it was returned sparse). That's very strange, and I didn't dig any deeper once I saw that the stable code also failed in the same way.
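
The kind of extra output described above can be illustrated with a small sketch. This is not the code in test_BF.c. It assumes the expected distances are available as a dense list and the computed vector as a sparse index-to-value mapping, where a missing entry is read as +infinity. The name `report_mismatches` and the sample values are hypothetical.

```python
import math


def report_mismatches(expected, computed_sparse):
    """Compare a dense expected vector with a sparse computed one.

    expected:        list of distances, one per node.
    computed_sparse: dict mapping node index -> distance; an index with
                     no entry is treated as +infinity (unreached).
    Returns a list of (index, expected_value, computed_value) tuples.
    """
    mismatches = []
    for i, want in enumerate(expected):
        got = computed_sparse.get(i, math.inf)
        if got != want:
            mismatches.append((i, want, got))
    return mismatches


if __name__ == "__main__":
    # Hypothetical data: node 1 is missing from the computed result.
    expected = [0, 1, 2]
    computed = {0: 0, 2: 2}
    for i, want, got in report_mismatches(expected, computed):
        print(f"d[{i}]: expected {want}, computed {got}")
```

Printing the index together with both values makes it visible at a glance whether a failure is a wrong value or an entry that is missing from the result.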

@mmuetzel
Contributor

On the off-chance that this would make a difference, I tried again after MSYS2 updated to LLVM 18.1.2: Still the same failing tests in the CLANG32 environment with that version.

@DrTimothyAldenDavis
Owner Author

Thanks for checking it.

It would be difficult for me to track down the specific place in GraphBLAS where the compiler is failing, since I don't have a simple way to replicate this problem on my side. Even if I did, it would be very slow for me since I don't use Windows at all.

Hopefully a future version of LLVM will not have this problem.
