Test failures on linux aarch64 #4613

Open
the-horo opened this issue Apr 1, 2024 · 3 comments

Comments

@the-horo
Contributor

the-horo commented Apr 1, 2024

Some tests fail on my Raspberry Pi 4 running Gentoo.

The first issue is with std.internal.math.gammafunction:

ctest -R 'std.internal.math.gammafunction'
Test project /home/ldc/build
    Start  377: std.internal.math.gammafunction
1/4 Test  #377: std.internal.math.gammafunction ................***Failed    0.02 sec
    Start  812: std.internal.math.gammafunction-debug
2/4 Test  #812: std.internal.math.gammafunction-debug ..........***Failed    5.70 sec
    Start 1247: std.internal.math.gammafunction-shared
3/4 Test #1247: std.internal.math.gammafunction-shared .........***Failed    0.04 sec
    Start 1682: std.internal.math.gammafunction-debug-shared
4/4 Test #1682: std.internal.math.gammafunction-debug-shared ...***Failed    0.07 sec

This is because https://github.com/dlang/phobos/blob/14b23633b762cfd7b03614dca4c6b0cafa1016e5/std/internal/math/gammafunction.d#L396 contains real.mant_dig, which is 64 on my PC but 113 on the Pi. I guess the solution is to replace it with a constant that doesn't depend on real, since that type varies by platform, but I'm not 100% sure what to do.
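To make the platform dependence concrete, here is a minimal sketch that reports which floating-point format sits behind real on the target. realFormat is a hypothetical helper written for this illustration, not something from Phobos, and the three formats it names are simply the common cases (53, 64 and 113 mantissa bits).

```d
// Hypothetical helper (not part of Phobos): names the floating-point format
// behind `real` on the target, selected at compile time from real.mant_dig.
string realFormat() @safe @nogc pure nothrow
{
    static if (real.mant_dig == 53)
        return "IEEE double (64-bit)";
    else static if (real.mant_dig == 64)
        return "x87 extended (80-bit)";
    else static if (real.mant_dig == 113)
        return "IEEE quadruple (128-bit)";
    else
        return "other";
}

// Printed by the compiler while building, so no program run is needed.
pragma(msg, "real.mant_dig = ", real.mant_dig, " -> ", realFormat());

unittest
{
    // Whatever the platform, `real` is at least as precise as `double`.
    static assert(real.mant_dig >= double.mant_dig);
    assert(realFormat().length > 0);
}
```

Compiling this with ldc2 -unittest -main on both machines should show the 64 versus 113 difference described above. Code that hard-codes thresholds tuned for one of these formats is the kind of thing that can pass on x86_64 and fail on aarch64.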

The second issue is with std.math.exponential:

ctest -R 'std.math.exponential'
Test project /home/ldc/build
    Start  397: std.math.exponential
1/4 Test  #397: std.math.exponential ................***Failed    0.02 sec
    Start  832: std.math.exponential-debug
2/4 Test  #832: std.math.exponential-debug ..........   Passed    0.03 sec
    Start 1267: std.math.exponential-shared
3/4 Test #1267: std.math.exponential-shared .........***Failed    0.04 sec
    Start 1702: std.math.exponential-debug-shared
4/4 Test #1702: std.math.exponential-debug-shared ...   Passed    0.07 sec

This is weird, since only the release builds fail. I've tried to minimize the test case to:

real pow(real x, real y) @trusted @nogc pure nothrow
{
        long iy = cast(long) y;
        //assert(iy != y);
        if (iy == y) {
                assert(false);
        }
        assert(false);
}

void main () {
        import std.math.traits : isNaN;
        assert(isNaN(pow(-1.0L, 1/real.epsilon - 0.5L)));
}

The problem with this one is that the assert fails on different lines depending on the optimization level:

../bin/ldc2 -O -run repro.d
core.exception.AssertError@repro.d(6): Assertion failure
----------------
??:? [0x556070c0bf]
??:? [0x556070bd0b]
??:? [0x556072f28f]
??:? [0x5560711b6b]
??:? [0x556070a9d7]
??:? [0x5560709faf]
??:? [0x5560711837]
??:? [0x556071171b]
??:? [0x5560711587]
??:? [0x7fbd44738b]
??:? __libc_start_main [0x7fbd44745f]
??:? [0x5560709eaf]
Error: /tmp/repro-d8ac08 failed with status: 1

and

./bin/ldc2 -run repro.d
core.exception.AssertError@repro.d(8): Assertion failure
----------------
??:? [0x557b4cc1db]
??:? [0x557b4cbe27]
??:? [0x557b4ef3ab]
??:? [0x557b4d1c87]
??:? [0x557b4caaf3]
??:? [0x557b4ca027]
??:? [0x557b4ca043]
??:? [0x557b4d1953]
??:? [0x557b4d1837]
??:? [0x557b4d16a3]
??:? [0x557b4ca0cb]
??:? [0x7f9668738b]
??:? __libc_start_main [0x7f9668745f]
??:? [0x557b4c9eaf]
Error: /tmp/repro-1c99cf failed with status: 1

It's possible that bad code is generated. Another observation, though I don't know how helpful it is: uncommenting the //assert line makes the optimized build skip the if and fail at the final assert, like the unoptimized build.
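One possible reading of the minimized case, not confirmed anywhere in this thread: where real has 113 mantissa bits, 1/real.epsilon - 0.5L lies far outside the range of long, so the comparison cannot rely on the result of cast(long) y. LLVM's fptosi instruction, which the cast is assumed to lower to, yields a poison value for out-of-range inputs. Below is a minimal sketch of a range check performed before the cast. isExactLong is a hypothetical helper, not Phobos code and not a proposed patch.

```d
// Hypothetical helper, not the Phobos implementation: reports whether `y`
// holds an integral value that fits in a long. The range is checked *before*
// converting, so the cast is never applied to an out-of-range value.
bool isExactLong(real y) @safe @nogc pure nothrow
{
    // 2^63 is a power of two, so it is exact in every binary format.
    enum real limit = 0x1p63L;

    // Written in the negated form so that NaN is rejected as well.
    if (!(y >= -limit && y < limit))
        return false;

    return cast(real) cast(long) y == y;
}

unittest
{
    assert(isExactLong(3.0L));
    assert(!isExactLong(3.5L));
    assert(!isExactLong(real.nan));
    // The exponent used in the failing test: never an exact long,
    // whether `real` has 53, 64 or 113 mantissa bits.
    assert(!isExactLong(1 / real.epsilon - 0.5L));
}
```

With a guard of this shape the result no longer depends on what the conversion produces for out-of-range inputs, which is one way to test whether the cast is what differs between the optimized and unoptimized builds.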

The last problem I had was with core.thread.fiber:

ctest --timeout 10 -R 'core.thread.fiber'
Test project /home/ldc/build
    Start  130: core.thread.fiber
1/4 Test  #130: core.thread.fiber ................***Timeout  10.03 sec
    Start  565: core.thread.fiber-debug
2/4 Test  #565: core.thread.fiber-debug ..........   Passed    0.26 sec
    Start 1000: core.thread.fiber-shared
3/4 Test #1000: core.thread.fiber-shared .........***Timeout  10.02 sec
    Start 1435: core.thread.fiber-debug-shared
4/4 Test #1435: core.thread.fiber-debug-shared ...   Passed    0.29 sec

The release builds hang. Running the tests directly I get:

./runtime/druntime-test-runner core.thread.fiber
Not safe to migrate Fibers between Threads on your system. Consider setting version CheckFiberMigration for this system in thread.d
****** FAIL release64 core.thread.fiber
core.exception.AssertError@core/thread/fiber.d(1078): HOLD != EXEC
----------------
??:? [0x559245bb43]
??:? [0x559245b527]
??:? [0x55924ca597]
??:? [0x55924cbc3b]
??:? [0x55923af9e3]
??:? [0x559247070b]
??:? [0x559246c77f]
??:? [0x55924ef213]
^C
./runtime/druntime-test-runner-shared core.thread.fiber
Not safe to migrate Fibers between Threads on your system. Consider setting version CheckFiberMigration for this system in thread.d
****** FAIL release64 core.thread.fiber
core.exception.AssertError@core/thread/fiber.d(1077): Fiber.yield() called with no active fiber
----------------
??:? _d_assert_msg [0x7fbf10bbc3]
??:? void core.thread.fiber.TestFiber.run() [0x7fbf1cc31f]
??:? fiber_entryPoint [0x7fbf1c83bb]
??:? [0x7fbf24ca23]
^C

Enabling CheckFiberMigration doesn't solve it:

./runtime/druntime-test-runner-shared core.thread.fiber
****** FAIL release64 core.thread.fiber
core.exception.AssertError@core/thread/fiber.d(1078): Fiber.yield() called with no active fiber
----------------
??:? _d_assert_msg [0x7f8555bbe3]
??:? void core.thread.fiber.TestFiber.run() [0x7f8561c3cb]
??:? fiber_entryPoint [0x7f856183db]
??:? [0x7f8569cb43]
^C

Even better, I also get segmentation faults sometimes:

./runtime/druntime-test-runner-shared core.thread.fiber
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(_D4core7runtime18runModuleUnitTestsUZ19unittestSegvHandlerUNbNiiPSQCm3sys5posix6signal9siginfo_tPvZv+0x24)[0x7f81537500]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x7f8167d7a8]
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(_D4core6thread5fiber5Fiber9switchOutMFNbNiZv+0x1c)[0x7f8154b458]
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(_D4core6thread5fiber9TestFiber3runMFZv+0x5c)[0x7f8154c38c]
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(fiber_entryPoint+0x68)[0x7f815483dc]
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(+0x1dcb44)[0x7f815ccb44]
Segmentation fault (core dumped)

I have no idea how to approach fixing this one.
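For readers unfamiliar with the states named in the assertion messages above (HOLD, EXEC), here is a minimal single-threaded sketch of the Fiber life cycle using the public core.thread API. It does not reproduce the failure; it only shows what the states are expected to be when everything works.

```d
import core.thread : Fiber;

unittest
{
    int steps;

    auto fib = new Fiber({
        // While its body is running, a fiber reports EXEC.
        assert(Fiber.getThis().state == Fiber.State.EXEC);
        steps = 1;
        Fiber.yield();   // suspend and hand control back to the caller
        steps = 2;
    });

    // A freshly created fiber is on HOLD until it is called.
    assert(fib.state == Fiber.State.HOLD);

    fib.call();          // runs until the yield
    assert(steps == 1 && fib.state == Fiber.State.HOLD);

    fib.call();          // resumes after the yield and runs to the end
    assert(steps == 2 && fib.state == Fiber.State.TERM);
}
```

The "HOLD != EXEC" and "Fiber.yield() called with no active fiber" messages in the logs suggest that, in the failing runs, code observed a fiber in a state other than the one this sequence predicts.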

@kinke
Member

kinke commented Apr 1, 2024

This is a subset of ldc/.cirrus.yml, lines 56 to 64 in e170ca5:

elif [[ "$CI_OS-$CI_ARCH" == "linux-aarch64" ]]; then
  # FIXME: segfaults with enabled optimizations
  excludes+='|^core.thread.fiber(-shared)?$'
  # FIXME: failing unittest(s)
  excludes+='|^std.internal.math.gammafunction'
  # FIXME: failing unittest(s) with enabled optimizations
  excludes+='|^std.math.exponential(-shared)?$'
  # FIXME: failure
  excludes+='|^druntime-test-exceptions-debug$'

(It's been a while since I checked whether there have been any improvements.) That list is more up-to-date than #2153 (comment) from the AArch64 tracker issue.

IIRC, the math issues boil down to 2 problems - one being slightly incomplete 128-bit quadruple-precision real support in upstream Phobos, and something special wrt. optimized code and NaNs on AArch64 (not preserving the NaN payload or something like that).
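As background for the NaN payload remark, here is a small sketch of what a payload is, using the NaN and getNaNPayload functions from std.math. Whether payload handling is actually involved in the aarch64 failures is not established here. The sketch deliberately makes no claim about what happens to the payload after arithmetic.

```d
import std.math : NaN, getNaNPayload, isNaN;

unittest
{
    // Build a quiet NaN that carries an integer payload in its mantissa.
    real a = NaN(1_000_000);
    assert(isNaN(a));
    assert(getNaNPayload(a) == 1_000_000);

    // Arithmetic on a NaN yields a NaN. Whether the payload survives is
    // up to the hardware and the optimizer, so it is not asserted here.
    real b = a + 1.0L;
    assert(isNaN(b));
}
```

A unittest that checks getNaNPayload on the result of a computation, rather than just isNaN, is the kind of test that could be sensitive to such differences.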

The sporadic core.thread.fiber failures with enabled optimizations happen on macOS arm64 too, contrary to the math issues (Apple uses 64-bit real, on AArch64 too).

@the-horo
Contributor Author

the-horo commented Apr 2, 2024

Thanks for the new links, I should have looked a bit harder before opening the issue.

> IIRC, the math issues boil down to 2 problems - one being slightly incomplete 128-bit quadruple-precision real support in upstream Phobos, and something special wrt. optimized code and NaNs on AArch64 (not preserving the NaN payload or something like that).

The std.math.exponential bug doesn't look like it involves NaNs. The failure is caused by that if statement not being skipped, even though the value of y is 5.1923e+33, which shouldn't be representable as a ulong. I don't know how floating-point numbers work in assembly though, much less on aarch64, so I won't try to dissect this issue further.
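The 5.1923e+33 figure can be checked at compile time. A minimal sketch follows, assuming the usual binary formats where real.epsilon is 2^(1 - mant_dig). The constant y mirrors the exponent from the repro, and nothing is asserted on platforms with other mantissa widths.

```d
// The value the failing unittest passes as the exponent.
enum real y = 1 / real.epsilon - 0.5L;

static if (real.mant_dig == 113)
{
    // Quadruple precision: epsilon is 2^-112, so 1/epsilon is 2^112,
    // which is about 5.1923e33.
    static assert(1 / real.epsilon == 0x1p112L);
    // Far beyond 2^63, so it fits neither long nor ulong.
    static assert(y >= 0x1p63L);
}
else static if (real.mant_dig == 64)
{
    // x87 extended precision: epsilon is 2^-63, so y is 2^63 - 0.5.
    static assert(1 / real.epsilon == 0x1p63L);
    // Truncating y gives 2^63 - 1, which is still a valid long.
    static assert(y < 0x1p63L);
}
```

So the same source expression is in range for long with an 80-bit real and out of range with a 128-bit real, which would be consistent with the test passing on x86_64.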

One last thing: should CheckFiberMigration be set for all aarch64 systems, not just Apple, since I was getting that warning? Or is it safe to ignore?

@JohanEngelen
Member

FYI it is this unittest in core.thread.fiber that is causing trouble on AArch64 (both macOS, and linux-musl):

// Multiple threads running shared fibers
unittest
{
    shared bool[10] locks;
    TestFiber[10] fibs;

    void runShared()
    {
        bool cont;
        do {
            cont = false;
            foreach (idx; 0 .. 10)
            {
                if (cas(&locks[idx], false, true))
                {
                    if (fibs[idx].state == Fiber.State.HOLD)
                    {
                        fibs[idx].call();
                        cont |= fibs[idx].state != Fiber.State.TERM;
                    }
                    locks[idx] = false;
                }
                else
                {
                    cont = true;
                }
            }
        } while (cont);
    }

    foreach (ref fib; fibs)
    {
        fib = new TestFiber();
        fib.allowMigration();
    }

    auto group = new ThreadGroup();
    foreach (_; 0 .. 4)
    {
        group.create(&runShared);
    }
    group.joinAll();

    foreach (fib; fibs)
    {
        assert(fib.sum == TestFiber.expSum);
    }
}

    version(AArch64)
        return;

Adding the above fixes things for me (on both macOS and linux-musl).
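A minimal sketch of how an architecture-specific skip of this kind can be written with D's predefined AArch64 version identifier. The comment does not say exactly where the return was added, so the placement is an assumption, and runSharedFiberTest is a hypothetical stand-in for the real test body. Putting the body in the else branch avoids having statements after an unconditional return.

```d
// Hypothetical stand-in for the unittest body: returns false when the
// work is skipped on AArch64 and true when it would run.
bool runSharedFiberTest()
{
    version (AArch64)
    {
        // Assumed placement: bail out before any fiber or thread is created.
        return false;
    }
    else
    {
        // ... the multi-threaded fiber test body would run here ...
        return true;
    }
}

unittest
{
    version (AArch64)
        assert(!runSharedFiberTest());
    else
        assert(runSharedFiberTest());
}
```

Note that a skip like this hides the failure rather than explaining it, much like the excludes in the CI configuration quoted earlier.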
