ppc64el obj_basic_integration/TEST5 crashed on valgrind (debian + ubuntu; ppc64el) #6079

Open
bryceharrington opened this issue Apr 16, 2024 · 5 comments
Labels: libpmemobj (src/libpmemobj), ppc64 (experimental), Type: Bug (A previously unknown bug in PMDK)

bryceharrington commented Apr 16, 2024

ISSUE: ppc64el obj_basic_integration/TEST5 crashed on valgrind (debian + ubuntu; ppc64el)

Environment Information

  • PMDK package version(s): 1.13.1 (1.13.1-1.1)
  • OS(es) version(s): Debian, and Ubuntu
  • ndctl version(s): 77 (77-2+b1)
  • kernel version(s): 5.10.0 (5.10.0-28-powerpc64le)
  • binutils_2.42-2 dpkg-dev_1.22.4
  • g++-13_13.2.0-13
  • gcc-13_13.2.0-13
  • libc6-dev_2.37-15
  • libstdc++-13-dev_13.2.0-13
  • libstdc++6_14-20240201-3
  • linux-libc-dev_6.6.15-2

Please provide a reproduction of the bug:

PMDK is failing to build on the ppc64el architecture in both Debian and Ubuntu, although it built successfully there as recently as a few months ago. My guess is that the failure started appearing after a rebuild against a newer linux-libc-dev.

How often bug is revealed: (always, often, rare): always

Actual Behavior

In Debian:
https://buildd.debian.org/status/fetch.php?pkg=pmdk&arch=ppc64el&ver=1.13.1-1.1%2Bb1&stamp=1708597682&raw=0
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1064559

In Ubuntu:
https://launchpadlibrarian.net/724116691/buildlog_ubuntu-noble-ppc64el.pmdk_1.13.1-1.1build1_BUILDING.txt.gz
https://launchpadlibrarian.net/724821331/buildlog_ubuntu-noble-ppc64el.pmdk_1.13.1-1.1build2_BUILDING.txt.gz
https://bugs.launchpad.net/ubuntu/+source/pmdk/+bug/2061913

Details

obj_basic_integration/TEST5 crashed (signal 4). err5.log below.
{ut_backtrace.c:175 ut_sighandler} obj_basic_integration/TEST5:

{ut_backtrace.c:176 ut_sighandler} obj_basic_integration/TEST5: Signal 4, backtrace:
{ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 0: ./obj_basic_integration(+0xc9f8) [0x18c9f8]
{ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 1: ./obj_basic_integration(+0xcb8c) [0x18cb8c]
{ut_backtrace.c:178 ut_sighandler} obj_basic_integration/TEST5:

err5.log below.
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:175 ut_sighandler} obj_basic_integration/TEST5:
obj_basic_integration/TEST5 err5.log
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:176 ut_sighandler} obj_basic_integration/TEST5: Signal 4, backtrace:
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 0: ./obj_basic_integration(+0xc9f8) [0x18c9f8]
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:120 ut_dump_backtrace} obj_basic_integration/TEST5: 1: ./obj_basic_integration(+0xcb8c) [0x18cb8c]
obj_basic_integration/TEST5 err5.log {ut_backtrace.c:178 ut_sighandler} obj_basic_integration/TEST5:
obj_basic_integration/TEST5 err5.log

Last 30 lines of memcheck5.log below (whole file has 48 lines).
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x4915EB7: util_pool_create_uuids (set.c:2521)
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x49160FB: util_pool_create (set.c:2563)
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x4941183: pmemobj_createU (obj.c:1164)
obj_basic_integration/TEST5 memcheck5.log ==89952== by 0x4941643: pmemobj_create (obj.c:1244)
obj_basic_integration/TEST5 memcheck5.log ==89952== Your program just tried to execute an instruction that Valgrind
obj_basic_integration/TEST5 memcheck5.log ==89952== did not recognise. There are two possible reasons for this.
obj_basic_integration/TEST5 memcheck5.log ==89952== 1. Your program has a bug and erroneously jumped to a non-code
obj_basic_integration/TEST5 memcheck5.log ==89952== location. If you are running Memcheck and you just saw a
obj_basic_integration/TEST5 memcheck5.log ==89952== warning about a bad jump, it's probably your program's fault.
obj_basic_integration/TEST5 memcheck5.log ==89952== 2. The instruction is legitimate but Valgrind doesn't handle it,
obj_basic_integration/TEST5 memcheck5.log ==89952== i.e. it's Valgrind's fault. If you think this is the case or
obj_basic_integration/TEST5 memcheck5.log ==89952== you are not sure, please let us know and we'll try to fix it.
obj_basic_integration/TEST5 memcheck5.log ==89952== Either way, Valgrind will now raise a SIGILL signal which will
obj_basic_integration/TEST5 memcheck5.log ==89952== probably kill your program.
obj_basic_integration/TEST5 memcheck5.log ==89952==
obj_basic_integration/TEST5 memcheck5.log ==89952== HEAP SUMMARY:
obj_basic_integration/TEST5 memcheck5.log ==89952== in use at exit: 3,172 bytes in 39 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== total heap usage: 193 allocs, 154 frees, 433,659 bytes allocated
obj_basic_integration/TEST5 memcheck5.log ==89952==
obj_basic_integration/TEST5 memcheck5.log ==89952== LEAK SUMMARY:
obj_basic_integration/TEST5 memcheck5.log ==89952== definitely lost: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== indirectly lost: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== possibly lost: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== still reachable: 3,172 bytes in 39 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== suppressed: 0 bytes in 0 blocks
obj_basic_integration/TEST5 memcheck5.log ==89952== Reachable blocks (those to which a pointer was found) are not shown.
obj_basic_integration/TEST5 memcheck5.log ==89952== To see them, rerun with: --leak-check=full --show-leak-kinds=all
obj_basic_integration/TEST5 memcheck5.log ==89952==
obj_basic_integration/TEST5 memcheck5.log ==89952== For lists of detected and suppressed errors, rerun with: -s
obj_basic_integration/TEST5 memcheck5.log ==89952== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

There are also some instances of valgrind crashes:

pmempool_feature/TEST4: SETUP (check/pmem/debug/memcheck)
../unittest/unittest.sh: line 747: 1396902 Illegal instruction /usr/bin/valgrind --tool=memcheck --log-file=memcheck4.log --suppressions=../memcheck-dlopen.supp --suppressions=../memcheck-dlopen.supp --leak-check=full --suppressions=../ld.supp --suppressions=../memcheck-libunwind.supp --suppressions=../memcheck-ndctl.supp ../../tools/pmempool/pmempool feature -d SHUTDOWN_STATE /tmp//test_pmempool_feature4😘⠏⠍⠙⠅ɗPMDKӜ⥺🙋/testset &>> grep4.log
pmempool_feature/TEST4 crashed (signal 4).
grep4.log below.

RUNTESTS: stopping: pmempool_feature/TEST4 failed, TEST=check FS=any BUILD=debug
pmempool_feature/TEST5: SETUP (check/pmem/debug/memcheck)
../unittest/unittest.sh: line 747: 1397154 Illegal instruction /usr/bin/valgrind --tool=memcheck --log-file=memcheck5.log --suppressions=../memcheck-dlopen.supp --suppressions=../memcheck-dlopen.supp --leak-check=full --suppressions=../ld.supp --suppressions=../memcheck-libunwind.supp --suppressions=../memcheck-ndctl.supp ../../tools/pmempool/pmempool feature -d SHUTDOWN_STATE /tmp//test_pmempool_feature5😘⠏⠍⠙⠅ɗPMDKӜ⥺🙋/testset &>> grep5.log
pmempool_feature/TEST5 crashed (signal 4).
grep5.log below.
pmempool_feature/TEST5 grep5.log query SHUTDOWN_STATE result is 1

1

Last 30 lines of memcheck5.log below (whole file has 65 lines).
pmempool_feature/TEST5 memcheck5.log ==1397154== Illegal opcode at address 0x4B59240
pmempool_feature/TEST5 memcheck5.log ==1397154== at 0x4B59240: ppc_flush (init.c:53)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x4B519C7: pmem_flush (pmem.c:229)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x4B51A6B: pmem_persist (pmem.c:240)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492CA93: util_persist (util_pmem.h:27)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492CBA7: util_persist_auto (util_pmem.h:40)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492DDC3: set_hdr (feature.c:256)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492E143: feature_set (feature.c:325)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492E967: disable_shutdown_state (feature.c:500)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492EF2F: pmempool_feature_disableU (feature.c:662)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x492F1AB: pmempool_feature_disable (feature.c:738)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x196897: feature_perform (feature.c:110)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x196897: pmempool_feature_func (feature.c:206)
pmempool_feature/TEST5 memcheck5.log ==1397154== by 0x18A45B: main (pmempool.c:271)
pmempool_feature/TEST5 memcheck5.log ==1397154==
pmempool_feature/TEST5 memcheck5.log ==1397154== HEAP SUMMARY:
pmempool_feature/TEST5 memcheck5.log ==1397154== in use at exit: 52,839 bytes in 21 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== total heap usage: 64 allocs, 43 frees, 108,953 bytes allocated
pmempool_feature/TEST5 memcheck5.log ==1397154==
pmempool_feature/TEST5 memcheck5.log ==1397154== LEAK SUMMARY:
pmempool_feature/TEST5 memcheck5.log ==1397154== definitely lost: 0 bytes in 0 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== indirectly lost: 0 bytes in 0 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== possibly lost: 0 bytes in 0 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== still reachable: 50,479 bytes in 16 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== suppressed: 2,360 bytes in 5 blocks
pmempool_feature/TEST5 memcheck5.log ==1397154== Reachable blocks (those to which a pointer was found) are not shown.
pmempool_feature/TEST5 memcheck5.log ==1397154== To see them, rerun with: --leak-check=full --show-leak-kinds=all
pmempool_feature/TEST5 memcheck5.log ==1397154==
pmempool_feature/TEST5 memcheck5.log ==1397154== For lists of detected and suppressed errors, rerun with: -s
pmempool_feature/TEST5 memcheck5.log ==1397154== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

bryceharrington added the "Type: Bug" (A previously unknown bug in PMDK) label Apr 16, 2024
janekmi self-assigned this Apr 22, 2024
janekmi added the "ppc64 (experimental)" and "libpmemobj" (src/libpmemobj) labels Apr 22, 2024

janekmi (Contributor) commented Apr 22, 2024

Hi, thanks for the report. Sadly, we do not support ppc64. But since you suggest the issue might be related to the latest linux-libc-dev, the question is: have you tried to build it on amd64 and using the same software components?

pbalcer (Member) commented Apr 22, 2024

This is either a glibc or valgrind issue. See this message from the attached log:

Your program just tried to execute an instruction that Valgrind
did not recognise. There are two possible reasons for this.

  1. Your program has a bug and erroneously jumped to a non-code
    location. If you are running Memcheck and you just saw a
    warning about a bad jump, it's probably your program's fault.
  2. The instruction is legitimate but Valgrind doesn't handle it,
    i.e. it's Valgrind's fault. If you think this is the case or
    you are not sure, please let us know and we'll try to fix it.

Either way, Valgrind will now raise a SIGILL signal which will
probably kill your program.

Illegal opcode at address 0x4B59240

The instruction in question is this one: https://github.com/pmem/pmdk/blob/master/src/libpmem2/ppc64/init.c#L53
So it's most likely the latter, i.e. an instruction that Valgrind doesn't handle. It's odd that it showed up only after updating libc, though. Maybe there's now some other instruction there?
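Mapping the "Illegal opcode" report back to a source line, as done above, can be automated when many memcheck logs need triaging. The following is a minimal sketch, assuming the log lines look like the ones pasted in this issue (a `==PID==` prefix followed by `at`/`by` frames); the function name `find_illegal_opcode` and the `SAMPLE` text are illustrative and not part of PMDK or valgrind.

```python
import re

# Patterns modelled on the memcheck lines quoted in this issue.
ILLEGAL_RE = re.compile(r"==\d+==\s+Illegal opcode at address (0x[0-9A-Fa-f]+)")
FRAME_RE = re.compile(r"==\d+==\s+(?:at|by) (0x[0-9A-Fa-f]+): (\S+) \(([^)]+)\)")


def find_illegal_opcode(log_text):
    """Return (address, frames) for the first 'Illegal opcode' report, or None.

    frames is a list of (function, location) tuples, innermost frame first.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        match = ILLEGAL_RE.search(line)
        if not match:
            continue
        frames = []
        # Collect the stack frames that directly follow the report.
        for follow in lines[i + 1:]:
            frame = FRAME_RE.search(follow)
            if not frame:
                break
            frames.append((frame.group(2), frame.group(3)))
        return match.group(1), frames
    return None


# Excerpt copied from the memcheck5.log output earlier in this issue.
SAMPLE = """\
==1397154== Illegal opcode at address 0x4B59240
==1397154== at 0x4B59240: ppc_flush (init.c:53)
==1397154== by 0x4B519C7: pmem_flush (pmem.c:229)
==1397154== by 0x4B51A6B: pmem_persist (pmem.c:240)
==1397154==
==1397154== HEAP SUMMARY:
"""

if __name__ == "__main__":
    address, frames = find_illegal_opcode(SAMPLE)
    print(address, frames[0])
```

The innermost frame is the one to look up in the source tree; here it points at `ppc_flush` in `init.c`, line 53, which matches the link above.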

As @janekmi mentioned, Intel does not provide support for the PPC backend of PMDK. See this README section for details.

bryceharrington (Author) commented Apr 26, 2024

Thanks for pointing to __DCBF as the likely instruction causing the issue.

> have you tried to build it on amd64 and using the same software components?

Indeed; we build for a number of architectures for Ubuntu, and the failure on ppc64el was holding back those updated builds of pmdk in Ubuntu 24.04 LTS. Only ppc64el hit this particular issue. We were undertaking a mass rebuild of the entire archive for some distro-wide security fixes and performance improvements, and I think this might have been the first time pmdk got a rebuild since a new libc was introduced, which is why I suspected that. Debian hit the issue earlier than us, but also updated libc before us.

And as I mentioned above, we could not ascertain if it is down to one cause or several. My gut says there may be additional missing instructions, but I did not acquire proof one way or the other. We also did not determine whether the tests were identifying "real" problems that users would run into, or were simply test suite strictness (your advice/opinion on this point would be valued).

We also noted your documented limitations on support for this platform in the README (thanks for having that officially in writing), and took that into account as well in determining what to do on our end. We also are constrained in hardware access for this architecture for debugging purposes, as well as time and know-how limitations. We considered dropping support for the architecture ourselves for pmdk, but worried that would simply move the problem to dependencies, and instead have disabled the testsuite in our CI for ppc64el and listed it as a Known Issue in the 24.04 release notes.

Ideally, we'd like to supply a stronger resolution to this going forward (especially if this will regress pmdk ppc64el users), even if it means dropping the architecture as supported in Ubuntu. If you have no inclination to investigate, that is probably the right long-term solution here. However, if it is something you do want to investigate further, we would be happy to collaborate, just with the caveat that our ability to test/debug/develop on this arch is constrained.

pbalcer (Member) commented Apr 29, 2024

Is your build system using PMDK's fork of valgrind (https://github.com/pmem/valgrind)? If it does, then it's possible that a new libc version is issuing instructions that the forked valgrind does not support. So that'd be an issue on PMDK's side. The fix is simply to rebase the valgrind fork to the latest upstream version. This is a problem we've encountered a few times in the past.

If it isn't, then the bug is in upstream valgrind, which should add support for the necessary instructions (so upstreaming parts of this patch). But if that's the case, how did this work before?

> We also did not determine whether the tests were identifying "real" problems that users would run into, or were simply test suite strictness

The primary use of valgrind in PMDK's test suite is to verify the correctness of its algorithms. However, end users may still encounter issues such as the one you've reported if they themselves run applications linked with libpmem under valgrind, if the valgrind version they are using does not support all the necessary instructions.

> We also are constrained in hardware access for this architecture for debugging purposes, as well as time and know-how limitations.

PMDK's CI environment does not include any PPC system at this point in time. Given the state of the project, we are unlikely to invest to acquire one.

> even if it means dropping the architecture as supported in Ubuntu

That would be my recommendation. For all intents and purposes, upstream PMDK does not offer non-experimental support for platforms other than x86-64. However, simply disabling valgrind checks in the CI is also a reasonable option. 99.9% of the code is shared between all platforms (https://github.com/pmem/pmdk/tree/master/src/libpmem2/ppc64 this directory contains most of what differs, it's all fairly simple). So all the core algorithms are tested regardless on x86 builds.
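The option of disabling valgrind checks on one architecture could look roughly like the sketch below. It is only an illustration of the idea: the helper `should_run_valgrind` and the skip set are hypothetical and are not part of PMDK's test framework, and it assumes that `platform.machine()` reports `ppc64le` on the affected builders.

```python
import platform

# Hypothetical skip list for illustration; not taken from PMDK's test scripts.
VALGRIND_SKIP_MACHINES = {"ppc64le", "ppc64"}


def should_run_valgrind(machine=None):
    """Return False on machines where valgrind-based tests should be skipped.

    machine defaults to the value reported by platform.machine().
    """
    machine = (machine or platform.machine()).lower()
    return machine not in VALGRIND_SKIP_MACHINES


if __name__ == "__main__":
    print("run valgrind tests:", should_run_valgrind())
```

Because most of the code is shared between platforms, the valgrind-checked runs on x86-64 would still exercise the core algorithms while the non-valgrind tests continue to run on ppc64el.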

> if it is something you do want to investigate further, we would be happy to collaborate, just with the caveat that our ability to test/debug/develop on this arch is constrained

PMDK maintenance is currently done almost exclusively by Intel, and with very limited resources. We can help to some small extent, but ultimately you might want to reach out to IBM to ask whether having official PMDK packages in Ubuntu for their platforms is something they still care about.

bryceharrington (Author) commented

> Is your build system using PMDK's fork of valgrind (https://github.com/pmem/valgrind)? If it does, then it's possible that a new libc version is issuing instructions that the forked valgrind does not support. So that'd be an issue on PMDK's side. The fix is simply to rebase the valgrind fork to the latest upstream version. This is a problem we've encountered a few times in the past.

It doesn't look like it to me, although I do note the presence of the valgrind headers in src/core. The build dependency only requires valgrind 3.15 or newer. Both pmdk and valgrind were rebuilt in the archives within the last month, so I doubt this is simply an ABI compatibility issue, particularly given that it occurs on only one architecture.
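For reference, checking an installed valgrind against the 3.15 minimum mentioned above might be sketched as follows. This assumes the version string has the usual `valgrind-X.Y.Z` shape printed by `valgrind --version`; forks or distro builds may append suffixes, and the helper names here are illustrative only.

```python
import re

MIN_VERSION = (3, 15)  # minimum taken from the build dependency mentioned above


def parse_valgrind_version(text):
    """Extract (major, minor, patch) from text such as 'valgrind-3.22.0'."""
    match = re.search(r"valgrind-(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return int(match.group(1)), int(match.group(2)), int(match.group(3) or 0)


def meets_minimum(text, minimum=MIN_VERSION):
    """Compare numerically, so that 3.22 is correctly treated as newer than 3.15."""
    return parse_valgrind_version(text)[:len(minimum)] >= tuple(minimum)


if __name__ == "__main__":
    print(meets_minimum("valgrind-3.22.0"))
```

In practice the input could come from `subprocess.run(["valgrind", "--version"], capture_output=True, text=True).stdout`. Meeting the minimum version says nothing about whether a given build recognises every instruction emitted on ppc64el, which is the open question in this thread.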

> > We also did not determine whether the tests were identifying "real" problems that users would run into, or were simply test suite strictness
>
> The primary use of valgrind in PMDK's test suite is to verify the correctness of its algorithms. However, end users may still encounter issues such as the one you've reported if they themselves run applications linked with libpmem under valgrind, if the valgrind version they are using does not support all the necessary instructions.

That's good to note, thanks. If potential issues would be limited to use of valgrind, then anyone developing on ppc64el in Ubuntu 24.04 would presumably have some options to work around the issue; hopefully for those cases the 24.04 release notes will be enough of a clue. If reports of tangible problems affecting users crop up, we can re-evaluate, but for now it sounds like we should continue to provide the package on this architecture in the hope that it helps more than it harms.

> > even if it means dropping the architecture as supported in Ubuntu
>
> That would be my recommendation. For all intents and purposes, upstream PMDK does not offer non-experimental support for platforms other than x86-64. However, simply disabling valgrind checks in the CI is also a reasonable option. 99.9% of the code is shared between all platforms (https://github.com/pmem/pmdk/tree/master/src/libpmem2/ppc64 this directory contains most of what differs, it's all fairly simple). So all the core algorithms are tested regardless on x86 builds.
>
> > if it is something you do want to investigate further, we would be happy to collaborate, just with the caveat that our ability to test/debug/develop on this arch is constrained
>
> PMDK maintenance is currently done almost exclusively by Intel, and with very limited resources. We can help to some small extent, but ultimately you might want to reach out to IBM to ask whether having official PMDK packages in Ubuntu for their platforms is something they still care about.

That's a good suggestion, we'll reach out to our contacts before deciding what to do on 24.10 and going forward. Thanks.


3 participants