Segmentation fault when assigning Timer #4369

NickJT opened this issue Aug 1, 2023 · 41 comments
Labels
  • bug: Something isn't working
  • needs investigation: This needs to be looked into before its "ready for work"
  • triggers release: Major issue that when fixed, results in an "emergency" release

Comments

@NickJT

NickJT commented Aug 1, 2023

Running the following code after compiling in debug mode results in a segfault. Running the same code after compiling in release mode works as expected.

Environment:

  • Linux on Intel® Core™ i5-6500T
  • Fedora Linux 38 (Workstation Edition)

Ponyc version:

  • 0.54.1-c41393ce [release]
  • Compiled with: LLVM 15.0.7 -- Clang-16.0.3-x86_64

Zulip topic: https://ponylang.zulipchat.com/#narrow/stream/189985-beginner-help/topic/segfault.20with.20-d

Original code below copied from https://github.com/ponylang/ponyc/blob/main/examples/timers/timers.pony

use "time"

class TimerPrint is TimerNotify
  var _env: Env
  var _count: U64 = 0

  new iso create(env: Env) =>
    _env = env

  fun ref apply(timer: Timer, count: U64): Bool =>
    _count = _count + count
    _env.out.print("timer: " + _count.string())
    _count < 10

  fun ref cancel(timer: Timer) =>
    _env.out.print("timer cancelled")

actor Main
  new create(env: Env) =>
    let timers = Timers

    let t1 = Timer(TimerPrint(env), 500000000, 500000000) // 500 ms

    // moving the following line from its original position below to here causes
    // a segfault if compiled with -d
    let t2 = Timer(TimerPrint(env), 500000000, 500000000) // 500 ms

    let t1' = t1

    timers(consume t1)
    timers.cancel(t1')

    // with the following line here the example works as expected (with and without -d)
    //let t2 = Timer(TimerPrint(env), 500000000, 500000000) // 500 ms
    timers(consume t2)

@ponylang-main ponylang-main added the discuss during sync Should be discussed during an upcoming sync label Aug 1, 2023
@SeanTAllen
Member

SeanTAllen commented Aug 1, 2023

I am able to replicate. The commit ( 6806b6b ) where we switched to LLVM 15 introduces the bug. Note that it only happens in debug mode, not in release mode.

@SeanTAllen SeanTAllen added triggers release Major issue that when fixed, results in an "emergency" release help wanted Extra attention is needed bug Something isn't working needs investigation This needs to be looked into before its "ready for work" labels Aug 1, 2023
@SeanTAllen
Member

SeanTAllen commented Aug 1, 2023

Here's some wonderful LLDB output done with a debug version of the runtime:

Process 20953 stopped
* thread #3, name = 'nick-seg', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
    frame #0: 0x0000555555585b3c nick-seg`ponyint_heap_alloc_small(actor=0x0000555555584dc6, heap=0x0000555555584e06, sizeclass=0, track_finalisers_mask=0) at heap.c:469:29
   466    if(chunk != NULL)
   467    {
   468      // Clear and use the first available slot.
-> 469      uint32_t slots = chunk->slots;
   470      uint32_t bit = __pony_ctz(slots);
   471      slots &= ~((uint32_t)1 << bit);
   472
(lldb) bt
* thread #3, name = 'nick-seg', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
  * frame #0: 0x0000555555585b3c nick-seg`ponyint_heap_alloc_small(actor=0x0000555555584dc6, heap=0x0000555555584e06, sizeclass=0, track_finalisers_mask=0) at heap.c:469:29
    frame #1: 0x000055555557c98a nick-seg`pony_alloc_small(ctx=0x00007fffe74789d0, sizeclass=0) at actor.c:1115:10
    frame #2: 0x000055555556dda3 nick-seg`Main_tag_create_oo(env=0x00007ffff7c77000) at list_node.pony:0:29
    frame #3: 0x000055555557bc2a nick-seg`handle_message(ctx=0x00007ffff7c78848, actor=0x00007ffff7c77800, msg=0x00007ffff7c4e4a0) at actor.c:507:7
    frame #4: 0x000055555557b2fa nick-seg`ponyint_actor_run(ctx=0x00007ffff7c78848, actor=0x00007ffff7c77800, polling=false) at actor.c:592:20
    frame #5: 0x0000555555589be3 nick-seg`run(sched=0x00007ffff7c78800) at scheduler.c:1075:23
    frame #6: 0x00005555555890d0 nick-seg`run_thread(arg=0x00007ffff7c78800) at scheduler.c:1127:3
    frame #7: 0x00007ffff7d11b43 libc.so.6`___lldb_unnamed_symbol3481 + 755
    frame #8: 0x00007ffff7da3a00 libc.so.6`___lldb_unnamed_symbol3865 + 11

It doesn't make much sense. I'm not sure what it is saying it is segfaulting on, as chunk isn't null and isn't address 0x0.

(lldb) p chunk
(chunk_t *) $0 = 0xfff0758b48f87d8b
(lldb) p chunk.slots
  Fix-it applied, fixed expression was:
    chunk->slots
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
(lldb) p chunk->slots
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory

@SeanTAllen
Member

So my suspicion based on what I can see...

This works in release mode because "the bad code" is being optimized away in release mode. Specifically, I assume that the heap allocation is optimized away by the HeapToStack pass.

@SeanTAllen
Member

That suspicion appears to be incorrect as I built a version of the compiler with HeapToStack completely disabled and a release build of the program still works.

@SeanTAllen
Member

OK, I found a change that makes the issue go away, but it treats a symptom rather than being a cure.

Having c->opt->strip_debug = true; "fixes" the problem.

That specifically activates this code:

  if(c->opt->strip_debug) {
    PB.registerOptimizerLastEPCallback(
      [&](ModulePassManager &mpm, OptimizationLevel level) {
        mpm.addPass(StripSymbolsPass());
      }
    );
  }

which appears to be "fixing" the crash.

I suspect that we probably are generating some bad LLVM that wasn't problematic prior to LLVM 15 or was introduced during our shift to LLVM 15.

@jemc
Member

jemc commented Aug 1, 2023

The LLDB output posted in Sean's comment above is wacky in a few ways:

  1. It should not be possible for chunk to be null on that line, and thus should not be possible to segfault on that line.
  2. The nick-seg`Main_tag_create_oo(env=0x00007ffff7c77000) at list_node.pony:0:29 line is strange because list_node.pony is not where the Main.create function is defined.

Maybe the debug table is being corrupted somehow?

@jemc
Member

jemc commented Aug 1, 2023

Sean and I will discuss this more on Friday and maybe do some LLVM IR spelunking together.

@SeanTAllen
Member

Minimal repro so far:

use "collections"

actor Main
  new create(env: Env) =>
    var i: USize = 0
    while i < 100000 do
      let t1 = Timer
      OurTimers(consume t1)

      let t2 = Timer
      OurTimers(consume t2)

      i = i + 1
    end

actor OurTimers
  new create() =>
    None

  be apply(timer: Timer iso) =>
    None

class Timer
  embed _node: ListNode[Timer]

  new iso create() =>
    _node = ListNode[Timer]
    try _node()? = this end

@SeanTAllen
Member

Final minimal repro:

actor Main
  new create(env: Env) =>
    var i: USize = 0
    while i < 100000 do
      let t1 = Timer
      OurTimers(consume t1)

      let t2 = Timer
      OurTimers(consume t2)

      i = i + 1
    end

actor OurTimers
  new create() =>
    None

  be apply(timer: Timer iso) =>
    None

class Timer
  embed _node: TimerListNode

  new iso create() =>
    _node = TimerListNode
    try _node()? = this end

class TimerListNode
  var _item: (Timer| None)

  new create(item: (Timer | None) = None) =>
    _item = item

  fun ref update(value: Timer): Timer^ ? =>
    error

@SeanTAllen
Member

This is the same issue as #3874 except that it is also appearing on X86 not just aarch64.

@SeanTAllen
Member

Further refinement:

actor Main
  new create(env: Env) =>
    var i: USize = 0
    while i < 1_000_000 do
      let t1 = Timer
      let t2 = Timer
      OurTimers(consume t1)

      i = i + 1
    end

actor OurTimers
  new create() =>
    None

  be apply(timer: Timer iso) =>
    None

class Timer
  embed _node: TimerListNode

  new iso create() =>
    _node = TimerListNode
    try _node()? end

class TimerListNode
  fun apply() ? =>
    error

I'm no longer sure it is the same issue as #3874, but it has the same "slight change to address". With #3874 this was reproducible every time. This one often works just fine on x86; we need to run it many times to hit the issue, and even then it's not guaranteed to happen.

@SeanTAllen
Member

Here's a version that seems to be fully reproducible on my machine without looping which could make it #3874 again.

actor Main
  new create(env: Env) =>
    let t1 = Timer
    let t2 = Timer
    OurTimers(consume t1)

actor OurTimers
  new create() =>
    None

  be apply(timer: Timer iso) =>
    None

class Timer
  embed _node: TimerListNode

  new iso create() =>
    _node = TimerListNode
    try _node()? end

class TimerListNode
  fun apply() ? =>
    error

@SeanTAllen
Member

@ponylang/committer can you test on your machines and see what happens with that program using ponyc 0.55.0 and report if it works or crashes and what your particular setup is in either case?

@SeanTAllen
Member

I've discovered that our minimal repros do not always crash. In fact, I can't make most of my minimal reproductions crash if I build ponyc myself, but I can with the original, so let's go with the original.

@SeanTAllen
Member

I can't reproduce this in CI except for the musl x86 builds. That is very weird, as I am reproducing locally on x86 glibc with Ubuntu 22.04.

@NickJT
Author

NickJT commented Aug 6, 2023

Results with minimal repro of Aug 5, 2023, 1:15 AM GMT+1

Windows 10 Pro Version 22H2
Intel(R) Core(TM) i5-6300U 8.00GB

ponyc 0.54.1 [release]
Compiled with: LLVM 15.0.7 -- MSVC-19.32.31332.0-x64 1932
release - runs to completion
debug - runs to completion

ponyc 0.55.0 [release]
Compiled with: LLVM 15.0.7 -- MSVC-19.32.31332.0-x64 1932
release - runs to completion
debug - runs to completion

Fedora Linux 38 (Workstation Edition)
Intel® Core™ i5-6500T 16.00GB

ponyc 0.54.1-c41393ce [release]
Compiled with: LLVM 15.0.7 -- Clang-16.0.3-x86_64
Defaults: pic=true
release - runs to completion
debug - 'Segmentation fault (core dumped)'

ponyc 0.55.0-2dc8d01c [release]
Compiled with: LLVM 15.0.7 -- Clang-16.0.6-x86_64
Defaults: pic=true
release - runs to completion
debug - 'Segmentation fault (core dumped)'

Ubuntu 23.04
Intel® Core™ i7-4770 16.00GB

ponyc 0.55.0-6c9b027a [release]
Compiled with: LLVM 15.0.7 -- Clang-15.0.7-x86_64
Defaults: pic=true
release - runs to completion
debug - 'Segmentation fault (core dumped)'

@SeanTAllen
Member

@jemc what are the steps I should take to try using the "llvm tools" to link the program to see if it fixes the crash for me?

@SeanTAllen
Member

I tried this with the x86 Linux glibc environment being on GH now... still can't get a reproduction from

@SeanTAllen
Member

@NickJT can you try retesting with ponyc --debug --strip?

@SeanTAllen
Member

When setting up macOS x86 CI, I found that the cycle detector runner test fails with bad access. That started with the LLVM 15 update as well. I can fix that, and it appears this issue as well, by using --strip along with --debug (which was noted earlier). So they appear to be related. I think running down what is going on with strip debug is the proper course of action once we get verification that it fixes the issue.

@SeanTAllen
Member

The problem appears to be localized to using release versions of the runtime and then doing a debug compile of a pony program. Stripping debug symbols fixes the problem. Hypothesis: the debug info in the binary we generate with ponyc --debug is expecting information to be in the runtime that isn't there and "bad things happen".

@SeanTAllen
Member

--strip does two things:

  StripDebugInfo(M);
  StripSymbolNames(M, false);

StripDebugInfo is what fixes our problem when we do --strip. Stripping symbol names is unimportant to our bug.

@SeanTAllen
Member

The interesting question here is why changing the optimization level also fixes the issue. Does it strip debug info?

@SeanTAllen
Member

Not running MPM.run(*unwrap(c->module), MAM); in genopt.cc also fixes our issue.

@SeanTAllen
Member

Turning off buildModuleSimplificationPipeline fixes our issue.

@SeanTAllen
Member

and in there, removing

  if (EnableModuleInliner)
    MPM.addPass(buildModuleInlinerPipeline(Level, Phase));
  else
    MPM.addPass(buildInlinerPipeline(Level, Phase));

fixes our problem. Note that both branches look like they would cause us issues, but it appears only the else branch is currently active.

@SeanTAllen
Member

SeanTAllen commented Aug 15, 2023

Turning on LTO fixes the issue for me for one of the smaller reproductions but doesn't fix it for the original. --strip fixes it for all of them.

@SeanTAllen
Member

All debugging I did this evening was with:

use @pony_exitcode[None](code: I32)

actor Main
  new create(env: Env) =>
    let t1 = Timer
    let t2 = Timer
    OurTimers(consume t1)
    @pony_exitcode(0)

actor OurTimers
  new create() =>
    None

  be apply(timer: Timer iso) =>
    None

class Timer
  embed _node: TimerListNode

  new iso create() =>
    _node = TimerListNode
    try _node()? end

class TimerListNode
  fun apply() ? =>
    error

except for a last test for LTO where I used the original code when the issue was opened.

@SeanTAllen
Member

Turning off

  if (EnableModuleInliner)
    MPM.addPass(buildModuleInlinerPipeline(Level, Phase));
  else
    MPM.addPass(buildInlinerPipeline(Level, Phase));

also fixes the original version.

@SeanTAllen
Member

Using a debug runtime with --debug also crashes the original version (but not the reproduction I was using for pretty much all testing this evening).

@NickJT
Author

NickJT commented Aug 15, 2023

Results with minimal repro of Aug 5, 2023, 1:15 AM GMT+1

Fedora Linux 38 (Workstation Edition)
Intel® Core™ i5-6500T 16.00GB

ponyc 0.55.0-2dc8d01c [release]
Compiled with: LLVM 15.0.7 -- Clang-16.0.6-x86_64
Defaults: pic=true

ponyc --debug : segfault
ponyc --debug --strip : runs to completion

Ubuntu 23.04
Intel® Core™ i7-4770 16.00GB

ponyc 0.55.0-6c9b027a [release]
Compiled with: LLVM 15.0.7 -- Clang-15.0.7-x86_64
Defaults: pic=true

ponyc --debug : segfault
ponyc --debug --strip : runs to completion

@SeanTAllen
Member

SeanTAllen commented Aug 15, 2023

I found what in the switch to LLVM 15 caused this issue to crop up. I still have a few things to discuss with Joe, but we changed the optimizations run for --debug when we switched to LLVM 15 (I'm not sure this was intentional), which then caused a pre-existing bug with "release mode" and "dwarf debug info" to start appearing with --debug. I have learned a bit about the underlying problem that I want to discuss with Joe. In the meantime, I have a fix that would "take us back to how optimizations were for LLVM 14" to go over at sync with @jemc. We can "do the fix back to LLVM 14" and I can write up more about the debug issue.

I can trigger this issue with LLVM 14 by running the inliner passes that are used for optimization levels above -O0.

SeanTAllen added a commit that referenced this issue Aug 15, 2023
When we upgraded to LLVM 15, we accidentally changed our optimization
level for programs compiled with --debug. This uncovered a pre-existing
condition and caused issues like #4369.

This takes us back to LLVM 14 level but doesn't "fix" the underlying
issue which we are continuing to work on.
SeanTAllen added a commit that referenced this issue Aug 16, 2023
@SeanTAllen
Member

This is "fixed" in 0.55.1 but the underlying issue remains. #4401 has a "fix" that comes closer to addressing the underlying issue (but at the moment is just an investigation).

@SeanTAllen SeanTAllen removed the help wanted Extra attention is needed label Aug 16, 2023
@SeanTAllen
Member

SeanTAllen commented Aug 25, 2023

Results with minimal repro of Aug 5, 2023, 1:15 AM GMT+1

Fedora Linux 38 (Workstation Edition)
Intel® Core™ i5-6500T 16.00GB

ponyc 0.55.0-2dc8d01c [release]
Compiled with: LLVM 15.0.7 -- Clang-16.0.6-x86_64
Defaults: pic=true

ponyc --debug : segfault
ponyc --debug --strip : runs to completion

Ubuntu 23.04
Intel® Core™ i7-4770 16.00GB

ponyc 0.55.0-6c9b027a [release]
Compiled with: LLVM 15.0.7 -- Clang-15.0.7-x86_64
Defaults: pic=true

ponyc --debug : segfault
ponyc --debug --strip : runs to completion

@NickJT did you build those ponyc's that you used to test yourself?

@NickJT
Author

NickJT commented Aug 26, 2023 via email

@SeanTAllen
Member

@NickJT did you do a release or debug configuration when building the compiler?

@NickJT
Author

NickJT commented Aug 26, 2023 via email

SeanTAllen added a commit that referenced this issue Aug 27, 2023
@SeanTAllen
Member

@NickJT can you update to commit 3bafb91, rebuild, and verify that --debug with a release runtime no longer segfaults for you? Both of the commits you were testing with are before that, and I believe there is a change from August 8th (a couple of commits before the one I asked you to update to) that changes the nature of the bug so that it only happens on x86 glibc Linux with debug rather than release runtimes.

@SeanTAllen
Member

SeanTAllen commented Aug 27, 2023

I've learned a lot about the weirdness we are seeing around this.

For example, builds starting with the August 9 nightly do not segfault. This corresponds with switching from Cirrus to GH and a new builder image for releases. However, that isn't what appears to "fix" the issue when using a release runtime... 277329e changes the nature of the boom so that on glibc Linux it appears to only happen with debug runtime versions.

Another weirdness was "why is this passing for CI with the regression". That turns out to be a bug, which needs to be investigated, in how we run the regression-4369 test on glibc Linux. It passes CI when using a debugger, but fails otherwise (assuming you are on a branch that 'unfixes' the fix we slapped onto main while investigating further). See https://ponylang.zulipchat.com/#narrow/stream/190359-ci/topic/runner.20on.20x86.20glibc.20CI

So in a bit of "sanity", this appears to now be back to run of the mill, "wtf" optimization insanity related in some fashion to some debug info. You know, really hard, but not completely off the wall.

Investigation continues.

@NickJT
Author

NickJT commented Aug 28, 2023

I hope this is what you asked me to test, but commit 3bafb91 gives a segfault with -d (though not with -d --strip).

I've included the relevant sections of the terminal log below (just in case I missed something).

With the latest commit (66c74e6) the minimal example runs to completion with just -d.

Fedora Linux 38 (Workstation Edition)
Kernel version: Linux 6.4.12-200.fc38.x86_64
Intel® Core™ i5-6500T 16.00GB

update to commit 3bafb91

[nick@pete ponyc]$ git checkout 3bafb91
Previous HEAD position was 713ec1b Add unreleased section to CHANGELOG post 0.55.1 release
HEAD is now at 3bafb91 Fix stress test misconfiguration after move from CirrusCI

configure/build/install...

[nick@pete ponyc]$ ponyc -v
0.55.0-3bafb917 [release]
Compiled with: LLVM 15.0.7 -- Clang-16.0.6-x86_64
Defaults: pic=true
[nick@pete ponyc]$ cd ~/pony/minimal
[nick@pete minimal]$ ponyc -d -o bin
Building builtin -> /usr/local/lib/pony/0.55.0-3bafb917/packages/builtin
Building . -> /home/nick/pony/minimal
Generating
Reachability
Selector painting
Data prototypes
Data types
Function prototypes
Functions
Descriptors
Writing bin/minimal.o
Linking bin/minimal
[nick@pete minimal]$ ./bin/minimal
Segmentation fault (core dumped)
[nick@pete minimal]$ ponyc -d --strip -o bin
Building builtin -> /usr/local/lib/pony/0.55.0-3bafb917/packages/builtin
Building . -> /home/nick/pony/minimal
Generating
Reachability
Selector painting
Data prototypes
Data types
Function prototypes
Functions
Descriptors
Writing bin/minimal.o
Linking bin/minimal
[nick@pete minimal]$ ./bin/minimal
runs to completion
[nick@pete minimal]$

with latest commit

[nick@pete ponyc]$ git checkout 66c74e6
Previous HEAD position was 3bafb91 Fix stress test misconfiguration after move from CirrusCI
HEAD is now at 66c74e6 Fix small output error in make.ps1 (#4411)

[configure/build/install...]

[nick@pete ponyc]$ ponyc -v
0.55.1-66c74e6c [release]
Compiled with: LLVM 15.0.7 -- Clang-16.0.6-x86_64
Defaults: pic=true
[nick@pete ponyc]$ cd ~/pony/minimal
[nick@pete minimal]$ ponyc -d -o bin
Building builtin -> /usr/local/lib/pony/0.55.1-66c74e6c/packages/builtin
Building . -> /home/nick/pony/minimal
Generating
Reachability
Selector painting
Data prototypes
Data types
Function prototypes
Functions
Descriptors
Writing bin/minimal.o
Linking bin/minimal
[nick@pete minimal]$ ./bin/minimal
runs to completion
[nick@pete minimal]$

@SeanTAllen
Member

SeanTAllen commented Aug 29, 2023

While looking into a related issue, I ran the regression test for this under gdb rather than lldb and got the following:

[  FAILED  ] regression-4369 (7750 ms): expected exit code 0; actual was 255
regression-4369: STDOUT:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffefc78640 (LWP 16647)]
[New Thread 0x7fffef477640 (LWP 16648)]
[New Thread 0x7fffe6c76640 (LWP 16649)]
[New Thread 0x7fffde475640 (LWP 16650)]

Thread 3 "regression-4369" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffef477640 (LWP 16648)]
0x000055555558361c in ponyint_heap_alloc_small ()
A debugging session is active.

        Inferior 1 [process 16598] will be killed.

Quit anyway? (y or n) [answered Y; input not from terminal]

regression-4369: STDERR
Dwarf Error: Cannot find DIE at 0x1f69 [from module /home/sean/code/ponylang/ponyc-4369/build/debug/runner-tests/debug/regression-4369]

Undefined command: "".  Try "help".
Undefined command: "".  Try "help".
Undefined command: "".  Try "help".
Undefined command: "".  Try "help".
Undefined command: "".  Try "help".
Undefined command: "".  Try "help".
Undefined command: "".  Try "help".
Undefined command: "".  Try "help".

Given the noise, I'm not sure how much stock to put into this, however, the Dwarf Error: Cannot find DIE at 0x1f69 [from module /home/sean/code/ponylang/ponyc-4369/build/debug/runner-tests/debug/regression-4369] did catch my attention.

It is possible something we are doing makes backtracing in gdb not work, but, it is worth someone looking into.

Note that dwarfdump doesn't report any issues:

test/libponyc-run/regression-4369 on  force-4369 [?] ➜ ~/code/ponylang/ponyc-4369/build/libs/bin/llvm-dwarfdump ./regression-4369 --verify
Verifying ./regression-4369:    file format elf64-x86-64
Verifying .debug_abbrev...
Verifying .debug_info Unit Header Chain...
Verifying .debug_types Unit Header Chain...
Verifying non-dwo Units...
Verifying unit: 1 / 32, "regression-4369"
Verifying unit: 2 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/actor/actor.c"
Verifying unit: 3 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/actor/messageq.c"
Verifying unit: 4 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/asio/epoll.c"
Verifying unit: 5 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/asio/event.c"
Verifying unit: 6 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/gc/cycle.c"
Verifying unit: 7 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/gc/delta.c"
Verifying unit: 8 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/gc/gc.c"
Verifying unit: 9 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/gc/objectmap.c"
Verifying unit: 10 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/gc/serialise.c"
Verifying unit: 11 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/gc/trace.c"
Verifying unit: 12 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/lang/posix_except.c"
Verifying unit: 13 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/lang/stdfd.c"
Verifying unit: 14 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/mem/heap.c"
Verifying unit: 15 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/mem/pagemap.c"
Verifying unit: 16 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/mem/pool.c"
Verifying unit: 17 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/platform/ponyassert.c"
Verifying unit: 18 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/sched/cpu.c"
Verifying unit: 19 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/sched/scheduler.c"
Verifying unit: 20 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/sched/start.c"
Verifying unit: 21 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/asio/asio.c"
Verifying unit: 22 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/ds/fun.c"
Verifying unit: 23 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/ds/hash.c"
Verifying unit: 24 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/ds/stack.c"
Verifying unit: 25 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/gc/actormap.c"
Verifying unit: 26 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/lang/lsda.c"
Verifying unit: 27 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/lang/socket.c"
Verifying unit: 28 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/mem/alloc.c"
Verifying unit: 29 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/options/options.c"
Verifying unit: 30 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/platform/threads.c"
Verifying unit: 31 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/sched/mpmcq.c"
Verifying unit: 32 / 32, "/home/sean/code/ponylang/ponyc-4369/src/libponyrt/sched/mutemap.c"
Verifying dwo Units...
No errors.

@SeanTAllen SeanTAllen removed the discuss during sync Should be discussed during an upcoming sync label Sep 5, 2023
SeanTAllen added a commit that referenced this issue Sep 12, 2023
After extensive investigation of the weirdness that is #4369, Joe
and I believe the issue is most likely an LLVM bug. However, we
don't have time to pursue further.

During the investigation, I came to believe that the change to use
CodeGenOpt::Default instead of CodeGenOpt::None for target optimizations
would be prudent.

It prevents the known instances of bugs like #4369 which is nice but
not key to this change. Changing things because they make symptoms
go away isn't a good strategy if you don't understand why. We do not
"fully" understand the impact of this change on bugs like #4369. However,
it appears that CodeGenOpt::None is a code path that sees little use
or testing and is more prone to bugs/regression.

Given that we have seen more than 1 bug that "smells like" #4369, we felt
it was wise to switch to Default from None at least until such time as we
understand #4369 and #3874. LLVM is a complicated beast and using code paths
that are little used can be a dangerous proposition.

This commit doesn't include release notes as we addressed the user-facing
instance of #4369 already.