Triage mode crashes when all CPUs in socket fail #250

Closed
thiagomacieira opened this issue Jun 20, 2023 · 3 comments · May be fixed by #254

@thiagomacieira (Contributor)

No time to investigate the cause. This is just the symptom:

SANDSTONE_MOCK_TOPOLOGY='0 0:1 1 1:1' ./opendcdiag -o - --selftests -e selftest_fail_socket1 -T 25ms --retest-on-failure=0 --cpuset=2,3
THIS IS AN UNOPTIMIZED BUILD: DON'T TRUST TEST TIMING!
command-line: 'opendcdiag -o - --selftests -e selftest_fail_socket1 -T 25ms --retest-on-failure=0 --cpuset=2,3'
version: opendcdiag-9e42ea36e98e-dirty
os: Linux 6.3.7-1-default, glibc 2.37
timing: { duration: 1000.000, timeout: 300000.000 }
cpu-info:
  0: { logical: 2, package: 1, core: 0, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
  1: { logical: 3, package: 1, core: 1, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
tests:
- test: selftest_fail_socket1
  details: { quality: beta, description: "Fails on any thread of socket 1" }
  state: { seed: 'AES:2c2e316271ef2b52d2dad9e0f4e4eaadd3d1ce9d8e10d4ad2d25261f0b1b1552', iteration: 0, retry: false }
  time-at-start: { elapsed:      0.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  result: fail
  fail: { cpu-mask: '__:XX', time-to-fail: 3.177, seed: 'AES:2c2e316271ef2b52d2dad9e0f4e4eaadd3d1ce9d8e10d4ad2d25261f0b1b1552'}
  time-at-end:   { elapsed:      4.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  test-runtime: 4.487
  threads:
  - thread: 0
    id: { logical: 2, package: 1, core: 0, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
    state: failed
    time-to-fail: 3.971
    loop-count: 0
    messages:
  - thread: 1
    id: { logical: 3, package: 1, core: 1, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
    state: failed
    time-to-fail: 3.177
    loop-count: 0
    messages:
# Test failed 1 out of 1 times (100.0%)
# All CPUs failed equally. This is highly unlikely (SW bug?)
# Loop iteration 1 finished, average time 6.92119 ms, total 6.92119 ms
- test: mce_check
  details: { quality: production, description: "Machine Check Exceptions/Events count" }
  state: { seed: 'AES:ebcfaed50725d0cc20106f752d2f5a91d3d1ce9d8e10d4ad2d25261f0b1b1552', iteration: 0, retry: false }
  time-at-start: { elapsed:      8.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  result: pass
  time-at-end:   { elapsed:     16.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  test-runtime: 6.797
- test: selftest_fail_socket1
  details: { quality: beta, description: "Fails on any thread of socket 1" }
  state: { seed: 'AES:7224cb33e780b98bc8c48abac8c11bfad3d1ce9d8e10d4ad2d25261f0b1b1552', iteration: 0, retry: false }
  time-at-start: { elapsed:     16.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  result: fail
  fail: { cpu-mask: '__:XX', time-to-fail: 0.433, seed: 'AES:7224cb33e780b98bc8c48abac8c11bfad3d1ce9d8e10d4ad2d25261f0b1b1552'}
  time-at-end:   { elapsed:     16.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  test-runtime: 0.631
  threads:
  - thread: 0
    id: { logical: 2, package: 1, core: 0, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
    state: failed
    time-to-fail: 0.434
    loop-count: 0
    messages:
  - thread: 1
    id: { logical: 3, package: 1, core: 1, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
    state: failed
    time-to-fail: 0.433
    loop-count: 0
    messages:
# Test failed 1 out of 1 times (100.0%)
# All CPUs failed equally. This is highly unlikely (SW bug?)
# Loop iteration 2 finished, average time 10.1804 ms, total 20.3609 ms
- test: selftest_fail_socket1
  details: { quality: beta, description: "Fails on any thread of socket 1" }
  state: { seed: 'AES:4cbfc8b4f59547244c21cf21389160e2d3d1ce9d8e10d4ad2d25261f0b1b1552', iteration: 0, retry: false }
  time-at-start: { elapsed:     16.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  result: fail
  fail: { cpu-mask: '__:XX', time-to-fail: 0.348, seed: 'AES:4cbfc8b4f59547244c21cf21389160e2d3d1ce9d8e10d4ad2d25261f0b1b1552'}
  time-at-end:   { elapsed:     20.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  test-runtime: 0.525
  threads:
  - thread: 0
    id: { logical: 2, package: 1, core: 0, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
    state: failed
    time-to-fail: 0.348
    loop-count: 0
    messages:
  - thread: 1
    id: { logical: 3, package: 1, core: 1, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
    state: failed
    time-to-fail: 0.349
    loop-count: 0
    messages:
# Test failed 1 out of 1 times (100.0%)
# All CPUs failed equally. This is highly unlikely (SW bug?)
- test: selftest_fail_socket1
  details: { quality: beta, description: "Fails on any thread of socket 1" }
  state: { seed: 'AES:edb1716d18f1f8b039001f61400a5d12d3d1ce9d8e10d4ad2d25261f0b1b1552', iteration: 0, retry: false }
  time-at-start: { elapsed:     20.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  result: fail
  fail: { cpu-mask: '__:XX', time-to-fail: 0.388, seed: 'AES:edb1716d18f1f8b039001f61400a5d12d3d1ce9d8e10d4ad2d25261f0b1b1552'}
  time-at-end:   { elapsed:     20.000, now: !!timestamp '2023-06-20T16:33:18Z' }
  test-runtime: 0.579
  threads:
  - thread: 0
    id: { logical: 2, package: 1, core: 0, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
    state: failed
    time-to-fail: 0.397
    loop-count: 0
    messages:
  - thread: 1
    id: { logical: 3, package: 1, core: 1, thread: 0, family: 6, model: 0x3c, stepping: 0, microcode: null, ppin: null }
    state: failed
    time-to-fail: 0.388
    loop-count: 0
    messages:
# Test failed 1 out of 1 times (100.0%)
# All CPUs failed equally. This is highly unlikely (SW bug?)
framework/topology.cpp:694: void load_cpu_info(const LogicalProcessorSet&): Assertion `sApp->thread_count' failed.
(gdb) bt
#0  0x00007ffff5a92a7c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007ffff5a41226 in raise () from /lib64/libc.so.6
#2  0x00007ffff5a29897 in abort () from /lib64/libc.so.6
#3  0x00007ffff5a297ab in __assert_fail_base.cold () from /lib64/libc.so.6
#4  0x00007ffff5a394b6 in __assert_fail () from /lib64/libc.so.6
#5  0x00000000004873ea in load_cpu_info (enabled_cpus=...) at framework/topology.cpp:694
#6  0x000000000045d58b in run_tests_on_cpu_set (tests=std::vector of length 3, capacity 4 = {...}, set=...)
    at framework/sandstone.cpp:2546
#7  0x000000000045df72 in run_triage (triage_tests=std::vector of length 3, capacity 4 = {...}, enabled_cpus=...)
    at framework/sandstone.cpp:2673
#8  0x00000000004608c6 in main (argc=10, argv=0x7fffffffdc68) at framework/sandstone.cpp:3796
(gdb) f 6
#6  0x000000000045d58b in run_tests_on_cpu_set (tests=std::vector of length 3, capacity 4 = {...}, set=...)
(gdb) p set
$1 = (const LogicalProcessorSet &) @0x7fffffffd390: {static Size = 1024, array = {0 <repeats 16 times>}}

The set of processors to test is empty.
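From the backtrace, run_tests_on_cpu_set() hands load_cpu_info() an empty set, and the `sApp->thread_count` assertion fires. For illustration only, a standalone sketch of a guard, with stubbed types and a hypothetical count() accessor (this is not the actual fix in the linked PR):

```cpp
// Standalone sketch (hypothetical names, not opendcdiag's real API):
// triage slices the enabled CPUs by socket, and a slice can end up
// empty -- here --cpuset=2,3 leaves no CPUs outside socket 1 -- so
// guard before load_cpu_info() asserts on sApp->thread_count.
#include <bitset>
#include <cstdio>

struct LogicalProcessorSet {
    static constexpr int Size = 1024;
    std::bitset<Size> bits;                 // stands in for the real 1024-bit array
    size_t count() const { return bits.count(); }
};

static void load_cpu_info(const LogicalProcessorSet &set) {
    // the real code asserts on sApp->thread_count here; the guard
    // below keeps us from ever reaching it with an empty set
    std::printf("loading info for %zu CPUs\n", set.count());
}

static bool run_tests_on_cpu_set(const LogicalProcessorSet &set) {
    if (set.count() == 0)
        return false;                       // empty slice: nothing to retest
    load_cpu_info(set);
    // ... run the triage tests on the CPUs in `set` ...
    return true;
}

int main() {
    LogicalProcessorSet empty;              // socket 0 after --cpuset=2,3
    if (!run_tests_on_cpu_set(empty))
        std::printf("skipped empty socket slice\n");
}
```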

@thiagomacieira (Contributor, Author)

Very likely related: in a non-debug build, we've seen:

[   13.273269] ok   2 testname      # SKIP(RuntimeSkipCategory: All CPUs skipped while executing 'test_run()' function, check log for details)
[   13.275270] ok   4 testname      # SKIP(RuntimeSkipCategory: All CPUs skipped while executing 'test_run()' function, check log for details)
[   13.277270] ok   6 testname      # SKIP(RuntimeSkipCategory: All CPUs skipped while executing 'test_run()' function, check log for details)

The test does not have EXIT_SKIP in its run function, so it's impossible for its run function to skip. That must mean the test function was not executed at all and the framework somehow concluded that this is a skip. The timestamps showing the executions taking about 2 ms each are also indicative of nothing getting run.
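A guess at the mechanism, sketched with hypothetical names (not the framework's actual code): if the scheduler finds no eligible CPUs, no thread ever enters test_run(), and an aggregation step that keys off per-thread records could fall through to SKIP:

```cpp
// Hypothetical sketch of the suspected mechanism, not opendcdiag code:
// with an empty CPU set, zero threads are started, so a result
// aggregator that counts per-thread records reports SKIP even for a
// test whose run function can never call EXIT_SKIP.
#include <cstdio>

enum class TestResult { Pass, Fail, Skip };

static TestResult aggregate_results(int threads_started, int failures)
{
    if (threads_started == 0)
        return TestResult::Skip;   // the "All CPUs skipped ..." path
    return failures > 0 ? TestResult::Fail : TestResult::Pass;
}

int main()
{
    // Empty CPU set: test_run() never executes, yet the result is SKIP.
    if (aggregate_results(/*threads_started=*/0, /*failures=*/0) == TestResult::Skip)
        std::printf("ok   2 testname      # SKIP(...)\n");
}
```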

The ok numbers incrementing by 2 were in the original log file; that's not a copy/paste issue.

@jposwiata self-assigned this on Jun 22, 2023
@jposwiata (Contributor) commented on Jun 22, 2023

The issue pops up when package #0 has no cores in the topology (either because of --cpuset or because of "slicing").
The 'ok' number increments were caused by a double increment per test run (once in the "common" run_one_test() and once in run_tests_on_cpu_set()).
In some scenarios it was possible to hit multiple SKIP conditions, which added multiple 'ok' lines to the log; that was the potential root cause of some regressions.
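For illustration, a minimal standalone sketch of that double increment (hypothetical names, not the actual opendcdiag functions): bumping the TAP test number once per test in each of the two call layers yields exactly the even-only 'ok 2 / ok 4 / ok 6' numbering in the log above:

```cpp
// Standalone illustration of the double-increment bug (hypothetical
// names): the TAP test number is incremented per test in both call
// layers, so the printed numbers advance by 2.
#include <cstdio>

static int tap_test_number = 0;

static void run_one_test()
{
    ++tap_test_number;           // increment in the "common" path
    // ... test body skipped: no CPUs in the set ...
}

static void run_tests_on_cpu_set(const char *name)
{
    ++tap_test_number;           // increment in the triage path (the bug)
    run_one_test();
    std::printf("ok   %d %s      # SKIP(RuntimeSkipCategory: ...)\n",
                tap_test_number, name);
}

int main()
{
    for (int i = 0; i < 3; ++i)
        run_tests_on_cpu_set("testname");   // prints ok 2, ok 4, ok 6
}
```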

@thiagomacieira (Contributor, Author)

Doesn't happen in the socket-separation branch.
