Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tool.record_filter_bycore_multi failured due to inaccurate pi estimation #6755

Open
derekbruening opened this issue Apr 4, 2024 · 3 comments

Comments

@derekbruening
Copy link
Contributor

The tool.record_filter_bycore_multi test failed on a64 sve at https://github.com/DynamoRIO/dynamorio/actions/runs/8546804195/job/23417839931?pr=6754

The output mismatch looks like it's b/c it estimated pi at 3.94... instead of 3.14...

2024-04-03T23:32:41.7151484Z 309: Running postcmd4 |/opt/actions-runner/_work/dynamorio/dynamorio/build/build_debug-internal-64/clients/bin64/drcachesim;-indir;tool.record_filter_bycore_multi.filtered.dir;-simulator_type;schedule_stats|
2024-04-03T23:32:41.7153291Z 309: CMake Error at /opt/actions-runner/_work/dynamorio/dynamorio/suite/tests/runmulti.cmake:138 (message):
2024-04-03T23:32:41.7154282Z 309:   output |Estimation of pi is 3.943208188861861
2024-04-03T23:32:41.7154698Z 309: 
2024-04-03T23:32:41.7154952Z 309:   Trace invariant checks passed
2024-04-03T23:32:41.7155312Z 309: 
2024-04-03T23:32:41.7155609Z 309:   Output 2098798 entries from 2098250 entries.
2024-04-03T23:32:41.7156026Z 309: 
2024-04-03T23:32:41.7156271Z 309:   Schedule stats tool results:
2024-04-03T23:32:41.7156619Z 309: 
2024-04-03T23:32:41.7156846Z 309:   Total counts:
2024-04-03T23:32:41.7157122Z 309: 
2024-04-03T23:32:41.7157346Z 309:              3 cores
2024-04-03T23:32:41.7157645Z 309:              3 threads
2024-04-03T23:32:41.7157968Z 309:         215938 instructions
2024-04-03T23:32:41.7158346Z 309:              9 total context switches
2024-04-03T23:32:41.7158880Z 309:      0.0416786 CSPKI (context switches per 1000 instructions)
2024-04-03T23:32:41.7159455Z 309:          23993 instructions per context switch
2024-04-03T23:32:41.7160100Z 309:              9 voluntary context switches
2024-04-03T23:32:41.7160551Z 309:              0 direct context switches
2024-04-03T23:32:41.7160965Z 309:         100.00% voluntary switches
2024-04-03T23:32:41.7161356Z 309:           0.00% direct switches
2024-04-03T23:32:41.7161724Z 309:             71 system calls
2024-04-03T23:32:41.7162279Z 309:             14 maybe-blocking system calls
2024-04-03T23:32:41.7162734Z 309:              0 direct switch requests
2024-04-03T23:32:41.7163312Z 309:              0 waits
2024-04-03T23:32:41.7163637Z 309:        1750080 idles
2024-04-03T23:32:41.7163998Z 309:          10.98% cpu busy by record count
2024-04-03T23:32:41.7164414Z 309:         558629 cpu microseconds
2024-04-03T23:32:41.7164792Z 309:         874238 idle microseconds
2024-04-03T23:32:41.7165222Z 309:         110044 idle microseconds at last instr
2024-04-03T23:32:41.7165789Z 309:          38.99% cpu busy by time
2024-04-03T23:32:41.7166294Z 309:          83.54% cpu busy by time, ignoring idle past last instr
2024-04-03T23:32:41.7166776Z 309: 
2024-04-03T23:32:41.7166996Z 309:   Core #0 counts:
2024-04-03T23:32:41.7167271Z 309: 
2024-04-03T23:32:41.7167493Z 309:              2 threads
2024-04-03T23:32:41.7167817Z 309:          10910 instructions
2024-04-03T23:32:41.7168191Z 309:              4 total context switches
2024-04-03T23:32:41.7168715Z 309:      0.3666361 CSPKI (context switches per 1000 instructions)
2024-04-03T23:32:41.7169280Z 309:           2728 instructions per context switch
2024-04-03T23:32:41.7169759Z 309:              4 voluntary context switches
2024-04-03T23:32:41.7170200Z 309:              0 direct context switches
2024-04-03T23:32:41.7170619Z 309:         100.00% voluntary switches
2024-04-03T23:32:41.7171004Z 309:           0.00% direct switches
2024-04-03T23:32:41.7171367Z 309:             19 system calls
2024-04-03T23:32:41.7171895Z 309:              3 maybe-blocking system calls
2024-04-03T23:32:41.7172344Z 309:              0 direct switch requests
2024-04-03T23:32:41.7172728Z 309:              0 waits
2024-04-03T23:32:41.7173032Z 309:         720111 idles
2024-04-03T23:32:41.7173375Z 309:           1.49% cpu busy by record count
2024-04-03T23:32:41.7173790Z 309:         558629 cpu microseconds
2024-04-03T23:32:41.7174171Z 309:          18290 idle microseconds
2024-04-03T23:32:41.7174602Z 309:          18290 idle microseconds at last instr
2024-04-03T23:32:41.7175176Z 309:          96.83% cpu busy by time
2024-04-03T23:32:41.7175699Z 309:          96.83% cpu busy by time, ignoring idle past last instr
2024-04-03T23:32:41.7176179Z 309: 
2024-04-03T23:32:41.7176406Z 309:   Core #1 counts:
2024-04-03T23:32:41.7176681Z 309: 
2024-04-03T23:32:41.7176906Z 309:              2 threads
2024-04-03T23:32:41.7177225Z 309:         169283 instructions
2024-04-03T23:32:41.7177597Z 309:              2 total context switches
2024-04-03T23:32:41.7178126Z 309:      0.0118145 CSPKI (context switches per 1000 instructions)
2024-04-03T23:32:41.7178687Z 309:          84642 instructions per context switch
2024-04-03T23:32:41.7179165Z 309:              2 voluntary context switches
2024-04-03T23:32:41.7179702Z 309:              0 direct context switches
2024-04-03T23:32:41.7180111Z 309:         100.00% voluntary switches
2024-04-03T23:32:41.7180497Z 309:           0.00% direct switches
2024-04-03T23:32:41.7180865Z 309:             36 system calls
2024-04-03T23:32:41.7181385Z 309:              6 maybe-blocking system calls
2024-04-03T23:32:41.7181830Z 309:              0 direct switch requests
2024-04-03T23:32:41.7182213Z 309:              0 waits
2024-04-03T23:32:41.7182515Z 309:         463025 idles
2024-04-03T23:32:41.7182860Z 309:          26.77% cpu busy by record count
2024-04-03T23:32:41.7183274Z 309:              0 cpu microseconds
2024-04-03T23:32:41.7183650Z 309:         613649 idle microseconds
2024-04-03T23:32:41.7184076Z 309:          33267 idle microseconds at last instr
2024-04-03T23:32:41.7184526Z 309:           0.00% cpu busy by time
2024-04-03T23:32:41.7185027Z 309:           0.00% cpu busy by time, ignoring idle past last instr
2024-04-03T23:32:41.7185505Z 309: 
2024-04-03T23:32:41.7185728Z 309:   Core #2 counts:
2024-04-03T23:32:41.7185997Z 309: 
2024-04-03T23:32:41.7186222Z 309:              1 threads
2024-04-03T23:32:41.7186544Z 309:          35745 instructions
2024-04-03T23:32:41.7186916Z 309:              3 total context switches
2024-04-03T23:32:41.7187445Z 309:      0.0839278 CSPKI (context switches per 1000 instructions)
2024-04-03T23:32:41.7188129Z 309:          11915 instructions per context switch
2024-04-03T23:32:41.7188615Z 309:              3 voluntary context switches
2024-04-03T23:32:41.7189058Z 309:              0 direct context switches
2024-04-03T23:32:41.7189471Z 309:         100.00% voluntary switches
2024-04-03T23:32:41.7189860Z 309:           0.00% direct switches
2024-04-03T23:32:41.7190323Z 309:             16 system calls
2024-04-03T23:32:41.7190947Z 309:              5 maybe-blocking system calls
2024-04-03T23:32:41.7191400Z 309:              0 direct switch requests
2024-04-03T23:32:41.7191785Z 309:              0 waits
2024-04-03T23:32:41.7192087Z 309:         566944 idles
2024-04-03T23:32:41.7192432Z 309:           5.93% cpu busy by record count
2024-04-03T23:32:41.7192841Z 309:              0 cpu microseconds
2024-04-03T23:32:41.7193219Z 309:         242299 idle microseconds
2024-04-03T23:32:41.7193655Z 309:          58487 idle microseconds at last instr
2024-04-03T23:32:41.7194096Z 309:           0.00% cpu busy by time
2024-04-03T23:32:41.7194598Z 309:           0.00% cpu busy by time, ignoring idle past last instr
2024-04-03T23:32:41.7195074Z 309: 
2024-04-03T23:32:41.7195313Z 309:   Core #0 schedule: S_O_O_O_S
2024-04-03T23:32:41.7195653Z 309: 
2024-04-03T23:32:41.7195894Z 309:   Core #1 schedule: T_O_O_
2024-04-03T23:32:41.7196217Z 309: 
2024-04-03T23:32:41.7196457Z 309:   Core #2 schedule: O_O_O_O_
2024-04-03T23:32:41.7196788Z 309: 
2024-04-03T23:32:41.7197187Z 309:   | failed to match expected output |Estimation of pi is 3.14.*
2024-04-03T23:32:41.7197691Z 309: 
2024-04-03T23:32:41.7197947Z 309:   Trace invariant checks passed
2024-04-03T23:32:41.7198308Z 309: 
2024-04-03T23:32:41.7198566Z 309:   Output .* entries from .* entries.
2024-04-03T23:32:41.7198938Z 309: 
2024-04-03T23:32:41.7199190Z 309:   Schedule stats tool results:
2024-04-03T23:32:41.7199536Z 309: 
2024-04-03T23:32:41.7280155Z 309:   .*
2024-04-03T23:32:41.7280612Z 309: 
2024-04-03T23:32:41.7281194Z 309:   Core #0 schedule: .*
2024-04-03T23:32:41.7281607Z 309: 
2024-04-03T23:32:41.7281861Z 309:   Core #1 schedule: .*
2024-04-03T23:32:41.7282169Z 309: 
2024-04-03T23:32:41.7282397Z 309:   Core #2 schedule: .*
2024-04-03T23:32:41.7282697Z 309: 
2024-04-03T23:32:41.7282907Z 309:   |
2024-04-03T23:32:41.7283118Z 309: 
2024-04-03T23:32:41.7283319Z 309: 
2024-04-03T23:32:41.7283894Z 310/377 Test #309: code_api|tool.record_filter_bycore_multi .........................***Failed    8.49 sec
@derekbruening
Copy link
Contributor Author

This passed on a re-run. It passes every time on the non-sve machine (don't have simple direct access to the sve machine).

@abhinav92003
Copy link
Contributor

This issue happens only on sve it seems, but it is similar to a recent non-sve issue that was observed on signalNNN1 tests; where the result of some trig math floating-point ops changed and became non-deterministic after PR #6725.

Since #6725 adjusted some sve and non-sve logic too, maybe both of these are indeed related.

abhinav92003 added a commit that referenced this issue Apr 5, 2024
Fixes the slot used to save and restore FP regs at fcache enter and
return events. PR #6725 adjusted the slots used during signal handling
in core/unix/signal_linux_aarch64.c but did not adjust the same in
fcache enter/return and attach events. Prior to that PR, the FP regs
were simply stored in a contiguous manner in signal handling code and
fcache enter/return routines (instead of in their respective dr_simd_t
struct member), which was a bit confusing.

The mismatch between slot usage in signal handling and fcache
enter/return code caused failures in the signalNNN1 tests on some
systems. Verified that those tests pass with this fix.

Also fixes the same issue in save_priv_mcontext_helper which is used in
the dr_app_start API. Unit tests for this scenario will be added as part
of #6759.

Issue: #5036, #6755, #5365
Fixes #6758
@abhinav92003
Copy link
Contributor

PR #6758 may help with this issue. I haven't used tmate to verify the failure count post that PR though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants