Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client.detach_test failing: kill cannot find target process #6536

Open
derekbruening opened this issue Jan 2, 2024 · 34 comments · May be fixed by #6579
Open

client.detach_test failing: kill cannot find target process #6536

derekbruening opened this issue Jan 2, 2024 · 34 comments · May be fixed by #6579

Comments

@derekbruening
Copy link
Contributor

The Linux client.detach_test was just added in PR #6513 and it is failing (non-deterministically) on x86-64.

https://github.com/DynamoRIO/dynamorio/actions/runs/7381905578/job/20080993968?pr=6531

      Test #185: code_api|client.detach_test ......................................***Failed  Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
]  0.26 sec
CMake Error at /home/runner/work/dynamorio/dynamorio/suite/tests/runall.cmake:130 (message):
  *** kill failed (1): kill: (24600): No such process
  ***
Call Stack (most recent call first):
  /home/runner/work/dynamorio/dynamorio/suite/tests/runall.cmake:317 (kill_background_process)
@derekbruening
Copy link
Contributor Author

Xref client.attach_test failing with a similar error; the frequency of its failures may have increased with PR #6513? Maybe not enough data to say for that but it failed 2 out of 4 times in recent PR's.

@derekbruening
Copy link
Contributor Author

@onroadmuwl hoping you can help eliminate the flakiness in this new test that you added

@onroadmuwl
Copy link
Contributor

onroadmuwl commented Jan 3, 2024

@onroadmuwl hoping you can help eliminate the flakiness in this new test that you added

I'm sorry to hear that. I'm not able to reproduce this problem in my local testing. And I haven't seen similar problems when I submitted code for previous PR. I guess there may be a conflict with the newly submitted code. I'm trying to debug the test suite step by step in a new PR #6537. If I delete the test code for Linux's detachment, the attach test works well. So I think there may be some problems in the detach test instead of real detach code's problem.

@derekbruening
Copy link
Contributor Author

The attach test was failing every once in a while with the same output: #6452. So it may be the same underlying pre-existing issue that is just more frequent in the detach test.

From the output it looks like the attach never happened b/c it never prints "thank you for testing..."?

@derekbruening
Copy link
Contributor Author

@onroadmuwl did you try running many times inside the ctest framework to see if it reproduces non-deterministically on your local machine?
E.g.:

$ ctest --repeat-until-fail 500 -R client.detach
Test project /usr/local/google/home/bruening/dr/git/build_x64_dbg_tests
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.35 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......***Failed  Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
]  0.44 sec

0% tests passed, 1 tests failed out of 1

@onroadmuwl
Copy link
Contributor

The attach is confirmed by a variable saw_attach_event because of the uncertain order of thread init and attach. And we never wait for the output of thank you for testing... inside the runall.cmake file.

dr_exit(void)
{
    if (!saw_attach_event)
        dr_fprintf(STDERR, "Error: never saw attach event!\n");
#ifdef WINDOWS
    dr_fprintf(STDERR, "done\n");
#else
        /* The app prints 'done' for us. */
#endif
}

Is it possible to solve this problem by letting the detach test execute by launching instead of attaching?

@onroadmuwl
Copy link
Contributor

onroadmuwl commented Jan 3, 2024

@onroadmuwl did you try running many times inside the ctest framework to see if it reproduces non-deterministically on your local machine? E.g.:

$ ctest --repeat-until-fail 500 -R client.detach
Test project /usr/local/google/home/bruening/dr/git/build_x64_dbg_tests
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.35 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......   Passed    0.34 sec
    Start 185: code_api|client.detach_test
    Test #185: code_api|client.detach_test ......***Failed  Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
]  0.44 sec

0% tests passed, 1 tests failed out of 1

Thank you, this method is quite useful. But it seems that this error output in local testing isn't equal to the output of github's test suite.

177/468 Testing: code_api|client.detach_test
177/468 Test: code_api|client.detach_test
Command: "/usr/bin/cmake" "-D" "toolbindir=/home/wlmu/dynamorio/build/bin64" "-D" "cmd=/home/wlmu/dynamorio/build/bin64/run_in_bg@-out@/home/wlmu/dynamorio/build/suite/tests/client.detach_test-out@/home/wlmu/dynamorio/build/bin64/drrun@-no_inject@--@/home/wlmu/dynamorio/build/suite/tests/bin/linux.infloop@-v" "-D" "out=/home/wlmu/dynamorio/build/suite/tests/client.detach_test-out" "-D" "nudge=<detach>@-quiet@-stderr_mask@0xC@-dumpcore_mask@0@-code_api@-c@/home/wlmu/dynamorio/build/suite/tests/bin/libclient.detach_test.dll.so" "-D" "clear=" "-D" "pidfile=" "-P" "/home/wlmu/dynamorio/suite/tests/runall.cmake"
Directory: /home/wlmu/dynamorio/build/suite/tests
"code_api|client.detach_test" start time: Jan 03 23:23 CST
Output:
----------------------------------------------------------
<end of output>
Test time = 1500.13 sec
----------------------------------------------------------
Test Fail Reason:
Required regular expression not found. Regex=[^starting
thank you for testing detach
thread init
detach
done
$
]
"code_api|client.detach_test" end time: Jan 03 23:48 CST
"code_api|client.detach_test" time elapsed: 00:25:00
----------------------------------------------------------

End testing: Jan 03 23:48 CST

@derekbruening
Copy link
Contributor Author

Your local run times out (1500s == 25:00). BTW that is a long timeout: maybe it should be shortened (xref #6127). It is still definitely worth debugging and fixing that local timeout -- aarch64 sees a timeout (#5740). Plus the fix may end up affecting the GA behavior too.

@onroadmuwl
Copy link
Contributor

After completing next week's exams, I will attempt to address the above questions.

@derekbruening
Copy link
Contributor Author

Sure, thank you for your efforts.

@onroadmuwl
Copy link
Contributor

It seems that the flakiness of attach and detach has disappeared inexplicably in PR 6537. In the current code, the done string is printed in infloop.c. With reference to Windows, I once thought that we can reduce attach's flakiness by print done in the attach_test.dll.c. The previous tests follow this rule. But we can't get such a conclusion in recent tests.

@derekbruening
Copy link
Contributor Author

It seems that the flakiness of attach and detach has disappeared inexplicably in #6537.

You mean you can't reproduce the failures on the Github Actions (GA) VM's? I thought you were able to reproduce locally by running --repeat-until-fail or just a shell loop?

@onroadmuwl
Copy link
Contributor

You mean you can't reproduce the failures on the Github Actions (GA) VM's? I thought you were able to reproduce locally by running --repeat-until-fail or just a shell loop?
The flakiness is caused by kill cannot find target process, I can't reproduce the same error on my machine.

@derekbruening
Copy link
Contributor Author

Hmm, it did repro on my machine every so often as shown at #6536 (comment). Not sure when I will have time to try to debug it though.

@onroadmuwl
Copy link
Contributor

onroadmuwl commented Jan 9, 2024

The error of my local testing is 120 - code_api|client.detach_test (Timeout), I can fix my error through the newest code in PR 6537, which modifies the position of printing "done\n". But it can't always fix the errors of Github Actions (GA) VM. You can try to see if this modification works on your machine.

    Start 120: code_api|client.detach_test
    Test #120: code_api|client.detach_test ......   Passed    0.25 sec
    Start 120: code_api|client.detach_test
    Test #120: code_api|client.detach_test ......   Passed    0.25 sec
    Start 120: code_api|client.detach_test
    Test #120: code_api|client.detach_test ......   Passed    0.25 sec
     ......
    Start 120: code_api|client.detach_test
    Test #120: code_api|client.detach_test ......***Timeout 1500.16 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) = 1520.32 sec

The following tests FAILED:
        120 - code_api|client.detach_test (Timeout)
Errors while running CTest

@derekbruening
Copy link
Contributor Author

Does your timeout fix solve the timeouts that we see in these tests in the ci-aarch64-native run on Github Actions? If so, or even if not, it is worth checking in any improvement.

@derekbruening
Copy link
Contributor Author

derekbruening commented Jan 12, 2024

The timeouts turn out to be just missing ptrace privs: #6558

@onroadmuwl
Copy link
Contributor

I have tested again in PR 6537, the detach test can be passed smoothly now.
Would you like to tell me why the detach test was passed sometimes without setting ptrace_scope as 0 ? I'm curious about how the test did it.

@derekbruening
Copy link
Contributor Author

I have tested again in PR 6537, the detach test can be passed smoothly now. Would you like to tell me why the detach test was passed sometimes without setting ptrace_scope as 0 ? I'm curious about how the test did it.

I think it failed every single time on the aarch64 machine due to the ptrace_scope.

@derekbruening
Copy link
Contributor Author

I have tested again in #6537, the detach test can be passed smoothly now.

Oh you mean you figured out the error about not finding the process to kill and having no output?

@derekbruening
Copy link
Contributor Author

I have tested again in #6537, the detach test can be passed smoothly now.

Oh you mean you figured out the error about not finding the process to kill and having no output?

Please change #6537 from Draft and request a review.

@onroadmuwl
Copy link
Contributor

All right,I haven‘t find the solution of this problem. I misunderstand that the flakiness on X64 is also caused by ptrace privilege.

@derekbruening
Copy link
Contributor Author

This reproduces in release build on a local machine.

Success:

183: Running |dr/git/build_x64_rel_tests/bin64/run_in_bg;-out;dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out;dr/git/build_x64_rel_tests/bin64/drrun;-no_inject;--;dr/git/build_x64_rel_tests/suite/tests/bin/linux.infloop;-v|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out| contents
183: Running |run_in_bg dr/git/build_x64_rel_tests/bin64/drrun;-attach;3620449;-takeover_sleep;-takeovers;1000;-quiet;-stderr_mask;0xC;-dumpcore_mask;0;-code_api;-c;dr/git/build_x64_rel_tests/suite/tests/bin/libclient.detach_test.dll.so|
183: Waiting for thread init in |starting
183: |
183: Running |drconfig -detach 3620449|
183: Waiting for detach in |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: Waiting for done in |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: starting
183: thank you for testing detach
183: thread init
183: detach
183: done

Failure:

183: Running |dr/git/build_x64_rel_tests/bin64/run_in_bg;-out;dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out;dr/git/build_x64_rel_tests/bin64/drrun;-no_inject;--;dr/git/build_x64_rel_tests/suite/tests/bin/linux.infloop;-v|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: Waiting for |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out| contents
183: Running |run_in_bg dr/git/build_x64_rel_tests/bin64/drrun;-attach;3616219;-takeover_sleep;-takeovers;1000;-quiet;-stderr_mask;0xC;-dumpcore_mask;0;-code_api;-c;dr/git/build_x64_rel_tests/suite/tests/bin/libclient.detach_test.dll.so|
183: Waiting for thread init in |starting
183: |
183: Running |drconfig -detach 3616219|
183: Waiting for detach in |dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out|
183: CMake Error at dr/git/src/suite/tests/runall.cmake:131 (message):
183:   *** kill failed (1): /usr/bin/kill: (3616219): No such process

The file:

$ cat dr/git/build_x64_rel_tests/suite/tests/client.detach_test-out
starting
thank you for testing detach
thread init
detach

@derekbruening
Copy link
Contributor Author

That kill that failed I believe is sending SIGTERM which is the expected method for the infloop app to exit. So I would assume that the detach crashed the process (after printing "detach") and so it's gone by the time the SIGTERM is sent.

@onroadmuwl
Copy link
Contributor

If the detach crashed the process on X86-64 platform sometimes , it may happen at the stage of setting nudged thread's context. On the X86-64 platform, I didn't use the methods of sigreturn before due to the lack of fpstate information inside dcontext. But the fpstate information can be also obtained by the sc(sigcontext). So I create a new pr in #6579 to detach the process by sigreturn on X86-64 platform. It works well on my local machine(repeat-until-fail). Could you try to see if it works on your machine stably?

@derekbruening
Copy link
Contributor Author

It still fails locally and on the x86-64 job (after a re-run; passed 1st time): #6579 (comment)
You can test it by just re-running the x86-64 job as it seems to fail at least half the time.

We are going to have to mark this test as flaky to get our suite green. Please continue working on it and when the detach feature is fully working we can un-mark the test.

@onroadmuwl
Copy link
Contributor

I'll work on it and submit the related code if feasible solutions are found.

@derekbruening
Copy link
Contributor Author

So you still cannot reproduce on any of your local machines, and you couldn't reproduce under tmate on the Github Actions runners here? It is failing every other time so you'd think it would reproduce under tmate on the Actions runners.

xdje42 added a commit that referenced this issue Jan 25, 2024
Adds two tests to the unix ignore list:

i#6452 code_api|client.attach_test
i#6536 code_api|client.detach_test

Issue: #6452, #6536
xdje42 added a commit that referenced this issue Jan 25, 2024
#6588)

Adds two tests to the x86-64 unix ignore list:

i#6452 code_api|client.attach_test
i#6536 code_api|client.detach_test

Issue: #6452, #6536
@onroadmuwl
Copy link
Contributor

I have reproduced the failure result of timeout on another new x86-64 machine. And I fixed this error by printing the done in detach_test.dll.c instead of infloop.c. It works well on the new machine now. I re-ran the GA action more than five time and the failure of detach_test hasn't appeared (PR #6579). Could you please try to test it on your machine again?

By the way, all of my modifications are based on the previous version of DynamoRIO. There will be the following error when I try to build the newest DynamoRIO. Would you like to tell me why this error occurred if you ever met it?

-- Using elfutils
CMake Error at ext/drsyms/CMakeLists.txt:176 (message):
  Failed to find single source for
  /home/wlmu/dynamorio-detach_x86/ext/drsyms/elfutils/dwarf_begin_elf.c.patch


-- Configuring incomplete, errors occurred!
See also "/home/wlmu/dynamorio-detach_x86/build/CMakeFiles/CMakeOutput.log".
See also "/home/wlmu/dynamorio-detach_x86/build/CMakeFiles/CMakeError.log".

@derekbruening
Copy link
Contributor Author

Did you initialize the elfutils submodule? I.e., is the directory empty? It needs to be initialized.

@derekbruening
Copy link
Contributor Author

And I fixed this error by printing the done in detach_test.dll.c instead of infloop.c

But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach?

@onroadmuwl
Copy link
Contributor

But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach?

In the 317 line of suite/tests/runall.cmake file, there is a "kill_background_process(OFF)" command. This command is executed after detaching and before printing done. If the program is crashed after detaching, there will print an error like "kill cannot find target process". So it can ensure that the program isn't crashed. And this modification refers to the detachment test on Windows platform.

@onroadmuwl
Copy link
Contributor

Did you initialize the elfutils submodule? I.e., is the directory empty? It needs to be initialized.

It works for me, thanks.

@onroadmuwl
Copy link
Contributor

But wouldn't this method fail to catch a post-detach crash? How would we know the application survived past the detach?

In the 317 line of suite/tests/runall.cmake file, there is a "kill_background_process(OFF)" command. This command is executed after detaching and before printing done. If the program is crashed after detaching, there will print an error like "kill cannot find target process". So it can ensure that the program isn't crashed. And this modification refers to the detachment test on Windows platform.

It's also unstable sometimes. https://github.com/DynamoRIO/dynamorio/actions/runs/7698709453/job/20978621263?pr=6579

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants