Redesign syscallbuf to always unwind on interruption #3322

Keno · 2022-07-08T07:45:24Z

This is a major redesign of the syscallbuf code with the goal of
establishing the invariant that we never switch away from a tracee
while it's in the syscallbuf code. Instead, we unwind the syscallbuf
code completely and execute the syscall at a special syscall instruction
now placed in the extended jump patch.

The primary motivation for this is that it fixes #3285, but I think
the change is overall very beneficial. We have significant complexity
in the recorder to deal with the possibility of interrupting the tracee
during the syscallbuf. This commit does not yet remove this complexity
(the change is already very big), but that should be easy to do as a
follow up. Additionally, we used to be unable to perform syscall buffering
for syscalls performed inside a signal handler that interrupted a
syscall. This had performance implications on use cases like stack walkers,
which often perform multiple memory-probing system calls for every frame to
deal with the possibility of invalid unwind info.

There are many details here, but here's a high level overview. The layout of
the new extended jump patch is:

<stack setup>
call <syscallbuf_hook>
// Bail path returns here
<stack restore>
syscall
<code from the original patch site>
// Non-bail path returns here.
jmp return_addr
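
To make the two return points concrete, here is a minimal C++ model of the control flow, read literally off the layout above. It is only a sketch under assumptions: `HookResult` and `run_extended_jump_patch` are hypothetical names that do not exist in rr, and the real stub is machine code, not a C++ function.

```cpp
#include <string>
#include <vector>

// Hypothetical model: which return point the hook selects.
enum class HookResult { Buffered, Bail };

// Returns the sequence of stub steps executed for a given hook result.
// The step names are copied from the layout in the description.
std::vector<std::string> run_extended_jump_patch(HookResult hook_result) {
  std::vector<std::string> trace;
  trace.push_back("stack setup");
  trace.push_back("call syscallbuf_hook");
  if (hook_result == HookResult::Bail) {
    // Bail path: the hook returns right after the call, so the stub
    // restores the stack and runs the syscall outside the syscallbuf code.
    trace.push_back("stack restore");
    trace.push_back("syscall");
    trace.push_back("code from the original patch site");
  }
  // Non-bail path: the hook returns here, skipping the block above.
  trace.push_back("jmp return_addr");
  return trace;
}
```

In this model the bail path visits six steps and the non-bail path three; the only point is that the `syscall` instruction in the stub is reached solely on the bail path.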

One detail worth mentioning is what happens if a signal gets delivered once
the tracee is out of the syscallbuf, but still in the extended jump patch
(i.e. after the stack restore). In this case, rr will rewrite the ip of the
signal frame to point to the equivalent ip in the original, now patched code
section. Of course the instructions in question are no longer there, but
the CFI will nevertheless be generally accurate for the current register state
(excluding weird CFI that explicitly references the ip of course).
This allows unwinders in the end-user-application to never have to unwind
through any frame in the rr syscallbuf, which seems like a desirable
property. Of course, sigreturn must perform the opposite transformation
to avoid actually returning into a patched-out location.
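
The ip rewriting described above amounts to a pair of inverse address translations. The following is a rough C++ sketch of that idea, not rr's implementation: `PatchInfo`, its fields, and both function names are assumptions made for illustration, and the translated region is assumed to be the contiguous span of the stub that mirrors the original instructions.

```cpp
#include <cstdint>

// Hypothetical record for one patched site (not an rr data structure).
struct PatchInfo {
  uint64_t orig_code_start;  // start of the original, now patched, instructions
  uint64_t stub_code_start;  // start of the mirrored span inside the stub
  uint64_t code_len;         // length in bytes of the mirrored span
};

// Signal delivery: translate an ip inside the stub to the equivalent
// original ip. Returns false if the ip is outside the mirrored span.
bool stub_ip_to_orig_ip(const PatchInfo& p, uint64_t ip, uint64_t* out) {
  if (ip < p.stub_code_start || ip >= p.stub_code_start + p.code_len) {
    return false;
  }
  *out = p.orig_code_start + (ip - p.stub_code_start);
  return true;
}

// sigreturn: the opposite translation, so that execution resumes in the
// stub rather than in the patched-out location.
bool orig_ip_to_stub_ip(const PatchInfo& p, uint64_t ip, uint64_t* out) {
  if (ip < p.orig_code_start || ip >= p.orig_code_start + p.code_len) {
    return false;
  }
  *out = p.stub_code_start + (ip - p.orig_code_start);
  return true;
}
```

Applying the two translations in sequence returns the starting ip, which is the property the sigreturn handling relies on.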

The main drawback of this scheme is that while the application will never
see a location without CFI, GDB does still lack unwind information in
the extended jump stub. This is not a new problem, but syscall events
are now in the extended jump stub, so they come up quite frequently.

I don't think this is a huge problem - it's basically the same situation
we used to have before the vdso changes. I believe the best way to fix
this would be to establish some way of having rr inform gdb of its
jump patches (in fact gdb already has this kind of mechanism for tracepoints,
it's just not exposed for tracepoints initiated by the gdb server),
but I don't intend to do this myself anytime in the near future.
That said, I should note that doing this would not require any changes
on the record side, so could be done anytime and start working retroactively
for already recorded traces.

Keno · 2022-07-08T07:47:37Z

Note that this is just the x86_64/i686 version of this. Aarch64 merged while I was working on this, so that's not included yet, but should be reasonably straightforward. There's also a little bit of cleanup still to do, but this should work 100% otherwise, so if there are any concerns about the details, we can discuss them now.

Keno · 2022-07-09T09:39:22Z

Alright, I think this is working now everywhere. @rocallahan @yuyichao, please take a look. As I said, there's a fair bit of additional code deletion that could be done as a result of this, but I think we can do that in a follow up PR.

yuyichao · 2022-07-11T14:53:47Z

src/preload/syscall_hook.S

+
+        // If the function requested the bail path, rewrite the return address
+        ldr x0, [sp, 688]
+        add x0, x0, 8


I guess we’d better not be using return address signing here…

And this is overwriting the parent frame right? Might be good to add a comment below around the stack frame layout to document this assumption.

> I guess we’d better not be using return address signing here…

Yeah, that'd probably need some other scheme.

> And this is overwriting the parent frame right? Might be good to add a comment below around the stack frame layout to document this assumption.

Yes, it's overwriting the return address in the parent frame.

> Yeah, that'd probably need some other scheme

I think just unsigning and re-signing the pointer would suffice. It was just funny that this is basically exactly a ROP attack on the caller...

yuyichao · 2022-07-11T14:58:15Z

src/Monkeypatcher.cc

+ *    b       .Lreturn
+.Lbail:
+ *    ldp     x15, x30, [x15]
+ **   // Safe suffix starts here


Maybe update the comment below since it’s not just used for the test anymore. It might make a real syscall.

yuyichao · 2022-07-11T15:07:02Z

src/Monkeypatcher.cc

-    return remote_code_ptr(bp.as_int());
+  patched_syscall *ps = &syscall_stub_list[it->second];
+  auto bp = it->first + ps->size - ps->safe_suffix;
+  if (pp == bp - 4 || pp == bp - 8) {


I assume we won’t be expecting the exit breakpoint when we are in the bail path? (Otherwise this misses the bail path return)

Is the breakpoint address returned here correct? It seems to be pointing to the data part. (The safe suffix is larger than 8. I assume you meant to use the last instruction, which should be at - 12 due to the return address I stuffed at the end.)

(The test failure that required this happens in cont_race about 1-3% of the time iirc.)

Ok - I wasn't entirely sure this breakpoint is actually still required, so I'll need to look into that.

This breakpoint is needed if we get a signal while we are in the return path of the stub and decide to defer it until the syscallbuf business is done. It would not be needed if we were guaranteed not to get a signal (come to think of it, I don't think we can guarantee that for external signals) or if we just (manually or actually) single-stepped the instructions out. It seems that we may need this on x86 as well now? Or at least a safe suffix so that we can deliver the signal when we are back in the stub.

rocallahan · 2022-07-12T05:53:41Z

I think probably we should make a new rr release before we land this ... I expect this will destabilize things for a while.

Keno · 2022-07-12T19:23:22Z

> I think probably we should make a new rr release before we land this ... I expect this will destabilize things for a while.

That's probably a good idea.

Keno · 2022-07-17T02:57:31Z

Rebased. Shall we go ahead with the release on master then? FWIW, I've used this a bit and have so far not found any new issues that came down to this change, but an intermediate release is still probably prudent.

rocallahan · 2022-09-24T03:21:17Z

Let's move forward with this! Can you rebase it? I promise I'll review it :-)

cebtenzzre · 2023-09-29T14:58:11Z

I'm still interested in this change. I ran into "Expected syscall_bp_vm to be clear" today while debugging a python process.

Keno force-pushed the kf/syscallbufredesign branch 3 times, most recently from 562ca4c to 7eb33ab on July 9, 2022 09:38

yuyichao reviewed Jul 11, 2022

Keno added 3 commits July 17, 2022 02:55

Add test case for switching out of desched syscall

618715c

This adds a test case to model #3285, where the test case pokes the sigframe to force sigreturn to switch to a different function than that which incurred the signal.

Allow deterministic signals during syscallbuf, but give a loud error

df8d55c

Keno force-pushed the kf/syscallbufredesign branch from 7eb33ab to d332b52 on July 17, 2022 02:55

Address Yichao's review

d6dc859

Keno force-pushed the kf/syscallbufredesign branch from d332b52 to d6dc859 on July 17, 2022 03:51
