Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redesign syscallbuf to always unwind on interruption #3322

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

Keno
Copy link
Member

@Keno Keno commented Jul 8, 2022

This is a major redesign of the syscallbuf code with the goal of
establishing the invariant that we never switch away from a tracee
while it's in the syscallbuf code. Instead, we unwind the syscallbuf
code completely and execute the syscall at a special syscall instruction
now placed in the extended jump patch.

The primary motivation for this that this fixes #3285, but I think
the change is overall very beneficial. We have significant complexity
in the recorder to deal with the possibility of interrupting the tracee
during the syscallbuf. This commit does not yet remove this complexity
(the change is already very big), but that should be easy to do as a
follow up. Additionally, we used to be unable to perform syscall buffering
for syscalls performed inside a signal handler that interrupted a
syscall. This had performance implications on use cases like stack walkers,
which often perform multiple memory-probing system calls for every frame to
deal with the possibility of invalid unwind info.

There are many details here, but here's a high level overview. The layout of
the new extended jump patch is:

<stack setup>
call <syscallbuf_hook>
// Bail path returns here
<stack restore>
syscall
<code from the original patch site>
// Non-bail path returns here.
jmp return_addr

One detail worth mentioning is what happens if a signal gets delivered once
the tracee is out of the syscallbuf, but still in the extended jump patch
(i.e. after the stack restore). In this case, rr will rewrite the ip of the
signal frame to point to the equivalent ip in the original, now patched code
section. Of course the instructions in question are no longer there, but
the CFI will nevertheless be generally accurate for the current register state
(excluding weird CFI that explicitly references the ip of course).
This allows unwinders in the end-user-application to never have to unwind
through any frame in the rr syscallbuf, which seems like a desirable
property. Of course, sigreturn must perform the opposite transformation
to avoid actually returning into a patched-out location.

The main drawback of this scheme is that while the application will never
see a location without CFI, GDB does still lack unwind information in
the extended jump stub. This is not a new problem, but syscall events
are now in the extended jump stub, so they come up quite frequently.

I don't think this is a huge problem - it's basically the same situation
we used to have before the vdso changes. I believe the best way to fix
this would be to establish some way of having rr inform gdb of its
jump patches (in fact gdb already has this kind of mechanism for tracepoints,
it's just not exposed for tracepoints initiated by the gdb server),
but I don't intend to do this myself anytime in the near future.
That said, I should note that doing this would not require any changes
on the record side, so could be done anytime and start working retroactively
for already recorded traces.

@Keno
Copy link
Member Author

Keno commented Jul 8, 2022

Note that this is just the x86_64/i686 version of this. Aarch64 merged while I was working on this, so that's not included yet, but should be reasonably straightforward. There's also a little bit of cleanup still to do, but this should work 100% otherwise, so if there's any concerns about the details, we can discuss them now.

@Keno Keno force-pushed the kf/syscallbufredesign branch 3 times, most recently from 562ca4c to 7eb33ab Compare July 9, 2022 09:38
@Keno
Copy link
Member Author

Keno commented Jul 9, 2022

Alright, I think this is working now everywhere. @rocallahan @yuyichao, please take a look. As I said, there's a fair bit of additional code deletion that could be done as a result of this, but I think we can do that in a follow up PR.


// If the function requested the bail path, rewrite the return address
ldr x0, [sp, 688]
add x0, x0, 8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we’d better not be using return address signing here…

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this is overwriting the parent frame right? Might be good to add a comment below around the stack frame layout to document this assumption.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we’d better not be using return address signing here…

Yeah, that'd probably need some other scheme

And this is overwriting the parent frame right? Might be good to add a comment below around the stack frame layout to document this assumption.

Yes, it's overwriting the return address in the parent frame.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that'd probably need some other scheme

I think just unsign and resign the printer would suffice. It was just funny that this is basically exactly a ROP attack on the caller...

* b .Lreturn
.Lbail:
* ldp x15, x30, [x15]
** // Safe suffix starts here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe update the comment below since it’s not just used for the test anymore. It might make real syscall.

return remote_code_ptr(bp.as_int());
patched_syscall *ps = &syscall_stub_list[it->second];
auto bp = it->first + ps->size - ps->safe_suffix;
if (pp == bp - 4 || pp == bp - 8) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I assume we won’t be expecting the exit breakpoint when we are in the bail path? (Otherwise this misses the bail path return)
  2. Is the breakpoint address returned here correct? It seems that it is pointing to the data part? (Safe suffix is larger than 8. I assume you meant to use the last instruction which should be - 12 due to the return address I stuffed at the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The test failure that required this happens in cont_race about 1-3% of the time iirc.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - I wasn't entirely sure this breakpoint is actually still required, so I'll need to look into that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breakpoint is needed if we got a signal when we are in the return path of the stub and decided to defer it until the syscallbuf business is done. This would not be needed if we are grantee to not get a signal (comes to think about it I don't think we can grantee that for external signals) or if we just (manually or actually) single step the instructions out. It seems that we may need this on x86 as well now? Or at least a safe suffix so that we can deliver signal when we are back in the stub.

@rocallahan
Copy link
Collaborator

I think probably we should make a new rr release before we land this ... I expect this will destabilize things for a while.

@Keno
Copy link
Member Author

Keno commented Jul 12, 2022

I think probably we should make a new rr release before we land this ... I expect this will destabilize things for a while.

That's probably a good idea.

Keno added 3 commits July 17, 2022 02:55
This is a major redesign of the syscallbuf code with the goal of
establishing the invariant that we never switch away from a tracee
while it's in the syscallbuf code. Instead, we unwind the syscallbuf
code completely and execute the syscall at a special syscall instruction
now placed in the extended jump patch.

The primary motivation for this that this fixes #3285, but I think
the change is overall very beneficial. We have significant complexity
in the recorder to deal with the possibility of interrupting the tracee
during the syscallbuf. This commit does not yet remove this complexity
(the change is already very big), but that should be easy to do as a
follow up. Additionally, we used to be unable to perform syscall buffering
for syscalls performed inside a signal handler that interrupted a
syscall. This had performance implications on use cases like stack walkers,
which often perform multiple memory-probing system calls for every frame to
deal with the possibility of invalid unwind info.

There are many details here, but here's a high level overview. The layout of
the new extended jump patch is:

```
<stack setup>
call <syscallbuf_hook>
// Bail path returns here
<stack restore>
syscall
<code from the original patch site>
// Non-bail path returns here.
jmp return_addr
```

One detail worth mentioning is what happens if a signal gets delivered once
the tracee is out of the syscallbuf, but still in the extended jump patch
(i.e. after the stack restore). In this case, rr will rewrite the ip of the
signal frame to point to the equivalent ip in the original, now patched code
section. Of course the instructions in question are no longer there, but
the CFI will nevertheless be generally accurate for the current register state
(excluding weird CFI that explicitly references the ip of course).
This allows unwinders in the end-user-application to never have to unwind
through any frame in the rr syscallbuf, which seems like a desirable
property. Of course, `sigreturn` must perform the opposite transformation
to avoid actually returning into a patched-out location.

The main drawback of this scheme is that while the application will never
see a location without CFI, GDB does still lack unwind information in
the extended jump stub. This is not a new problem, but syscall events
are now in the extended jump stub, so they come up quite frequently.

I don't think this is a huge problem - it's basically the same situation
we used to have before the vdso changes. I believe the best way to fix
this would be to establish some way of having rr inform gdb of its
jump patches (in fact gdb already has this kind of mechanism for tracepoints,
it's just not exposed for tracepoints initiated by the gdb server),
but I don't intend to do this myself anytime in the near future.
That said, I should note that doing this would not require any changes
on the record side, so could be done anytime and start working retroactively
for already recorded traces.
This adds a test case to model #3285, where the test case pokes the sigframe
to force sigreturn to switch to a different function than that which incurred
the signal.
@Keno
Copy link
Member Author

Keno commented Jul 17, 2022

Rebased. Shall we go ahead with the release on master then? FWIW, I've used this a bit so far and have so far not found any new issues that came down to this change, but still an intermediate release is probably prudent.

@rocallahan
Copy link
Collaborator

Let's move forward with this! Can you rebase it? I promise I'll review it :-)

@cebtenzzre
Copy link

cebtenzzre commented Sep 29, 2023

I'm still interested in this change. I ran into "Expected syscall_bp_vm to be clear" today while debugging a python process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replay failed with "Expected syscall_bp_vm to be clear"
4 participants