Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pipe: Fall back to write() on vmsplice() EPERM #2294

Open
wants to merge 1 commit into
base: criu-dev
Choose a base branch
from

Conversation

ymanton
Copy link
Contributor

@ymanton ymanton commented Oct 30, 2023

vmsplice can be blocked via seccomp and currently is in Podman containers. CRIU uses vmsplice a lot on the checkpoint side, so it's easier to just run with seccomp=unconfined in that case. On the restore side I've only seen one use in restore_pipe_data being reached, so rather than forcing users to run with a custom profile or disable seccomp altogether it might be better to just fall back to calling write, which this patch does.

vmsplice can be blocked via seccomp and currently is in Podman
containers.

Signed-off-by: Younes Manton <ymanton@ca.ibm.com>
@0x7f454c46
Copy link
Member

Hm, this seems like a workaround for a specific environment setup, rather than for a kernel code.
Unsure if this should be hardcoded like this. Maybe an option in criu.config?
This will be a policy/config-based decision and in result CRIU won't try hard on vmsplice() potentially causing notifications and issuing syscalls that are expected to fail, decreasing the performance as well.

@avagin
Copy link
Member

avagin commented Oct 31, 2023

@ymanton I look at the list [1] and can't firgure out why they decide to block vmsplice. It looks like a mistake.
[1] https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json#L84

@avagin
Copy link
Member

avagin commented Oct 31, 2023

containers/common@7ced5da

@ymanton
Copy link
Contributor Author

ymanton commented Nov 3, 2023

containers/common@7ced5da

Thanks.

The discussion on lkml is very long and deep in kernel internals that I don't understand, but from what I can gather there is some worry that vmsplice can be abused, and I guess the Podman folks have blocked it to be on the safe side.

https://lore.kernel.org/linux-mm/X+Kxy3oBMSLz8Eaq@redhat.com/:

The problem is we have an unprivileged long term GUP like vmsplice
that facilitates elevating the page count indefinitely, until the
parent finally writes a secret to it. Theoretically a short term pin
would do it too so it's not just vmpslice, but the short term pin
would be incredibly more challenging to become a concern since it'd
kill a phone battery and flash before it can read any data.

So what happens with your page_mapcount == 1 check is that it doesn't
mean non-COW (we thought it did until it didn't for the long term gup
pin in vmsplice).

Jann's testcases does fork() and set page_mapcount 2 and page_count to
2, vmsplice, take unprivileged infinitely long GUP pin to set
page_count to 3, queue the page in the pipe with page_count elevated,
munmap to drop page_count to 2 and page_mapcount to 1.

page_mapcount is 1, so you'd think the page is non-COW and owned by
the parent, but the child can still read it so it's very much still
wp_page_copy material if the parent tries to modify it. Otherwise the
child can read the content.

https://lore.kernel.org/linux-mm/X+PoXCizo392PBX7@redhat.com/:

I already told Jens, is io_uring could use mmu notifier already (that
would make any io_uring GUP pin not count anymore with regard to
page_mapcount vs page_count difference) and vmsplice also needs some
handling or maybe become privileged.

So unless you guys think otherwise I doubt we will convince the Podman folks to unblock it any time soon, and some others might follow in their footsteps.

I don't mind making it a config option. Do we want an option that avoids all uses of vmsplice on both checkpoint and restore? Or should we consider just getting rid of it altogether?

@Snorch
Copy link
Member

Snorch commented Nov 6, 2023

Can we in CRIU temporary disable this seccomp rule while restoring (e.g. only for restored container) or even restore seccomp rules later after vmsplicing everything we need to?

@avagin
Copy link
Member

avagin commented Nov 6, 2023

@Snorch I think it is about unprivileged C/R. In this case, we can't suspend seccomp.

Copy link

github-actions bot commented Dec 7, 2023

A friendly reminder that this PR had no activity for 30 days.

@avagin avagin added no-auto-close Don't auto-close as a stale issue and removed stale-pr labels Jan 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-auto-close Don't auto-close as a stale issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants