[WIP] Make /proc/sys read-only with carve-outs for some sysctls #3518

Open
wants to merge 1 commit into
base: main

Contributor

@dgl dgl commented Feb 13, 2024

As mentioned on #3511 this could be a more complete way to ensure systemd or other components don't change sysctls unexpectedly. This also makes sysfs mountable per #3436 (but that is just the mount of sysfs on /kind/private/sys, so can easily be split, aside from any naming preferences).

This is WIP because I'm not sure it's the best option, but it is possibly better than fragile breakage caused by unexpected sysctl changes.

The downside is that it needs an allow list of sysctls, which will probably need additions for other use cases, but it does mean kind can be explicit about what is supported.

The workaround to add a sysctl as writable would be:

docker exec a-node mount --rbind /kind/private/proc/sys/some-sysctl /proc/sys/some-sysctl
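
For a sysctl given in dotted form, the path used in the workaround can be derived mechanically. The following is a minimal sketch, assuming a node container named a-node as in the command above. `sysctl_path` and `workaround_cmd` are hypothetical helper names, not part of kind, and the sketch only prints the command rather than running it.

```shell
# Hypothetical helpers (not part of kind): build the workaround command
# for a sysctl given in dotted form, e.g. net.ipv4.ip_forward.

# Convert dots to slashes. Assumes no path component contains a literal
# dot (interface names such as eth0.100 would need special handling).
sysctl_path() {
  printf '%s\n' "$1" | tr '.' '/'
}

# Print (do not run) the docker exec command for node $1 and sysctl $2.
workaround_cmd() {
  rel="$(sysctl_path "$2")"
  printf 'docker exec %s mount --rbind /kind/private/proc/sys/%s /proc/sys/%s\n' \
    "$1" "$rel" "$rel"
}

workaround_cmd a-node net.ipv4.ip_forward
```

Printing the command first makes it easy to review what would be bind mounted before touching a running node.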

(This doesn't yet support running in some userns configurations, but fixing that should just be a matter of ignoring the error if mount fails; whether the mount works depends on the exact userns environment. In a user namespace the host's sysctls can't be modified anyway. I can test userns cases if this option is worth taking further.)
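
Ignoring the mount error could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `try_bind_rw` is a hypothetical function name, and `MOUNT_CMD` is a hypothetical override that lets the logic be exercised without root (it defaults to the real mount command).

```shell
# Sketch: make one sysctl writable, but never fail the caller if mount fails.
# MOUNT_CMD is a hypothetical override so the logic can be exercised without
# root; it defaults to the real mount(8).
try_bind_rw() {
  if "${MOUNT_CMD:-mount}" --bind -o rw "$1" "$2" 2>/dev/null; then
    echo "writable: $2"
  else
    # Some user namespace setups deny the mount; log and carry on.
    echo "skipped (mount failed): $2" >&2
  fi
  return 0
}

# Dry run with a stand-in that always fails, as a denied mount would.
MOUNT_CMD=false try_bind_rw /kind/private/proc/sys/some-sysctl /proc/sys/some-sysctl
```

Returning 0 unconditionally keeps a denied mount from aborting the rest of the node setup, at the cost of leaving that sysctl read-only.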

This mounts a read-write version of /proc and /sys under /kind/private,
which allows bind mounting and also makes use cases that need an
unmasked proc or sys possible.

/proc/sys is bind mounted read-only per the systemd container
interface[1]. Some sysctls are then made writable again by bind mounting
them across from the private /proc mounted above.

This may cause issues for privileged daemonsets that set sysctls that
aren't namespaced (it may work anyway, as they often set them to the
same value on multiple nodes). That can be worked around by adding
additional bind mounts via docker exec, which makes it clear that kind
can't support such interfaces and that they might leak from the
container.

[1]: https://systemd.io/CONTAINER_INTERFACE/
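
The sequence described in the commit message can be sketched as a dry run. This is an approximation, not the PR's actual script: the `run` wrapper, the `readonly_proc_sys` function name, the exact mount flags for the read-only remount, and the example allow list are all assumptions made for illustration. The diff under review also guards each carve-out with a file-exists check, which is omitted here so the dry run prints every step.

```shell
# Dry-run wrapper: prints each command instead of executing it.
# To apply for real (as root inside a privileged node container),
# the body would be "$@" instead.
run() {
  printf '%s\n' "$*"
}

# Arguments: allow-listed sysctls as paths relative to /proc/sys.
readonly_proc_sys() {
  # 1. Private read-write mounts of proc and sysfs.
  run mkdir -p /kind/private/proc /kind/private/sys
  run mount -t proc proc /kind/private/proc
  run mount -t sysfs sysfs /kind/private/sys
  # 2. Bind /proc/sys onto itself, then remount that bind read-only.
  run mount --bind /proc/sys /proc/sys
  run mount -o remount,bind,ro /proc/sys
  # 3. Carve-outs: bind the writable private copy over each allowed entry.
  for rel in "$@"; do
    run mount --bind -o rw "/kind/private/proc/sys/$rel" "/proc/sys/$rel"
  done
}

# Hypothetical allow list, for illustration only.
readonly_proc_sys net/ipv4/ip_forward net/ipv6/conf/all/forwarding
```

The ordering matters: the private proc mount has to exist before /proc/sys is made read-only, because it is the source for every writable carve-out.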
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 13, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dgl
Once this PR has been reviewed and has the lgtm label, please assign aojea for approval. For more information see the Kubernetes Code Review Process.

@k8s-ci-robot
Contributor

Hi @dgl. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 13, 2024
  if [[ -f /kind/private/proc/sys/"${mount_point}" ]]; then
    mount --bind -o rw /kind/private/proc/sys/"${mount_point}" /proc/sys/"${mount_point}"
  fi
done
Member

I'm not sure about the robustness of this.
I think this should be opt-in.

log_info 'remounting /sys read-only'
# systemd-in-a-container should have read only /sys
# https://systemd.io/CONTAINER_INTERFACE/
# however, we need other things from `docker run --privileged` ...
# and this flag also happens to make /sys rw, amongst other things
#
-# This step is ignored when running inside UserNS, because it fails with EACCES.
+# This step is ignored when running inside UserNS, because it can fail with
+# EACCES.
Member

Unnecessary change

@AkihiroSuda
Member

ensure systemd or other components don't change sysctls unexpectedly

Rootless mode (https://kind.sigs.k8s.io/docs/user/rootless/) almost solves this issue.

@BenTheElder
Member

As mentioned on #3511 this could be a more complete way to ensure systemd or other components don't change sysctls unexpectedly. This also makes sysfs mountable per #3436 (but that is just the mount of sysfs on /kind/private/sys, so can easily be split, aside from any naming preferences).

I'm really hesitant to ship a change like this, because it's hard to say how we'll break users who have come to rely on the current behavior over the years. Disabling something like udev/binfmt_misc, on the other hand, is cheap and reasonable, at the risk of missing some future systemd behavior.

4 participants