Checkpoint/Forensic container checkpointing Feature Enhancement request #114591

Closed
jguionnet opened this issue Dec 19, 2022 · 10 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@jguionnet

What would you like to be added?

This feature is excellent. Could the following improvements be considered?

Why is this needed?

The use cases we are looking at are the following: we have monolithic applications deployed on K8s. They start too slowly to allow reactive scaling or scale-to-zero using KEDA (for example). If we could checkpoint them, we could offer these options. To support these use cases, we need the above enhancements.

@jguionnet added the kind/feature label on Dec 19, 2022
@k8s-ci-robot added the needs-sig and needs-triage labels on Dec 19, 2022
@k8s-ci-robot
Contributor

@jguionnet: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@jguionnet: The label(s) sig/sig-node cannot be applied, because the repository doesn't have them.

In response to this:

/sig sig-node


@jguionnet
Author

jguionnet commented Dec 19, 2022

/sig node

@pacoxu
Member

pacoxu commented Dec 20, 2022

/sig node
The KEP issue is in kubernetes/enhancements#2008.

@k8s-ci-robot added the sig/node label and removed the needs-sig label on Dec 20, 2022
rst0git added a commit to rst0git/cri-o that referenced this issue Dec 20, 2022
Since [1], CRIU would fail if it finds file locks being used by the
application that is being checkpointed and the --file-locks option
has not been specified. This pull request enables checkpointing
of file locks by default.

Fixes: checkpoint-restore/criu#2018
Fixes: kubernetes/kubernetes#114591

[1] checkpoint-restore/criu#1357

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
@adrianreber
Contributor

@jguionnet thanks for opening this. I would also like to see this feature move forward.

One of the main reasons checkpointing has not yet been extended to kubectl is that storing a checkpoint locally puts all of the container's memory pages on the local disk. Those pages can contain secrets such as private keys, random numbers, and similar information. Our current approach to avoid leaking such secrets is to make the checkpoint accessible only by root. Root can already read the memory of every running process, so as long as the checkpoint is root-only the exposure should be the same. The big difference is that a checkpoint archive can easily be moved to another system, where the secrets might leak. Moving a process to another system to run as non-root is much less likely than moving a tar archive. The resulting checkpoint is protected just as well as the memory of the running process, but it is more portable. I am definitely in favour of extending checkpoint support to kubectl, but it creates a situation that did not exist before in this form.
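
To make the "only accessible by root" idea concrete, here is a minimal Python sketch of creating an archive with owner-only permissions and checking that property afterwards. The function names `write_private` and `is_accessible_only_by` are hypothetical and not part of the kubelet or CRI-O; a real check for root would pass uid 0.

```python
import os
import stat


def write_private(path: str, data: bytes) -> None:
    """Create `path` with mode 0600 so only the owning user can read it."""
    # O_EXCL refuses to reuse an existing file that may have looser permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)


def is_accessible_only_by(path: str, uid: int) -> bool:
    """Return True if `path` is owned by `uid` and has no group/other bits set."""
    st = os.stat(path)
    no_group_or_other = stat.S_IMODE(st.st_mode) & (stat.S_IRWXG | stat.S_IRWXO) == 0
    return st.st_uid == uid and no_group_or_other
```

Note that this only covers the archive while it sits on the original host. Once the tar file is copied elsewhere, the permission bits no longer protect its contents, which is the portability concern described above.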

We have been discussing whether encrypted images could solve the problem of leaking secrets, but we are not sure at which level to encrypt. It could be implemented at the CRIU level or at a higher level. One idea is to follow the steps from https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/ to convert the local checkpoint archive to an OCI image using buildah, and then encrypt the image using the encryption support in skopeo. It would be nice to have all of this included in CRI-O. There is also the discussion about defining a standard for checkpoint images (opencontainers/image-spec#962), which is not really moving forward.
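
The buildah-then-skopeo idea could look roughly like the following Python sketch, which only assembles the command lines and does not run them. The buildah steps are modelled on the linked blog post; the skopeo `--encryption-key jwe:<public key>` step, the working-container name `checkpoint-wc`, and all paths are assumptions for illustration, so check them against the buildah and skopeo versions you actually have.

```python
from typing import List


def encrypted_checkpoint_image_commands(
    archive: str,
    container_name: str,
    image: str,
    public_key: str,
    dest_dir: str,
    working_container: str = "checkpoint-wc",  # hypothetical name
) -> List[List[str]]:
    """Build (but do not execute) the commands for archive -> encrypted OCI image."""
    annotation = (
        "--annotation=io.kubernetes.cri-o.annotations.checkpoint.name="
        + container_name
    )
    return [
        # Start from an empty image and add the checkpoint archive at "/".
        ["buildah", "from", "--name", working_container, "scratch"],
        ["buildah", "add", working_container, archive, "/"],
        # The annotation marks the image as a checkpoint for CRI-O.
        ["buildah", "config", annotation, working_container],
        ["buildah", "commit", working_container, image],
        ["buildah", "rm", working_container],
        # Assumed encryption step: copy to an OCI layout, encrypting layers.
        [
            "skopeo", "copy",
            "--encryption-key", "jwe:" + public_key,
            "containers-storage:" + image,
            "oci:" + dest_dir,
        ],
    ]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` as root. Keeping command construction separate from execution makes it easy to review what would run before touching a real checkpoint.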

About passing checkpoint parameters from Kubernetes to CRIU: this probably means extending the CRI, either with a generic string array to pass whatever is necessary to CRIU or with a CRI entry for each option. That is not difficult. However, the last time we tried to add checkpoint support to the CRI it took a really long time, and I am not sure whether extending the existing functionality would be easier.
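
The two shapes mentioned here (a generic string array versus one CRI entry per option) can be compared with a small Python sketch. Both classes are hypothetical illustrations, not the actual CRI messages; `--file-locks` and `--tcp-established` are existing CRIU options used as examples.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenericCheckpointRequest:
    """Variant 1: an opaque list of options handed through to CRIU."""
    container_id: str
    location: str
    criu_options: List[str] = field(default_factory=list)

    def to_criu_args(self) -> List[str]:
        # Nothing is validated; the runtime forwards whatever it receives.
        return list(self.criu_options)


@dataclass
class TypedCheckpointRequest:
    """Variant 2: one explicit field per supported option."""
    container_id: str
    location: str
    file_locks: bool = False
    tcp_established: bool = False

    def to_criu_args(self) -> List[str]:
        args: List[str] = []
        if self.file_locks:
            args.append("--file-locks")
        if self.tcp_established:
            args.append("--tcp-established")
        return args
```

The generic variant never needs another API change but lets callers pass arbitrary flags. The typed variant is easier to validate and document, at the cost of an API change for every new option.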

Moving to beta: I cannot really say how easy that is. I think we should have some way of automatically removing checkpoint archives once there are more than a certain number, so that they do not fill up the local disk. Extending the existing test cases to actually create a checkpoint, now that CRI-O has the necessary support, would also be a good thing to do. Until now we have only tested the checkpoint code with the expectation that CRI implementations do not implement it.
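
One possible retention policy for the automatic removal is "keep the newest N archives", sketched below in Python. The function name `prune_checkpoints` and the `checkpoint-*.tar` file pattern are assumptions for illustration; this is not how the kubelet behaves today.

```python
from pathlib import Path
from typing import List


def prune_checkpoints(directory: Path, keep: int) -> List[Path]:
    """Delete all but the `keep` most recently modified checkpoint archives.

    Returns the paths that were removed.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Newest first; the name breaks ties so the order is deterministic.
    archives = sorted(
        directory.glob("checkpoint-*.tar"),
        key=lambda p: (p.stat().st_mtime, p.name),
        reverse=True,
    )
    removed: List[Path] = []
    for old in archives[keep:]:
        old.unlink()
        removed.append(old)
    return removed
```

A count-based limit is the simplest option. A limit on total size or age might be a better fit, since checkpoint archives vary in size with the memory footprint of the container.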

I think most of the things you are looking for are not difficult to implement, but there are still a few conceptual points that need to be discussed.

@wenhuizhang

Maybe triage the rootfs for a "passwd file" and clean up from there?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on May 7, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 6, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned on Jul 6, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

