Checkpoint/Forensic container checkpointing Feature Enhancement request #114591

jguionnet · 2022-12-19T23:06:17Z

What would you like to be added?

This feature is excellent. Could the following improvements be considered?

kubectl support, so we can more easily automate a solution.
Support for additional parameters to influence CRIU, and the ability to store the image in a registry.
-- We needed to pass the --file-locks option for the feature to work for our app. For more details, see: How to pass the --file-locks option when checkpointing with the kubelet checkpoint-restore/criu#2018
Promotion to a beta feature, so that it is exposed in EKS. EKS does not support alpha features.
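
For context on the automation request: the blog post linked further down this thread triggers a checkpoint with an HTTP POST to `/checkpoint/<namespace>/<pod>/<container>` on the kubelet port. The following is a minimal Python sketch that only assembles that URL. The helper name `checkpoint_url` and the example pod and container names are hypothetical, and authentication against the kubelet (client certificates) is deliberately left out.

```python
from urllib.parse import quote


def checkpoint_url(node: str, namespace: str, pod: str, container: str,
                   port: int = 10250) -> str:
    """Build the kubelet checkpoint endpoint URL for one container.

    The request itself must be sent as an authenticated HTTP POST;
    this sketch only assembles the URL.
    """
    # Percent-encode each path segment so unusual names cannot alter the path.
    segments = [quote(part, safe="") for part in (namespace, pod, container)]
    return f"https://{node}:{port}/checkpoint/" + "/".join(segments)


if __name__ == "__main__":
    # Hypothetical pod "counters" with container "counter" in namespace "default".
    print(checkpoint_url("localhost", "default", "counters", "counter"))
```

A kubectl subcommand, as requested above, would presumably wrap a call like this so that scripts do not need direct access to the kubelet port.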

Why is this needed?

The use cases we are looking at are the following: we have monolithic applications deployed on K8s. They start too slowly to enable reactive scaling or to implement scale to zero (using KEDA, for example). If we could checkpoint them, we could offer these options. To support these use cases, we need the above enhancements.

k8s-ci-robot · 2022-12-19T23:06:26Z

@jguionnet: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-12-19T23:12:21Z

@jguionnet: The label(s) sig/sig-node cannot be applied, because the repository doesn't have them.

In response to this:

/sig sig-node

jguionnet · 2022-12-19T23:14:47Z

/sig node

pacoxu · 2022-12-20T07:44:31Z

/sig node
The KEP issue is in kubernetes/enhancements#2008.

Since [1], CRIU would fail if it finds file locks being used by the application that is being checkpointed and the --file-locks option has not been specified. This pull request enables checkpointing of file locks by default.

Fixes: checkpoint-restore/criu#2018
Fixes: kubernetes/kubernetes#114591

[1] checkpoint-restore/criu#1357

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>

adrianreber · 2022-12-20T17:27:55Z

@jguionnet thanks for opening this. I would also like to see this feature move forward.

One of the main reasons for not yet having extended checkpointing to kubectl is that a locally stored checkpoint puts all memory pages on the local disk. All memory pages potentially include secrets such as private keys, random numbers, and similar information. Our current approach to avoid leaking secrets and the like is to make the checkpoint accessible only by root. Being root already gives access to the memory content of all running processes, so if the checkpoint is only accessible by root the situation should be the same. The big difference is that the checkpoint archive can easily be moved to another system, and then the secrets might leak. Moving processes to another system to run as non-root is much less likely than moving a tar archive. The resulting checkpoint is protected just as well as the memory of the running process, but it is more portable. I am definitely in favour of extending checkpoint support to kubectl, but it presents a new situation that did not exist in this form before.
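
To make the "only accessible by root" property concrete, here is a minimal Python sketch of how a script could verify it for a checkpoint archive. The function name `is_owner_only` is hypothetical, and the check (owner UID plus no group/other permission bits) is an assumption about what "root only" means in practice, not a description of what the kubelet does.

```python
import os
import stat


def is_owner_only(path: str, owner_uid: int = 0) -> bool:
    """Return True if `path` is owned by `owner_uid` and has no
    permission bits set for group or others."""
    info = os.stat(path)
    # S_IRWXG and S_IRWXO cover read/write/execute for group and others.
    open_to_others = info.st_mode & (stat.S_IRWXG | stat.S_IRWXO)
    return info.st_uid == owner_uid and open_to_others == 0
```

Such a check says nothing about a copy of the archive on another system, which is exactly the portability concern raised above.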

We have been discussing whether encrypted images are a possible way to solve the problem of leaking secrets, but we are not sure at what level to encrypt. We think it could be implemented at the CRIU level but also at higher levels. One idea is to take the steps from https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/ to convert the local checkpoint archive to an OCI image using buildah, and then encrypt the image using the encryption support in skopeo. It would be nice to have it all included in CRI-O. There is also the discussion about defining a standard for checkpoint images (opencontainers/image-spec#962), but it is not really moving forward.

About passing checkpoint parameters from Kubernetes to CRIU: this probably means extending the CRI, maybe with a generic string array to pass whatever is necessary to CRIU, or with a CRI entry for each option. Not difficult. The last time we tried to add checkpoint support to the CRI, it took a really long time. Not sure if extending the existing functionality is easier.
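
The two shapes mentioned above (a generic string array versus one entry per option) can be sketched in Python as follows. Both class names and their fields are hypothetical and are not part of the CRI. The flags `--file-locks` and `--tcp-established` are existing CRIU command-line options, used here only to illustrate the mapping; a real runtime may talk to CRIU through a different interface than command-line flags.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenericCheckpointOptions:
    """Shape 1: an opaque list of strings forwarded to CRIU as-is."""
    criu_args: List[str] = field(default_factory=list)

    def to_criu_args(self) -> List[str]:
        # Nothing is validated; whatever the caller passed is handed on.
        return list(self.criu_args)


@dataclass
class TypedCheckpointOptions:
    """Shape 2: one explicit field per supported CRIU option."""
    file_locks: bool = False
    tcp_established: bool = False

    def to_criu_args(self) -> List[str]:
        # Only options the API knows about can ever reach CRIU.
        args = []
        if self.file_locks:
            args.append("--file-locks")
        if self.tcp_established:
            args.append("--tcp-established")
        return args
```

The trade-off is that the generic shape never needs another API change but cannot be validated, while the typed shape is self-documenting but requires an API extension for each new option.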

Moving to beta: I cannot really say how easy that is. I think we should have some way of automatically removing checkpoint archives once there are more than a certain number, so that the local disk does not fill up with checkpoint archives. Extending the existing test cases to actually create a checkpoint, now that CRI-O has the necessary support, would also be a good thing to do. Until now we have only been testing the checkpoint code with the expectation that CRI implementations do not implement it. Now that CRI-O implements it, the test cases could be extended.
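
The "keep at most a certain number of archives" idea could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function name `prune_checkpoints` is hypothetical, the `checkpoint-*.tar` file name pattern is assumed from the archive naming shown in the linked blog post, and modification time is assumed to be a good enough ordering.

```python
from pathlib import Path
from typing import List


def prune_checkpoints(directory: str, keep: int,
                      pattern: str = "checkpoint-*.tar") -> List[Path]:
    """Delete all but the `keep` most recently modified archives.

    Returns the paths that were removed, oldest first.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Sort oldest to newest by modification time.
    archives = sorted(Path(directory).glob(pattern),
                      key=lambda p: p.stat().st_mtime)
    # Everything except the last `keep` entries is surplus.
    surplus = archives[:max(len(archives) - keep, 0)]
    for archive in surplus:
        archive.unlink()
    return surplus
```

A real implementation would also have to decide whether the limit applies per node, per pod, or per container, which is one of the conceptual points still open.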

I think most things you are looking for are not difficult to implement, but there are still a few conceptual points that need to be discussed.

wenhuizhang · 2023-02-06T21:12:01Z

Maybe triage the rootfs for the "passwd file", and clean from there?

k8s-triage-robot · 2023-05-07T21:53:00Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2023-06-06T22:10:51Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2023-07-06T22:27:45Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2023-07-06T22:27:51Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

jguionnet added the kind/feature label Dec 19, 2022

k8s-ci-robot added the needs-sig and needs-triage labels Dec 19, 2022

jguionnet mentioned this issue Dec 19, 2022: How to pass the --file-locks option when checkpointing with the kubelet checkpoint-restore/criu#2018 (Closed)

k8s-ci-robot added the sig/node label and removed the needs-sig label Dec 20, 2022

rst0git mentioned this issue Dec 20, 2022: oci: Enable checkpointing of file locks cri-o/cri-o#6463 (Merged)

k8s-ci-robot added the lifecycle/stale label May 7, 2023

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 6, 2023

k8s-ci-robot closed this as not planned Jul 6, 2023
