Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save CRIU restore.log #7904

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

adrianreber
Copy link
Member

What type of PR is this?

/kind other

What this PR does / why we need it:

If there is an error restoring a container then there is usually an error message pointing to the log file created by CRIU. Unfortunately this log file is created in the bundle directory which is remove after container restore. The user gets told to look at a file which is automatically deleted.

In the case of an error the log file is now copied to a temporary directory:

os.CreateTemp("", fmt.Sprintf("restore-%s-*.log", ctr.ID()))

The resulting error message is also adapted to point to this copy of the log file.

This change comes with a test that tries to trigger a restore error by creating a container with an established TCP connection and telling CRIU to handle established TCP connection while not telling CRIU to handle established TCP connection during restore. The restore will fail and the test can verify that the reported log file copy exists.

The main reason for this change is that one of most asked questions about checkpoint/restore in Kubernetes/CRI-O is that the log file is not readable.

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

Does this PR introduce a user-facing change?

The CRIU log file (restore.log) is no longer deleted if the restore of the container fails. It will be copied to a temporary location.

@openshift-ci openshift-ci bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 18, 2024
@openshift-ci openshift-ci bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/other Categorizes issue or PR as not clearly related to any existing kind/* category labels Mar 18, 2024
Copy link
Contributor

openshift-ci bot commented Mar 18, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adrianreber
Once this PR has been reviewed and has the lgtm label, please assign nalind for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@adrianreber
Copy link
Member Author

Hmm, I know why the unit tests are failing, but the make vendor change is unexpected.

@kwilczynski
Copy link
Member

@adrianreber, is there a way to save this file where CRI-O saves its own log file at the moment?

I actually don't know.

@adrianreber
Copy link
Member Author

@adrianreber, is there a way to save this file where CRI-O saves its own log file at the moment?

I actually don't know.

I am open to save the log file wherever it makes most sense.

Copy link

codecov bot commented Mar 19, 2024

Codecov Report

Attention: Patch coverage is 62.50000% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 48.88%. Comparing base (93b05e5) to head (278dc47).
Report is 256 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7904      +/-   ##
==========================================
+ Coverage   48.87%   48.88%   +0.01%     
==========================================
  Files         152      152              
  Lines       16447    16462      +15     
==========================================
+ Hits         8038     8048      +10     
- Misses       7433     7436       +3     
- Partials      976      978       +2     

@haircommander
Copy link
Member

do you think it would be useful to log this in crio instead? I feel like it may be tough for an administrator to find this file

@adrianreber
Copy link
Member Author

do you think it would be useful to log this in crio instead? I feel like it may be tough for an administrator to find this file

Do you mean dumping the complete content of the file in the CRI-O log?

Currently the location of the log file is part of the error message the kubelet gives the user.

Being part of CRI-O would also work. It is, however, a quite large log file. So, it would be a lot of data.

I am open to suggestions. Just trying to preserve the information in the case of an error.

@kwilczynski
Copy link
Member

@adrianreber, is there a way to save this file where CRI-O saves its own log file at the moment?
I actually don't know.

I am open to save the log file wherever it makes most sense.

My idea was to save the file to the exact location of the default CRI-O log file. However, CRI-O would log to the standard error as default rather than to any given log file, making it harder to log into the exact location. Unless someone provides a file path via the --log command-line option.

I suppose the temporary directory would suffice.

If there is an error restoring a container then there is usually an
error message pointing to the log file created by CRIU. Unfortunately
this log file is created in the bundle directory which is remove after
container restore. The user gets told to look at a file which is
automatically deleted.

In the case of an error the log file is now copied to a temporary
directory:

  os.CreateTemp("", fmt.Sprintf("restore-%s-*.log", ctr.ID()))

The resulting error message is also adapted to point to this copy of the
log file.

This change comes with a test that tries to trigger a restore error by
creating a container with an established TCP connection and telling CRIU
to handle established TCP connection while not telling CRIU to handle
established TCP connection during restore. The restore will fail and the
test can verify that the reported log file copy exists.

The main reason for this change is that one of most asked questions
about checkpoint/restore in Kubernetes/CRI-O is that the log file is not
readable.

Signed-off-by: Adrian Reber <areber@redhat.com>
@rst0git
Copy link
Contributor

rst0git commented Apr 9, 2024

@adrianreber Would it make sense to use something similar to the logCriuErrors functionality in runc?

@adrianreber
Copy link
Member Author

@adrianreber Would it make sense to use something similar to the logCriuErrors functionality in runc?

This change will be part of runc 1.2.0 and will probably lead to a more useful error message runc passes to CRI-O and then to Kubernetes. For a complete problem analysis we still need the complete log file which will be deleted by CRI-O the way it is currently setup. The runc change will be helpful but not always enough.

@adrianreber
Copy link
Member Author

/retest-required

@adrianreber
Copy link
Member Author

/test ci-rhel-e2e

Copy link

A friendly reminder that this PR had no activity for 30 days.

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/other Categorizes issue or PR as not clearly related to any existing kind/* category release-note Denotes a PR that will be considered when it comes time to generate release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants