Save CRIU restore.log #7904

adrianreber · 2024-03-18T17:42:12Z

What type of PR is this?

/kind other

What this PR does / why we need it:

If there is an error restoring a container, there is usually an error message pointing to the log file created by CRIU. Unfortunately, this log file is created in the bundle directory, which is removed after the container restore. The user is told to look at a file that has already been deleted automatically.

In the case of an error the log file is now copied to a temporary directory:

os.CreateTemp("", fmt.Sprintf("restore-%s-*.log", ctr.ID()))

The resulting error message is also adapted to point to this copy of the log file.

This change comes with a test that tries to trigger a restore error by creating a container with an established TCP connection, telling CRIU to handle established TCP connections during checkpoint, and not telling CRIU to handle established TCP connections during restore. The restore will fail and the test can verify that the reported log file copy exists.

The main reason for this change is that one of the most frequently asked questions about checkpoint/restore in Kubernetes/CRI-O concerns the restore log file not being readable.

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

Does this PR introduce a user-facing change?

The CRIU log file (restore.log) is no longer deleted if the restore of the container fails. It will be copied to a temporary location.

openshift-ci · 2024-03-18T17:42:45Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adrianreber
Once this PR has been reviewed and has the lgtm label, please assign nalind for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

adrianreber · 2024-03-18T20:24:34Z

Hmm, I know why the unit tests are failing, but the make vendor change is unexpected.

kwilczynski · 2024-03-18T20:28:28Z

@adrianreber, is there a way to save this file where CRI-O saves its own log file at the moment?

I actually don't know.

adrianreber · 2024-03-19T07:29:18Z

> @adrianreber, is there a way to save this file where CRI-O saves its own log file at the moment?
>
> I actually don't know.

I am open to save the log file wherever it makes most sense.

codecov · 2024-03-19T11:29:25Z

Codecov Report

Attention: Patch coverage is 62.50%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 48.88%. Comparing base (93b05e5) to head (278dc47).
Report is 256 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7904      +/-   ##
==========================================
+ Coverage   48.87%   48.88%   +0.01%     
==========================================
  Files         152      152              
  Lines       16447    16462      +15     
==========================================
+ Hits         8038     8048      +10     
- Misses       7433     7436       +3     
- Partials      976      978       +2

haircommander · 2024-03-25T14:59:14Z

do you think it would be useful to log this in crio instead? I feel like it may be tough for an administrator to find this file

adrianreber · 2024-03-25T15:40:42Z

> do you think it would be useful to log this in crio instead? I feel like it may be tough for an administrator to find this file

Do you mean dumping the complete content of the file in the CRI-O log?

Currently the location of the log file is part of the error message the kubelet gives the user.

Making it part of the CRI-O log would also work. It is, however, quite a large log file, so it would be a lot of data.

I am open to suggestions. Just trying to preserve the information in the case of an error.

internal/lib/restore.go

kwilczynski · 2024-03-26T06:12:17Z

> > @adrianreber, is there a way to save this file where CRI-O saves its own log file at the moment?
> > I actually don't know.
>
> I am open to save the log file wherever it makes most sense.

My idea was to save the file next to the default CRI-O log file. However, CRI-O logs to standard error by default rather than to a log file, so there is no fixed location to reuse unless someone provides a file path via the --log command-line option.

I suppose the temporary directory would suffice.

Signed-off-by: Adrian Reber <areber@redhat.com>

rst0git · 2024-04-09T11:48:03Z

@adrianreber Would it make sense to use something similar to the logCriuErrors functionality in runc?

adrianreber · 2024-04-09T12:08:41Z

> @adrianreber Would it make sense to use something similar to the logCriuErrors functionality in runc?

That runc change will be part of runc 1.2.0 and will probably lead to a more useful error message being passed from runc to CRI-O and then to Kubernetes. For a complete problem analysis we still need the complete log file, which CRI-O deletes the way it is currently set up. The runc change will be helpful but not always enough.

adrianreber · 2024-04-09T12:16:15Z

/retest-required

adrianreber · 2024-04-09T13:57:11Z

/test ci-rhel-e2e

github-actions · 2024-05-10T00:03:01Z

A friendly reminder that this PR had no activity for 30 days.

adrianreber requested a review from mrunalp as a code owner March 18, 2024 17:42

openshift-ci bot added the release-note label Mar 18, 2024

openshift-ci bot requested review from klihub and kwilczynski March 18, 2024 17:42

openshift-ci bot added the dco-signoff: yes and kind/other labels Mar 18, 2024

adrianreber force-pushed the 2024-03-18-save-restore-log branch from 83d2723 to 1cb1c36 Compare March 19, 2024 09:21

kwilczynski reviewed Mar 26, 2024

adrianreber force-pushed the 2024-03-18-save-restore-log branch from 1cb1c36 to 278dc47 Compare March 26, 2024 09:58

github-actions bot added and removed the lifecycle/stale label May 10, 2024
