New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Save CRIU restore.log #7904
base: main
Are you sure you want to change the base?
Save CRIU restore.log #7904
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: adrianreber The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hmm, I know why the unit tests are failing, but the |
@adrianreber, is there a way to save this file where CRI-O saves its own log file at the moment? I actually don't know. |
I am open to save the log file wherever it makes most sense. |
83d2723
to
1cb1c36
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7904 +/- ##
==========================================
+ Coverage 48.87% 48.88% +0.01%
==========================================
Files 152 152
Lines 16447 16462 +15
==========================================
+ Hits 8038 8048 +10
- Misses 7433 7436 +3
- Partials 976 978 +2 |
do you think it would be useful to log this in crio instead? I feel like it may be tough for an administrator to find this file |
Do you mean dumping the complete content of the file in the CRI-O log? Currently the location of the log file is part of the error message the kubelet gives the user. Being part of CRI-O would also work. It is, however, a quite large log file. So, it would be a lot of data. I am open to suggestions. Just trying to preserve the information in the case of an error. |
My idea was to save the file to the exact location of the default CRI-O log file. However, CRI-O would log to the standard error as default rather than to any given log file, making it harder to log into the exact location. Unless someone provides a file path via the I suppose the temporary directory would suffice. |
If there is an error restoring a container then there is usually an error message pointing to the log file created by CRIU. Unfortunately this log file is created in the bundle directory which is remove after container restore. The user gets told to look at a file which is automatically deleted. In the case of an error the log file is now copied to a temporary directory: os.CreateTemp("", fmt.Sprintf("restore-%s-*.log", ctr.ID())) The resulting error message is also adapted to point to this copy of the log file. This change comes with a test that tries to trigger a restore error by creating a container with an established TCP connection and telling CRIU to handle established TCP connection while not telling CRIU to handle established TCP connection during restore. The restore will fail and the test can verify that the reported log file copy exists. The main reason for this change is that one of most asked questions about checkpoint/restore in Kubernetes/CRI-O is that the log file is not readable. Signed-off-by: Adrian Reber <areber@redhat.com>
1cb1c36
to
278dc47
Compare
@adrianreber Would it make sense to use something similar to the logCriuErrors functionality in runc? |
This change will be part of runc 1.2.0 and will probably lead to a more useful error message runc passes to CRI-O and then to Kubernetes. For a complete problem analysis we still need the complete log file which will be deleted by CRI-O the way it is currently setup. The runc change will be helpful but not always enough. |
/retest-required |
/test ci-rhel-e2e |
A friendly reminder that this PR had no activity for 30 days. |
What type of PR is this?
/kind other
What this PR does / why we need it:
If there is an error restoring a container then there is usually an error message pointing to the log file created by CRIU. Unfortunately this log file is created in the bundle directory which is remove after container restore. The user gets told to look at a file which is automatically deleted.
In the case of an error the log file is now copied to a temporary directory:
os.CreateTemp("", fmt.Sprintf("restore-%s-*.log", ctr.ID()))
The resulting error message is also adapted to point to this copy of the log file.
This change comes with a test that tries to trigger a restore error by creating a container with an established TCP connection and telling CRIU to handle established TCP connection while not telling CRIU to handle established TCP connection during restore. The restore will fail and the test can verify that the reported log file copy exists.
The main reason for this change is that one of most asked questions about checkpoint/restore in Kubernetes/CRI-O is that the log file is not readable.
Which issue(s) this PR fixes:
None
Special notes for your reviewer:
Does this PR introduce a user-facing change?