fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero #4618

Merged: 3 commits, May 31, 2024

Conversation

kalyazin
Contributor

@kalyazin kalyazin commented May 17, 2024

Changes

This change introduces a workaround: if, when taking a snapshot, we see a zero MSR_IA32_TSC_DEADLINE, we replace its value with the MSR_IA32_TSC value from the same vCPU so that the vCPU continues to receive TSC interrupts.
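
The substitution described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `MsrEntry` and `fix_zero_tsc_deadline` are hypothetical names, and the sketch assumes the saved MSRs of one vCPU are available as a flat list of index/value pairs. The two MSR indices are the architectural ones.

```rust
/// Architectural MSR indices for the time-stamp counter and the
/// local APIC TSC-deadline register.
const MSR_IA32_TSC: u32 = 0x10;
const MSR_IA32_TSC_DEADLINE: u32 = 0x6e0;

/// Hypothetical stand-in for one saved MSR of a vCPU
/// (an assumption for this sketch, not Firecracker's actual type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MsrEntry {
    index: u32,
    data: u64,
}

/// If the saved TSC_DEADLINE value is zero, overwrite it with the saved TSC
/// value from the same vCPU. Returns `true` if a replacement was made.
fn fix_zero_tsc_deadline(msrs: &mut [MsrEntry]) -> bool {
    // Look up the saved TSC value; without it there is nothing to substitute.
    let tsc = match msrs.iter().find(|m| m.index == MSR_IA32_TSC) {
        Some(entry) => entry.data,
        None => return false,
    };

    // Only touch the deadline when it is present and zero (timer disarmed).
    if let Some(entry) = msrs
        .iter_mut()
        .find(|m| m.index == MSR_IA32_TSC_DEADLINE)
    {
        if entry.data == 0 {
            entry.data = tsc;
            return true;
        }
    }
    false
}
```

A non-zero deadline is left untouched, so a vCPU whose timer is still armed keeps the value the guest programmed.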

Reason

On x86_64, we observed that when restoring from a snapshot, one of the vCPUs had MSR_IA32_TSC_DEADLINE cleared and never received TSC interrupts until the MSR was updated externally (e.g., by setting the system time).

We believe this happens because the TSC interrupt is lost during the snapshot-taking process: the MSR is cleared, but the interrupt is not delivered to the guest, so the guest does not rearm the timer.

A visible effect of the lost interrupt is a failure to connect to a restored VM via SSH, similar to https://buildkite.com/firecracker/firecracker-pr-nightly/builds/1403#018f83db-5395-4656-8d9c-83b6fcfcfd54/50-1994 .
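
To illustrate why a zero deadline leaves the vCPU stuck, here is a toy model of a timer in TSC-deadline mode. It is a deliberately simplified sketch: `DeadlineTimer` and `interrupts_delivered` are hypothetical names, and the guest is assumed to rearm the timer only from its interrupt handler, which ignores other paths a real kernel may take. The model relies on two architectural properties: a zero deadline means the timer is disarmed, and firing clears the deadline.

```rust
/// Toy model of a local APIC timer in TSC-deadline mode
/// (hypothetical type for illustration only).
struct DeadlineTimer {
    /// Value of the TSC_DEADLINE MSR; zero means the timer is disarmed.
    deadline: u64,
}

impl DeadlineTimer {
    /// Observe the current TSC value; returns `true` if the timer fires.
    fn tick(&mut self, tsc: u64) -> bool {
        if self.deadline != 0 && tsc >= self.deadline {
            // Firing disarms the timer and clears the MSR.
            self.deadline = 0;
            true
        } else {
            false
        }
    }
}

/// Count the timer interrupts a simplified guest receives after restore.
/// The guest rearms the timer `period` ticks ahead, but only from its
/// interrupt handler (an assumption of this sketch).
fn interrupts_delivered(restored_deadline: u64, start_tsc: u64, steps: u64, period: u64) -> u64 {
    let mut timer = DeadlineTimer {
        deadline: restored_deadline,
    };
    let mut count = 0;
    for tsc in start_tsc..start_tsc + steps {
        if timer.tick(tsc) {
            count += 1;
            // The guest's handler runs and programs the next deadline.
            timer.deadline = tsc + period;
        }
    }
    count
}
```

In this model, restoring with a deadline of zero yields no interrupts at all, while restoring with the deadline set to the saved TSC value fires on the first tick and lets the guest resume its normal rearm cycle.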

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • [ ] If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • [ ] API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • [ ] New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.


codecov bot commented May 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.10%. Comparing base (3ce507f) to head (b222c18).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4618      +/-   ##
==========================================
+ Coverage   82.08%   82.10%   +0.01%     
==========================================
  Files         255      255              
  Lines       31258    31280      +22     
==========================================
+ Hits        25659    25681      +22     
  Misses       5599     5599              
Flag Coverage Δ
4.14-c5n.metal 79.59% <100.00%> (+0.01%) ⬆️
4.14-c7g.metal ?
4.14-m5n.metal 79.58% <100.00%> (+0.01%) ⬆️
4.14-m6a.metal 78.81% <100.00%> (+0.02%) ⬆️
4.14-m6g.metal 76.62% <ø> (ø)
4.14-m6i.metal 79.57% <100.00%> (+<0.01%) ⬆️
4.14-m7g.metal 76.62% <ø> (ø)
5.10-c5n.metal 82.11% <100.00%> (+0.01%) ⬆️
5.10-c7g.metal ?
5.10-m5n.metal 82.09% <100.00%> (+0.01%) ⬆️
5.10-m6a.metal 81.40% <100.00%> (+0.01%) ⬆️
5.10-m6g.metal 79.40% <ø> (ø)
5.10-m6i.metal 82.09% <100.00%> (+<0.01%) ⬆️
5.10-m7g.metal 79.39% <ø> (-0.01%) ⬇️
6.1-c5n.metal 82.11% <100.00%> (+0.01%) ⬆️
6.1-c7g.metal ?
6.1-m5n.metal 82.09% <100.00%> (+<0.01%) ⬆️
6.1-m6a.metal 81.40% <100.00%> (+0.01%) ⬆️
6.1-m6g.metal 79.39% <ø> (-0.01%) ⬇️
6.1-m6i.metal 82.09% <100.00%> (+<0.01%) ⬆️
6.1-m7g.metal 79.39% <ø> (-0.01%) ⬇️


@kalyazin kalyazin force-pushed the fix_tsc_deadline branch 3 times, most recently from 40938cb to d31bcc2 Compare May 17, 2024 16:27
@kalyazin kalyazin changed the title from "[WIP] fix(snapshot/x86_64): make sure TSC_DEADLINE is non-zero" to "[WIP] fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero" May 17, 2024
@kalyazin kalyazin marked this pull request as ready for review May 17, 2024 17:28
@kalyazin kalyazin changed the title from "[WIP] fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero" to "fix(snapshot/x86_64): make sure TSC_DEADLINE MSR is non-zero" May 17, 2024
@kalyazin kalyazin self-assigned this May 17, 2024
@kalyazin kalyazin added the "Status: Awaiting review" label May 17, 2024
@kalyazin kalyazin force-pushed the fix_tsc_deadline branch 4 times, most recently from d7efcff to 58e73cf Compare May 28, 2024 13:18
pb8o
pb8o previously approved these changes May 28, 2024
On x86_64, we observed that when restoring from a snapshot,
one of the vCPUs had MSR_IA32_TSC_DEADLINE cleared and never
received TSC interrupts until the MSR was updated externally
(e.g., by setting the system time).

We believe this happens because the TSC interrupt is lost
during the snapshot-taking process: the MSR is cleared, but the
interrupt is not delivered to the guest, so the guest
does not rearm the timer.

A visible effect of the lost interrupt is a failure to connect
to a restored VM via SSH.

This commit introduces a workaround: if, when taking a snapshot,
we see a zero MSR_IA32_TSC_DEADLINE, we replace its value with
the MSR_IA32_TSC value from the same vCPU so that the vCPU
continues to receive TSC interrupts.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
The TSC_DEADLINE MSR value is volatile, as it is updated
by the guest kernel based on the current TSC value.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
The TSC_DEADLINE MSR value is volatile, as it is updated
by the guest kernel based on the current TSC value.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
@ShadowCurse ShadowCurse merged commit cee34ab into firecracker-microvm:main May 31, 2024
7 checks passed
@kalyazin kalyazin deleted the fix_tsc_deadline branch May 31, 2024 14:49
4 participants