Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vmm: Support passing Net FDs to Restore #6402

Merged
merged 7 commits into from May 14, 2024

Conversation

pupacha
Copy link
Contributor

@pupacha pupacha commented Apr 21, 2024

'NetConfig' FDs, when explicitly passed via SCM_RIGHTS during VM creation, are marked as invalid during snapshot. See: #6332. So, Restore should support input for the new net FDs. This patch adds two new fields to 'RestoreConfig' - 1.net_ids 2.net_fds. 'net_ids' is a list of NetConfig id. 'net_fds' is a list of FDs for required NetConfigs. These fds are replaced into the fds field of NetConfig appropriately.

Implement 'validate' for RestoreConfig
Allow net FDs to be sent along with 'restore' in ch-remote
Add new macro 'vm_action_put_handler_body_with_fds'; to be used by VmRestore and VmAddNet

tests: Add test_snapshot_restore_with_fd to integration tests
tests: Reduce Guest memory in _test_snapshot_restore
vmm: Add unit tests for restore

Fixes: #6286

@pupacha pupacha requested a review from a team as a code owner April 21, 2024 19:31
@jinankjain
Copy link
Member

@pupacha Can you please include an integration test to test this feature?

vmm/src/config.rs Outdated Show resolved Hide resolved
Copy link
Member

@rbradford rbradford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely needs an integration test!

@pupacha
Copy link
Contributor Author

pupacha commented Apr 22, 2024

@jinankjain @rbradford Added integration test.

@rbradford could you please approve the PR workflow/checks. I dont see an option to trigger them.

@liuw
Copy link
Member

liuw commented Apr 22, 2024

@pupacha I think some of the commits are missing your SoB.

@pupacha pupacha force-pushed the pavan/issue_6286 branch 3 times, most recently from 95cff5e to c5053d0 Compare April 22, 2024 20:51
@pupacha
Copy link
Contributor Author

pupacha commented Apr 22, 2024

@pupacha I think some of the commits are missing your SoB.

My name is too long and [Check commit messages] complained about line being more than 79 char in commit description. When I split SoB into multiple lines, DCO Check threw mismatch error. I had to redo the commits with trimmed down name.

tests/integration.rs Outdated Show resolved Hide resolved
@pupacha pupacha force-pushed the pavan/issue_6286 branch 2 times, most recently from d6cacc4 to 237d19d Compare April 23, 2024 07:37
vmm/src/config.rs Outdated Show resolved Hide resolved
vmm/src/config.rs Outdated Show resolved Hide resolved
@rbradford
Copy link
Member

@pupacha I think some of the commits are missing your SoB.

My name is too long and [Check commit messages] complained about line being more than 79 char in commit description. When I split SoB into multiple lines, DCO Check threw mismatch error. I had to redo the commits with trimmed down name.

The commit message length check is advisory only - don't worry about that - we regularly exceed it with long names or copy and pasted errors but it's there to prevent commits from where the body is not wrapped.

@pupacha pupacha force-pushed the pavan/issue_6286 branch 3 times, most recently from 45d4fc5 to f40c6ff Compare April 23, 2024 15:00
@pupacha
Copy link
Contributor Author

pupacha commented Apr 23, 2024

Some integration tests fail with "No space left on device (os error 28)". Could anyone please help me with retriggering this failed job https://github.com/cloud-hypervisor/cloud-hypervisor/actions/runs/8802672511/job/24159343723?pr=6402 ?

@liuw
Copy link
Member

liuw commented Apr 24, 2024

It seems that the CI is a bit flaky. The integration tests failed a few times, but in different places which don't seem to be related to the changes here.

@pupacha
Copy link
Contributor Author

pupacha commented Apr 24, 2024

It seems that the CI is a bit flaky. The integration tests failed a few times, but in different places which don't seem to be related to the changes here.

Yes, I see different tests failing for different runs. But most often I see test_snapshot_restore_hotplug_virtiomem failing with No space left on device (os error 28) error. This could be because new snapshot_restore_with_fds test (which could be running simultaneously). I will try adding a step in these tests to cleanup the snapshot folder after restore.

@pupacha pupacha force-pushed the pavan/issue_6286 branch 3 times, most recently from 73d4b69 to e04b41a Compare April 24, 2024 18:53
@pupacha pupacha force-pushed the pavan/issue_6286 branch 7 times, most recently from a549322 to c15c796 Compare May 9, 2024 06:12
@rbradford rbradford marked this pull request as draft May 9, 2024 06:50
@pupacha pupacha force-pushed the pavan/issue_6286 branch 3 times, most recently from 9ddd3d6 to 7a0b3ac Compare May 9, 2024 13:07
@rbradford
Copy link
Member

I think it should just be one commit moving the tests.

@pupacha
Copy link
Contributor Author

pupacha commented May 9, 2024

I think it should just be one commit moving the tests.

I commented about it somewhere up in the conversations. Copying it here as well.

I went ahead and moved all the snapshot/restore tests to common_sequential. No flakiness is noticed during the test runs. But git does not show the diff as "code chunk removed + code chunk added", it would mess up the diff and blame, see 9ddd3d6
So, I had to go about a slightly ugly way of splitting the tests move into 2 commits -

  1. remove tests from commom_parallel
  2. add tests to common_sequential

Please let me know if you have any other suggestions on this, @rbradford

@rbradford rbradford marked this pull request as ready for review May 9, 2024 16:10
@rbradford
Copy link
Member

Adds ~ 4 minutes to the run - but maybe we can shave the time off elsewhere (and ultimately if it's more reliable then we won't have to more builds to deal with flaky tests.) As you probably realised we're right at the limit so adding a new test can push the load/resource consumption up such than an unrelated test will start to fail!

@rbradford rbradford added this pull request to the merge queue May 9, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 9, 2024
pupacha and others added 7 commits May 9, 2024 17:06
'NetConfig' FDs, when explicitly passed via SCM_RIGHTS during VM
creation, are marked as invalid during snapshot. See: cloud-hypervisor#6332.
So, Restore should support input for the new net FDs. This patch adds
new field 'net_fds' to 'RestoreConfig'. The FDs passed using this new
field are replaced into the 'fds' field of NetConfig appropriately.

The 'validate()' function ensures all net devices from 'VmConfig' backed
by FDs have a corresponding 'RestoreNetConfig' with a matched 'id' and
expected number of FDs.

The unit tests provide different inputs to parse and validate functions
to make sure parsing and error handling is as per expectation.

Fixes cloud-hypervisor#6286

Signed-off-by: Purna Pavan Chandra <paekkaladevi@linux.microsoft.com>
Co-authored-by: Bo Chen <chen.bo@intel.com>
Consume FDs passed via SCM_RIGHTs to VmRestore API and assign them
appropriately to RestoredNetConfig's fds field.

Signed-off-by: Purna Pavan Chandra <paekkaladevi@linux.microsoft.com>
Enable restore command the ability to send file descriptors along with
HTTP request. This is useful when restoring a VM with explicit FDs
passed to NetConfig(s).

Signed-off-by: Purna Pavan Chandra <paekkaladevi@linux.microsoft.com>
Add a section about restoring VM with new Net FDs explicitly passed to
ch-remote via 'net_fds' parameter

Signed-off-by: Purna Pavan Chandra <paekkaladevi@linux.microsoft.com>
VM is created with FDs explicitly passed to CH via --net parameter
and snapshotted. New net FDs are passed in turn during restore.
Boilerplate code from _test_snapshot_restore().

Signed-off-by: Purna Pavan Chandra <paekkaladevi@linux.microsoft.com>
test_snapshot_restore_* tests often have transient failures and add to
overall flakiness of the integration testsuite. Hence, remove them from
common_parallel. However, these tests need to be added back to
common_sequential

Signed-off-by: Purna Pavan Chandra <paekkaladevi@linux.microsoft.com>
tests_snapshot_restore* have been earlier removed from common_parallel
due to the falkiness they add testsuite. Running them sequentially would
eliminate the flakiness. Hence, add the tests back to testsuite but into
common_sequential module.

Signed-off-by: Purna Pavan Chandra <paekkaladevi@linux.microsoft.com>
@pupacha
Copy link
Contributor Author

pupacha commented May 9, 2024

@rbradford Just rebased and pushed.

@rbradford rbradford added this pull request to the merge queue May 14, 2024
Merged via the queue into cloud-hypervisor:main with commit fe86f90 May 14, 2024
32 checks passed
@pupacha
Copy link
Contributor Author

pupacha commented May 14, 2024

Thanks @rbradford and @likebreath for the constant feedback and helping in landing these changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Restoring VM fails when using explicitly passed FDs
6 participants