Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: improving communication between squeezed and guest #5030

Open
ydirson opened this issue May 31, 2023 · 3 comments
Open

RFC: improving communication between squeezed and guest #5030

ydirson opened this issue May 31, 2023 · 3 comments

Comments

@ydirson
Copy link
Contributor

ydirson commented May 31, 2023

From what I understand of the way squeezed communicates with the guest, I feel a couple of points could be improved:

  • squeezed sets ~/memory/target for the in-guest balloon driver, and considers the request applied if totpages (from get_domain_info) matches the request modulo a calibration offset, and marks the domain's balloon as unresponsive if this does not happen in a given timeframe (5sec without any progress): this seems inefficient in itself (see below)
  • the calibration offset is computed as the delta between initial ~/memory/target value, and totpages at the time ~/control/feature-balloon is set: today the latter is created by the xe-daemon guest utility, which has no knowledge at all of the balloon driver state, so does not seem to offer any guarantee the balloon driver has indeed reached its target, so it would seem more correct to have this control node written by the ballooning driver instead

Thinking further from there, I'm wondering why squeezed really has to do this tracking by itself. Assuming that the balloon driver has less assumptions to make than squeezed to determine if it has reached its target, I'd think that something along those lines should do the job:

  • initially the toolstack creates a ~/control/ballooning entry as empty
  • when the driver gets notified of a change to ~/memory/target, it then sets ~/control/ballooning to 1 and starts working towards the new target
  • when the driver reaches the target, it sets ~/control/ballooning to 0

This way:

  • squeezed does not have do do any guesswork, or even assume that totpages - memory_target is fixed during a domain's lifetime
  • userspace guest tools need not be involved at all in a process they have no clue about
  • this provides a control interface that could be upstreamed at Xen level

Obviously it would need to modify both squeezed and the balloon drivers for various guest OS, and a migration path needs to be defined:

  • new toolstack / old guest driver: ~/control/ballooning stays empty, and toolstack can fallback to its previous behaviour
  • old toolstack / new guest driver/tools: the driver and guest tools will not see a ~/control/ballooning node, and can fallback to their previous behaviour
@psafont
Copy link
Member

psafont commented Jun 1, 2023

The general approach looks reasonable to me, I only have nitpicks:

  • Can this protocol be documented upstream? so that it's actually a standard and not just "it's implemented like this" It would be good to get a collection of protocols using xenstore for various features

  • What can a malicious guest agent do with this new protocol? I see a few edge cases here that are not made explicit in the proposal. Now the guest decides / reports when the ballooning has finished, is it a problem if this is reported before it has actually finished? What if it actually reports as ballooning but willingly never starts?

  • Is it worth following a more structured approach for xenstore keys? Since this adds new keys anyway I was thinking of creating the subtree ~/control/balloon for the new protocol. I do not know why some of the keys in the protocol are not under ~/control, is there anything stopping these from being recreated in the new subtree?

  • Is this a fair assumption? Is this true for the rust agent you're writing?

Assuming that the balloon driver has less assumptions to make than squeezed to determine if it has reached its target

@ydirson
Copy link
Contributor Author

ydirson commented Jun 1, 2023

The general approach looks reasonable to me, I only have nitpicks:

  • Can this protocol be documented upstream? so that it's actually a standard and not just "it's implemented like this" It would be good to get a collection of protocols using xenstore for various features

Yes this would be the best place for this protocol, I'm checking here first whether this would get traction, and to make sure I did not miss something crucial.

  • What can a malicious guest agent do with this new protocol? I see a few edge cases here that are not made explicit in the proposal. Now the guest decides / reports when the ballooning has finished, is it a problem if this is reported before it has actually finished? What if it actually reports as ballooning but willingly never starts?

Yes, blindly trusting the guest is likely not a good idea. Note the current squeezed behaviour already has a problem there, as it is already waiting for a signal that the guest reached the initial target, and guest signaling too early will yield a wrong toolstack calibration. We should likely keep the "timeout" logic on toolstack side, where a guest would be flagged as balloon-unresponsive

While this clearly needs more detailed thought, the maximum impact must not be worse than getting a non-ballooned guest.

  • Is it worth following a more structured approach for xenstore keys? Since this adds new keys anyway I was thinking of creating the subtree ~/control/balloon for the new protocol. I do not know why some of the keys in the protocol are not under ~/control, is there anything stopping these from being recreated in the new subtree?

I would guess ~/memory/target predates the standardization attempt? There are a few other entries in ~/memory/ that would fit in ~/control, and other internal toolstack-specific that could better live in something to be described under "Paths private to the toolstack". That could be a good occasion to suggest improving this too.

  • Is this a fair assumption? Is this true for the rust agent you're writing?

Assuming that the balloon driver has less assumptions to make than squeezed to determine if it has reached its target

I'm not sure if the questions refer to this statement - that assumption is about driver and squeezed. For now the prototype Rust agent does not attempt to do more than the current Go one - it is while pondering if something ought to be improved that I came to this more global proposal.

@ydirson
Copy link
Contributor Author

ydirson commented Jun 1, 2023

Another thing to dig: the squeezed logs make me suspicious that it's not exactly doing what the doc claims:

Jun  1 10:21:34 xcp-ng-bzkcpvhy squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- 1
Jun  1 10:21:37 xcp-ng-bzkcpvhy squeezed: [debug||3 ||squeeze_xen] domid 2 just started a guest agent (but has no balloon driver); calibrating memory-offset = 2036 KiB
Jun  1 10:21:37 xcp-ng-bzkcpvhy squeezed: [debug||4 ||squeeze_xen] watch /memory/memory-offset <- 2036
Jun  1 10:21:37 xcp-ng-bzkcpvhy squeezed: [debug||3 ||squeeze_xen] Xenctrl.domain_setmaxmem domid=2 max=4197364 (was=4231168)
Jun  1 10:21:59 xcp-ng-bzkcpvhy squeezed: [debug||4 ||squeeze_xen] watch /control/feature-balloon <- 1
Jun  1 10:22:00 xcp-ng-bzkcpvhy squeezed: [debug||4 ||squeeze_xen] watch /data/updated <- Thu Jun  1 10:21:59 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants