Move init phase of KPR configuration outside of Hive Runtime #32216

Draft: wants to merge 2 commits into main

Contributor

@tommyp1ckles tommyp1ckles commented Apr 28, 2024

Please see commit message for detailed explanation.

Move part of the kube-proxy-replacement (KPR) configuration init logic to the daemon initialization phase to prevent incorrect node-port-enabled usage.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 28, 2024
@tommyp1ckles
Contributor Author

/test

@tommyp1ckles tommyp1ckles force-pushed the pr/tp/final-daemon-config-type branch 2 times, most recently from e8ce697 to d490771 Compare April 28, 2024 17:17
@tommyp1ckles tommyp1ckles changed the title Pr/tp/final daemon config type Move init phase of KPR configuration outside of Hive Runtime Apr 28, 2024
@tommyp1ckles
Contributor Author

/test

@tommyp1ckles tommyp1ckles added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Apr 28, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 28, 2024
@tommyp1ckles tommyp1ckles added the release-note/misc This PR makes changes that have no direct user impact. label Apr 28, 2024
@tommyp1ckles tommyp1ckles marked this pull request as ready for review April 30, 2024 00:45
@tommyp1ckles tommyp1ckles requested review from a team as code owners April 30, 2024 00:45
@tommyp1ckles tommyp1ckles marked this pull request as draft April 30, 2024 01:01
@tommyp1ckles tommyp1ckles added the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Apr 30, 2024
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/final-daemon-config-type branch from d490771 to e77a166 Compare May 1, 2024 23:39
@tommyp1ckles
Contributor Author

/test

@tommyp1ckles tommyp1ckles removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label May 1, 2024
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/final-daemon-config-type branch from e77a166 to 3a17272 Compare May 2, 2024 04:41
When creating a new daemon with newDaemon, we perform various procedures
to detect kpr config and subsequently override various config options, if
necessary.

Unlike most provided dependencies, the daemon built by newDaemon is
created inside the Hive lifecycle and exposed as a dependency inside a
promise.

This presents a problem when trying to write new modules that use the
global *option.DaemonConfig. Although this type is exposed to hive via
a Provide(...) call, some fields may change after the daemon has
started so attempting to use the config during the populate phase can
cause incorrect configuration to be used.
Notably, fields such as option.Config.EnableNodePort may change at runtime.

Trying to solve this by forcing a dependency on the daemon promise
fails in two ways:

  i.  Modules that require a finalized *DaemonConfig while being
      initialized cannot wait for promise.Promise[*Daemon] to resolve
      as this would cause a deadlock waiting for a runtime dependency.

  ii. The map group provided using bpf.MapOut[T](...) has a dependency on
      daemonPromise (these are synced before the loader dependency).
      So trying to depend on this would cause a Hive dependency cycle.

Most of the kpr config init can be done prior to runtime, as finalizing
configuration seems to be a reasonable thing to do in the Hive init
(i.e. Populate(...)) phase.

This moves all such work out of newDaemon and outside of Hive's
lifecycle hooks, while keeping kpr probing out of the init function and
inside the newDaemon func.

To allow syncing on this procedure, the *option.DaemonConfig is
re-provided as the alias *option.FinalDaemonConfig so that dependencies
can be ordered after it.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
The MTU cell relies on option.Config.EnableNodePort which, as
described in commit c2d65fe61a9231ee961097f894466727a2438768, is
prone to a race where the daemon changes this configuration at runtime.

This uses the newly introduced option.(*FinalDaemonConfig) type as a
parameter dependency to prevent such issues with MTU.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
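
The ordering problem described in the two commit messages above can be illustrated with a small standalone sketch. It does not use Hive or any Cilium package: DaemonConfig is a one-field stand-in for option.DaemonConfig, and mtuSettings, newMTUSettings and initKPR are hypothetical names invented for this example. It only shows why a module that copies the flag at construction time must be built after the KPR override has run.

```go
package main

import "fmt"

// DaemonConfig is a stand-in for option.DaemonConfig with a single field.
type DaemonConfig struct{ EnableNodePort bool }

// mtuSettings is a hypothetical module that copies the flag when constructed.
type mtuSettings struct{ nodePort bool }

func newMTUSettings(cfg *DaemonConfig) mtuSettings {
	return mtuSettings{nodePort: cfg.EnableNodePort}
}

// initKPR stands in for the KPR init logic that may override the flag.
func initKPR(cfg *DaemonConfig) { cfg.EnableNodePort = true }

// constructThenInit mirrors the problematic order: the module is built
// first and the config is overridden afterwards.
func constructThenInit() (module, config bool) {
	cfg := &DaemonConfig{}
	m := newMTUSettings(cfg)
	initKPR(cfg)
	return m.nodePort, cfg.EnableNodePort
}

// initThenConstruct mirrors the intended order: the override runs before
// any module reads the config.
func initThenConstruct() (module, config bool) {
	cfg := &DaemonConfig{}
	initKPR(cfg)
	m := newMTUSettings(cfg)
	return m.nodePort, cfg.EnableNodePort
}

func main() {
	m, c := constructThenInit()
	fmt.Println("construct then init:", m, c) // module and config disagree
	m, c = initThenConstruct()
	fmt.Println("init then construct:", m, c) // module and config agree
}
```

In the first order the module keeps a stale copy of the flag, which is the kind of mismatch the PR tries to rule out by making the ordering an explicit dependency.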
@tommyp1ckles tommyp1ckles force-pushed the pr/tp/final-daemon-config-type branch from 3a17272 to 34c0202 Compare May 2, 2024 04:43
@tommyp1ckles
Contributor Author

/test

@tommyp1ckles tommyp1ckles marked this pull request as ready for review May 2, 2024 05:43
@tommyp1ckles tommyp1ckles requested a review from a team as a code owner May 2, 2024 05:43
Contributor

@derailed derailed left a comment

@tommyp1ckles Nice work Tom!

@@ -1356,6 +1356,14 @@ func LogRegisteredOptions(vp *viper.Viper, entry *logrus.Entry) {
}
}

// FinalDaemonConfig is a alias for DaemonConfig used to differentiate what stage of daemon option init
// the config is when it is provided.
type FinalDaemonConfig DaemonConfig
Contributor

I am not sure I understand the intent for this. It's a parallel type and not an alias and it wraps an exported field. Is there an interface at play here?
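
For readers following the alias-versus-type question, here is a minimal, self-contained sketch of the Go semantics involved. The struct is a one-field stand-in for option.DaemonConfig, and asFinal is a hypothetical helper, not something taken from the PR. A declaration of the form `type A B` creates a distinct defined type (a true alias would be written `type A = B`), but a pointer to one can be converted to a pointer to the other without copying.

```go
package main

import "fmt"

// DaemonConfig stands in for option.DaemonConfig; only one field is modeled.
type DaemonConfig struct {
	EnableNodePort bool
}

// FinalDaemonConfig is a defined type with the same underlying struct.
// It is not an alias; that would be `type FinalDaemonConfig = DaemonConfig`.
type FinalDaemonConfig DaemonConfig

// asFinal converts the pointer without copying, so both views share memory.
func asFinal(cfg *DaemonConfig) *FinalDaemonConfig {
	return (*FinalDaemonConfig)(cfg)
}

func main() {
	cfg := &DaemonConfig{}
	final := asFinal(cfg)

	// A write through the original pointer is visible through the converted one.
	cfg.EnableNodePort = true
	fmt.Println(final.EnableNodePort)
}
```

Because the two types are distinct, a dependency injection framework that keys on types can treat them as separate dependencies even though they point at the same data. Methods declared on DaemonConfig are not carried over to FinalDaemonConfig.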

Contributor

@joamaki joamaki May 9, 2024

This seems more complicated than it needs to be. Could we instead just use the DaemonConfig type and have the provider for it also do the KPR etc. init/validation? (daemon/cmd/cells.go:91).

I'm worried that things outside NewDaemon might also want to use this flag and they might not be ordered correctly if they don't use FinalDaemonConfig. So seems safer to just have one type whose provider correctly initializes it.

Contributor Author

@derailed The aliased type allows us to "re-provide" the same var (i.e. *option.DaemonConfig) but in a later state of initialization. That is, we Provide DaemonConfig to hive for convenience, but the underlying data is a global variable in pkg/option. This var is modified at various points during both init and runtime, for things like overriding required config for KPR.

This breaks assumptions about Hive dependency ordering and leaves us with configuration that cannot be finalized prior to runtime.

So by providing the FinalDaemonConfig after the kpr init stuff, we provide a synchronization point to Hive where we know that the KPR init has already occurred.

i.e.:

// Arbitrary ordering, KPR init may not be complete prior to this: 
cell.Provide(func(_ *option.DaemonConfig) {})

// This depends on KPR init, so using fields mutated by that routine is "safe": 
cell.Provide(func(_ *option.FinalDaemonConfig) {})

Contributor Author

@joamaki this was the approach I tried initially, albeit with a "pre-init" DaemonConfig alias, but since this is global we might be able to just use the global var there and then provide DaemonConfig as a "post-kpr-init" state as the primary dependency.

Contributor Author

The "Final" part of this is actually misleading, because this is only "final" for the KPR-related fields, which might cause confusion. So I think reworking this makes sense.

If these approaches prove unwieldy, it may make sense to step back and come up with a plan for how to fix this in its entirety (i.e. get rid of the global *option.DaemonConfig). In the meantime we could also solve this by having the KPR init provider provide a token dependency (e.g. postKPRInit{}), so that anything depending on it would be forced to wait for KPR init to complete.
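
The token dependency idea can be sketched without Hive. The example below uses a toy, reflection-based resolver written only for illustration; it is not Hive's implementation, and container, provide, invoke and readNodePort are all hypothetical names. The only point is that a function that takes postKPRInit as a parameter cannot run until the provider that performs the KPR override has run.

```go
package main

import (
	"fmt"
	"reflect"
)

// DaemonConfig stands in for the global *option.DaemonConfig.
type DaemonConfig struct{ EnableNodePort bool }

// postKPRInit is the token type: it carries no data and only proves that
// the (stand-in) KPR init has run.
type postKPRInit struct{}

// container is a toy dependency resolver, not Hive's implementation.
type container struct {
	providers map[reflect.Type]reflect.Value // output type -> constructor
	values    map[reflect.Type]reflect.Value // output type -> built value
}

func newContainer() *container {
	return &container{
		providers: map[reflect.Type]reflect.Value{},
		values:    map[reflect.Type]reflect.Value{},
	}
}

// provide registers a constructor under its single return type.
func (c *container) provide(ctor any) {
	v := reflect.ValueOf(ctor)
	c.providers[v.Type().Out(0)] = v
}

// resolve returns the value of type t, building its dependencies first.
func (c *container) resolve(t reflect.Type) reflect.Value {
	if v, ok := c.values[t]; ok {
		return v
	}
	ctor, ok := c.providers[t]
	if !ok {
		panic("no provider for " + t.String())
	}
	out := c.call(ctor)[0]
	c.values[t] = out
	return out
}

// call invokes fn with every parameter resolved from the container.
func (c *container) call(fn reflect.Value) []reflect.Value {
	args := make([]reflect.Value, fn.Type().NumIn())
	for i := range args {
		args[i] = c.resolve(fn.Type().In(i))
	}
	return fn.Call(args)
}

// invoke runs fn for its side effects.
func (c *container) invoke(fn any) { c.call(reflect.ValueOf(fn)) }

// readNodePort wires the sketch together and reports the flag as seen by a
// consumer that depends on the token.
func readNodePort() bool {
	cfg := &DaemonConfig{}
	c := newContainer()
	c.provide(func() *DaemonConfig { return cfg })
	c.provide(func(cfg *DaemonConfig) postKPRInit {
		cfg.EnableNodePort = true // stand-in for the KPR override
		return postKPRInit{}
	})

	var seen bool
	c.invoke(func(_ postKPRInit, cfg *DaemonConfig) { seen = cfg.EnableNodePort })
	return seen
}

func main() { fmt.Println(readNodePort()) }
```

A consumer that asks only for *DaemonConfig gets no such guarantee in this sketch, which is the same ordering concern raised about FinalDaemonConfig in the review comments.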

@tommyp1ckles tommyp1ckles marked this pull request as draft May 9, 2024 17:37
Labels
release-note/bug This PR fixes an issue in a previous release of Cilium. release-note/misc This PR makes changes that have no direct user impact.

3 participants