The global instance of *option.DaemonConfig (i.e. option.Config) is our legacy daemon configuration struct, containing many critical config flags. Historically, all config was added to this struct and populated via viper, which parses flags and config input.
Since moving to Hive, we've continued to use it for legacy reasons, as it is heavily intertwined with many runtime startup processes in newDaemon. To complicate matters further, newDaemon performs many parameter overrides and may mutate various fields.
The problem arises from the fact that option.Config is accessed both as a global variable and as a pointer provided to Hive. The latter is misleading in that taking an *option.DaemonConfig reference as a dependency does not actually enforce any ordering constraints.
Furthermore, because newDaemon actually runs after the Hive has started, configuration that can be mutated at this stage is not "finalized" until runtime. That is, the only way to synchronize on a finalized version of DaemonConfig is to Await() on the promise.Promise[*Daemon] type.
This is not ideal, as it may force moving logic into the "runtime" phase (i.e. after hive.Start()), essentially re-implementing dependency ordering using promise types.
The ultimate solution would be to break up DaemonConfig and move various bits of configuration into config structs provided by cell.Config. However, this is easier said than done. Take for example EnableNodePort, which can be overridden in at least these three places (specifically, in the order presented):
1. initKubeProxyReplacement: first pass at initializing KPR config; performs host OS probing to determine configuration overrides.
2. Device detection: if certain interface devices cannot be detected, we disable enable-node-port.
Thus, to make EnableNodePort stable prior to runtime, we'd need to move all of the above logic into the Populate phase.
This seems not ideal for the following reasons:
1. Ideally, the Populate phase should minimize IO-based tasks, as its purpose is to create a dependency graph based only on static configuration.
2. Moving things like KPR probing or device detection into the Populate phase means that commands such as cilium-agent hive could now fail due to host OS state.
3. Errors returned from hive.Populate() are not really intended to ergonomically expose user-facing errors, as the error would be prefixed with a lot of hive dependency context.
As described in this issue: cilium#32557
Access to a subset of fields in *DaemonConfig is dangerous as
they are not finalized until runtime.
This makes access to those fields explicit by putting them behind
a call to Volatile() - as well as any derived config function (i.e.
anything that uses a 'volatile' field).
This is intended to clearly distinguish daemon config that is not
finalized until newDaemon is complete.
In subsequent commits we will hook functionality into the Volatile
function to check for improper usage.
Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.co
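The commit message above describes gating the unstable fields behind a Volatile() call but does not show the accessor itself. A minimal sketch of what such a gate could look like follows; all names here and the finalization check are illustrative assumptions, not the actual Cilium change.

```go
package main

import "fmt"

// VolatileConfig holds the fields that newDaemon may still mutate;
// they are only safe to read once the daemon marks them finalized.
type VolatileConfig struct {
	EnableNodePort bool
}

// DaemonConfig gates the volatile subset behind an explicit call.
type DaemonConfig struct {
	// finalized would be set at the end of newDaemon; the commit
	// message says later commits hook misuse checks into Volatile().
	finalized bool
	volatile  VolatileConfig
}

// Volatile returns the not-yet-stable fields, making every access
// to them explicit (and checkable) at the call site.
func (c *DaemonConfig) Volatile() *VolatileConfig {
	if !c.finalized {
		// A later commit could log, count, or panic here on improper usage.
		fmt.Println("warning: reading volatile config before finalization")
	}
	return &c.volatile
}

func main() {
	cfg := &DaemonConfig{}
	// Explicit: the caller can see this field is not stable yet.
	_ = cfg.Volatile().EnableNodePort
	cfg.finalized = true
	fmt.Println("EnableNodePort:", cfg.Volatile().EnableNodePort)
}
```

The point of the indirection is purely legibility and enforcement: grepping for Volatile() finds every access to an unfinalized field, and the accessor is a single choke point for the misuse checks mentioned in the commit message.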
Is there an existing issue for this?
What happened?
Background
The global instance of *option.DaemonConfig (i.e. option.Config) is our legacy daemon configuration struct, containing many critical config flags. Historically, all config was added to this struct and populated via viper, which parses flags and config input.

Since moving to Hive, we've continued to use it for legacy reasons, as it is heavily intertwined with many runtime startup processes in newDaemon. To complicate matters further, newDaemon performs many parameter overrides and may mutate various fields.
For example:
https://github.com/cilium/cilium/blob/main/daemon/cmd/kube_proxy_replacement.go#L55-L59
Problem
The problem arises from the fact that option.Config is accessed both as a global variable and as a pointer provided to Hive. The latter is misleading in that taking an *option.DaemonConfig reference as a dependency does not actually enforce any ordering constraints.

Furthermore, because newDaemon actually runs after the Hive has started, configuration that can be mutated at this stage is not "finalized" until runtime. That is, the only way to synchronize on a finalized version of DaemonConfig is to Await() on the promise.Promise[*Daemon] type. This is not ideal, as it may force moving logic into the "runtime" phase (i.e. after hive.Start()), essentially re-implementing dependency ordering using promise types.

The ultimate solution would be to break up DaemonConfig and move various bits of configuration into config structs provided by cell.Config. However, this is easier said than done. Take for example EnableNodePort, which can be overridden in at least these three places (specifically, in the order presented):

1. initKubeProxyReplacement: first pass at initializing KPR config; performs host OS probing to determine configuration overrides.
2. Device detection: if certain interface devices cannot be detected, we disable enable-node-port.

Thus, to make EnableNodePort stable prior to runtime, we'd need to move all of the above logic into the Populate phase.
This seems not ideal for the following reasons:

1. Ideally, the Populate phase should minimize IO-based tasks, as its purpose is to create a dependency graph based only on static configuration.
2. Moving things like KPR probing or device detection into the Populate phase means that commands such as cilium-agent hive could now fail due to host OS state.
3. Errors returned from hive.Populate() are not really intended to ergonomically expose user-facing errors, as the error would be prefixed with a lot of hive dependency context.

Mitigation and Solutions
Proposed mitigation: #32558
Cilium Version
All existing versions
Kernel Version
n/a
Kubernetes Version
n/a
Regression
n/a
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Cilium Users Document
Code of Conduct