Track delay caused by controllers' downtime #119
base: add-kos
Conversation
@@ -194,16 +194,28 @@ func (ca *ConnectionAgent) createLocalNetworkInterface(att *netv1a1.NetworkAttac
 	}
 	statusErrs = ca.launchCommand(parse.AttNSN(att), ifc.LocalNetIfc, att.Spec.PostCreateExec, ifc.postCreateExecReport.Store, "postCreate", true)
 	if att.Status.IfcName == "" {
+	lastClientWrName := att.LastClientWrite.Name
+	lastClientWrTime := att.LastClientWrite.Time.Time
After this PR is merged or closed I propose a PR to make all controllers' code switch from `time.Time` to `k8smetav1.MicroTime` to avoid the `.Time.Time` goofiness.
It is worth considering.
I do not see why we get to assume that controller failures are independent. Particularly when we bundle some together in one pod.
But I do not think your main idea requires assuming that controller failures are independent. The point is that delays recorded by the subnet validator and the IPAM controller do not need to be recorded again by CAs.
// Validate returns a list of errors carrying information on why
// the ExtendedObjectMeta is invalid, or an empty list if the ExtendedObjectMeta
// is valid.
func (eom ExtendedObjectMeta) Validate() field.ErrorList {
I have learned that it is considered bad form to put helper functions in the type definition files, even though it makes these helper functions more convenient to use. The theory is that we do not want to force every change (e.g., a bug fix) in the definition of a helper function to be an API change.
Makes sense. Is there a standard alternative? Dedicated files in the same pkg? Dedicated pkg under the pkg where the custom types are defined?
Putting them in the same package makes for the same problem, as far as much tooling is concerned, since Go is much more concerned with packages than files. In Kubernetes the helpers are published in a different repo under k8s.io. Here we do not have multiple repos, so a sufficiently distinct package seems like the right answer to me.
Done in 9e5ad76. There are also other helper functions that are unrelated to this PR and live in the same pkg as the type definitions; those are not moved, to keep this PR focused. After this PR I plan one housekeeping PR to do some refactorings and make some parts of the code more concise; the reorganization we're discussing will be done on that occasion.
func (eom ExtendedObjectMeta) Validate() field.ErrorList {
	errs := field.ErrorList{}
	if !eom.LastClientWrite.IsValid() {
		errs = append(errs, field.Invalid(field.NewPath("extendedObjectMeta.lastClientWrite"), eom.LastClientWrite, "only one of \"name\" and \"time\" is set, either none or both must be set."))
It is odd to separate the test from the error message that describes the violation. And this only works if the test function tests only one condition. I think it would be better to delegate the generation of the error(s) along with the testing. For example:
`errs = append(errs, eom.LastClientWrite.Validate(field.NewPath("extendedObjectMeta").Child("lastClientWrite"))...)`
Also, it is better to use the existing field constructors one step at a time.
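A sketch of the delegation pattern being suggested, with minimal hypothetical stand-ins for the `field` package's `Path` and `ErrorList` (the real code would use `k8s.io/apimachinery/pkg/util/validation/field`; the simplified `Time string` field is an assumption for illustration):

```go
package main

import "fmt"

// Path and ErrorList are hypothetical, simplified stand-ins for the
// corresponding types in k8s.io/apimachinery's field package.
type Path struct{ s string }

func NewPath(name string) *Path         { return &Path{name} }
func (p *Path) Child(name string) *Path { return &Path{p.s + "." + name} }
func (p *Path) String() string          { return p.s }

type Error struct{ Field, Detail string }
type ErrorList []*Error

// ClientWrite is a hypothetical stand-in for the LastClientWrite type.
type ClientWrite struct {
	Name string
	Time string // simplified: empty means unset
}

// Validate both tests the invariant and builds the matching error, so the
// message can never drift from the condition it describes.
func (cw ClientWrite) Validate(fldPath *Path) ErrorList {
	errs := ErrorList{}
	if (cw.Name == "") != (cw.Time == "") {
		errs = append(errs, &Error{
			Field:  fldPath.String(),
			Detail: `either both or neither of "name" and "time" must be set`,
		})
	}
	return errs
}

type ExtendedObjectMeta struct {
	LastClientWrite ClientWrite
}

// Validate delegates to the child, building the field path one step at a
// time with the path constructors.
func (eom ExtendedObjectMeta) Validate() ErrorList {
	errs := ErrorList{}
	errs = append(errs, eom.LastClientWrite.Validate(NewPath("extendedObjectMeta").Child("lastClientWrite"))...)
	return errs
}

func main() {
	bad := ExtendedObjectMeta{LastClientWrite: ClientWrite{Name: "na"}}
	for _, e := range bad.Validate() {
		fmt.Println(e.Field+":", e.Detail)
	}
}
```

With this shape, `ExtendedObjectMeta.Validate` never inspects the child's fields; adding a second invariant to `ClientWrite` touches only `ClientWrite.Validate`.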
Refactored that code in 5c55ad8, in the spirit of your suggestion.
SubnetClientWrite = "subnet"
NAClientWrite = "na"

SVControllerStart = "subnet_validator"
We do not need "Start" in these names; these names are just about identifiers for controllers. The fact that they are used for recording start times is covered by the type name "ControllerStart".
Fixed in 5c55ad8.
We probably want to do the same thing to the consts for ClientWrites (right above, lines 227-228), but we have to find a workaround, since simply dropping "ClientWrite" causes naming collisions.
@@ -1011,7 +1012,7 @@ func (ca *ConnectionAgent) touchStage1VNState(att k8stypes.NamespacedName, vni u

 	s1VNS := ca.s1VirtNetsState.vniToVNState[vni]

-	if !newRelevanceTime.Before(s1VNS.relevanceTime) {
+	if newRelevanceTime.After(s1VNS.relevanceTime) {
The change here is that previously the `return` would happen if the two times are equal, and now the `return` will not happen in that case. Why is this an improvement?
We get rid of a negation; what you wrote is true, but the "equal" case should be rare in practice. I'm OK with restoring the previous version though.
I would prefer to keep the original treatment of the equal case. The negation does not bother me as much as doing unnecessary logic.
Fixed in 5c55ad8.
@@ -107,14 +110,24 @@ func setupPrometheusMetrics() {
 			Namespace: metricsNamespace,
 			Subsystem: metricsSubsystem,
 			Name:      "subnet_create_to_validated_latency_seconds",
-			Help:      "Latency from subnet CreationTimestamp to return from update writing validation outcome in status per outcome, in seconds.",
+			Help:      "Latency from subnet NASectionSpec to return from update writing validation outcome in status per outcome, in seconds.",
s/NA/Subnet/ here
Fixed in 5c55ad8.
I have two comments regarding cutting down on the number of histograms of ((last controller start) - (last client write)).
(Resolved review threads on staging/kos/pkg/controllers/connectionagent/network_interface.go, marked outdated.)
Yes, forget about that claim.
If a subnet is validated with a delay, that delay affects a potentially high number of NAs, and it's useful to know that number. Discarding that histogram from the IPAM makes that impossible. One might correctly argue that we're still suppressing histograms without which we cannot count the number of delayed NAs. For instance, there's no histogram to track the delay for remote network interfaces on a node N given by
I do not see the advantage/necessity of discerning the IPAM/SV starts for the two NAs. I see no practical benefit (e.g., reduction of the number of metrics, more precise recording of delays, etc.). Conceptually, ignoring the distinction is not unsound: what matters is that if, when the last client write happened, one of the controllers was not running, there will be a delay and we record it; it doesn't really matter which implementation step was delayed (e.g., assignment of NA A's IP vs assignment of NA B's IP).
Not done yet answering all your comments... but some answers are already there.
(Resolved review threads on staging/kos/pkg/controllers/connectionagent/network_interface.go, marked outdated.)
I have been trying to cut down on the number of histograms reported. I recognize that this loses information about counts. As you noted, we already are losing information about counts. I agree we can keep a high level of ambition when it adds few histograms (as in the case of the IPAM controller noting delays due to subnet validator start), even though this is not totally consistent with giving up on some other precision regarding counts. The value of distinguishing all six relevant controller starts in the remote NA case is added precision in explanations and in thinking about the problem. Yes, we can lose information as we put the observations into histograms, as long as we clearly keep the distinction in the documentation and discussion. Regarding the nit noted at the end of #119 (comment), I updated the original text to more clearly express the imprecision that I meant.
Let me recap the discussion before we go ahead and code. What we agreed upon so far is that there will be no suppression of IPAM histograms, not even those which track a delay already measured by the subnet validator. We're accepting the inconsistency/ugliness that this entails. Things are less defined for the delay affecting remote interfaces. I'm afraid I don't have a clear-cut opinion.
I just re-read the whole discussion. I now fully agree that it's better to discern the starts of controllers that processed the 1st local NA (or its subnet) from the starts of the controllers that processed the remote NA (or its subnet) when recording latencies. I am also OK with suppressing observations for which an upper bound was already observed, even if this entails accepting an inconsistency in the design. This means that all histograms on the delay for remote interfaces can be suppressed, with two exceptions. More precisely, if a NA X is remote with respect to a node N and a remote network interface for X is created on N, the two histograms that cannot be suppressed are the ones where the last client write is for either X or X's subnet and the last controller start is the CA on N. An example of why is explained here. I am OK with keeping only those two histograms. However, notice that this inconsistency might look confusing to an operator looking at the dashboards; there would only be two dashboards, while intuitively there should be 24.
I do not follow the reduction from 100K to 10K in the last paragraph of #119 (comment). We have been planning to populate all the histograms from the start, so with a 6x4 vector of histograms there will be 24 populated histograms for each node. Even if nobody looks at a Grafana panel that queries them, their data will be scraped and stored. I do not like scrunching down the published controller delay metrics to only 2x1 histograms for remote NAs. But I think it may be the most practical approach we can think of right now. I think it is good enough to proceed with.
I share both your dislike for the reduction (and the inconsistency that it entails) as well as the conclusion that as of now it would be an acceptable solution. I will proceed with it. I am starting to get the feeling that Prometheus is inadequate for our needs and we'll have a really hard time finding a precise yet scalable solution.
Currently histograms are populated lazily and I was not planning on changing that. If you assume histograms are populated lazily, the argument about the reduction from 100K to 10K in the last paragraph of #119 (comment) will make sense. However, I just thought about it and realized that while lazy population is OK for the histograms we currently have and for those we will have after the reduction to only 2x1 histograms, there are many histograms in the 6x4 vector that we are currently suppressing that would not work well with lazy population.
On further reflection, there are 4x1 histograms we cannot suppress. Regardless of which is the last client write, if the last controller to start is the CA that creates the remote interface, there's no guarantee that an upper-bound delay has already been recorded; the argument is the same as the one in the quote above.
Tweak the way we record the delay due to downtime for remote interfaces: only the delays where the last controller start is the remote connection agent are tracked, because we want to minimize the number of metrics (for scalability) and those are the only delays which are not guaranteed to have been already tracked (either directly or indirectly via an upper bound).
Move the helper functions for new API types (ExtendedObjectMeta and children) to a dedicated package to avoid changing the API anytime a helper implementation changes. Helper functions dealing with API types which were already there are not moved, to keep the PR focused; that is left for the future.
Force-pushed from 8a483a4 to 6a67d3b.
6fba2d2 suppresses metrics on remote interface creation delay to 4x1 as described in my previous comment.
Add Prometheus metrics that track the delay by which NetworkAttachments and Subnets are implemented when such delay is caused by one or more KOS controllers being down.
This PR was heavily inspired by the idea outlined in the third paragraph of this comment (and some of the following messages in the same PR), but takes some variations. As pointed out there, tracking all the positive `(last_controller_start - last_client_write)` differences for every `(last_controller_start, last_client_write)` pair and for every "implementation action" requires 1x1 histograms for the subnet validator, 2x2 histograms at the IPAM controller, 2x3 histograms for the creation of local network interfaces by a connection agent, and 4x4 histograms for the creation of remote network interfaces by a connection agent => 2x3 + 4x4 = 22 new histograms for each connection agent, which is a lot.

Normally histograms in a HistogramVec can be materialized lazily; we would like that, as probably most of the aforementioned histograms would never materialize, since they only do when failures occur (actually I'm not sure about this assumption --- if KOS keeps running long enough, sooner or later a good portion of them would materialize). But we cannot do that if we want to display the histograms with PromQL's `rate()`, because for some of the histograms all observations happen during a very short transient right after the relevant controller starts => it's likely that all observations happen before the first scrape, so `rate()` detects no change and the plots spuriously show no value or zero value. Bottom line: a LOT of metrics must be materialized from the beginning even if most of them would never see an observation.

One hack to mitigate the problem is to avoid using PromQL's `rate()` for plotting the histograms (by plotting things that do not strictly need `rate()`). This way there's no need to materialize the histograms eagerly. This is what this PR and the corresponding PR in kos-test do.

Another strategy that this PR adopts to reduce the number of metrics is the following. If controller failures are rare, two controllers failing almost at the same time is rarer still. Under this assumption, most observations would be repeated across histograms for different controllers. Consider for instance the case where the IPAM (and only the IPAM) crashes and restarts at `t2`, and a NetworkAttachment X (whose subnet already exists) is created during the downtime at `t1` (obviously `t1 < t2`). At the new IPAM, upon assigning an address to X, an observation `t2 - t1` is added to the histogram recording delays in address assignments with labels `(last_client_write=NA_creation, last_controller_start=IPAM)`. Now consider what happens when X is processed by its local CA and a local network interface is created: the observation `t2 - t1` is added to the histogram recording delays in the creation of local network interfaces with the labels `(last_client_write=NA_creation, last_controller_start=IPAM)`. But that delay/observation was already recorded in the histogram recording delays for address assignments. So the idea of this PR is to suppress altogether the histogram recording delays in local network interface creation with labels `(last_client_write=NA_creation, last_controller_start=IPAM)`. Aside from this specific example, there are duplicated delays/observations between the IPAM controller and the CA (both for local and remote interface creation delays) and, within the CA, between local and remote interface creation delays. This PR uses this fact to reduce metrics. More precisely, there are still 1x1 and 2x2 histograms in the subnet validator and IPAM respectively, but the CA histograms for the delay regarding the creation of local network interfaces are reduced to 2x1, while the CA histograms for the delay regarding the creation of remote network interfaces are reduced to 2x1 + 2x1 histograms => from 22 new histograms per CA to 6 new histograms per CA.

Of course, it would be better to keep all the histograms, so that we can plot, for instance, the delay in creating local network interfaces for every `(last_client_write, last_controller_start)` pair right below the plot of the time from `last_client_write` to creation of the local network interface. With the approach of this PR it is not possible: some `(last_client_write, last_controller_start)` pairs do not exist in the aforementioned histogram but only in the histogram from NetAtt creation to address assignment, which lumps together observations for NetAtts on different nodes (unlike the histograms for delays affecting the creation of local network interfaces). IMO it is worth it: I see the metrics recording delays as something from which one can get a broad idea of what's happening (and this is still possible by looking at all the delay histograms from all the controllers at the same time), rather than a tool to understand what's happening with great precision.
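The histogram counts in the description can be checked with a little arithmetic (each product is client-write kinds times controller-start kinds for one implementation action):

```go
package main

import "fmt"

func main() {
	// Per connection agent, before the reduction: 2x3 histograms for
	// local interface creation plus 4x4 for remote interface creation.
	before := 2*3 + 4*4
	fmt.Println(before) // 22

	// After the reduction: 2x1 for local interfaces and 2x1 + 2x1 for
	// remote interfaces.
	after := 2*1 + (2*1 + 2*1)
	fmt.Println(after) // 6
}
```

The subnet validator's 1x1 and the IPAM's 2x2 histograms are per cluster, not per node, so the per-CA saving (22 down to 6) dominates at scale.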