Track delay caused by controllers' downtime #119

Open

matte21 wants to merge 5 commits into add-kos from track-delay-caused-by-controllers-downtime

Conversation

@matte21 (Collaborator) commented Jun 17, 2020

Add Prometheus metrics that track the delay with which NetworkAttachments and Subnets are implemented when that delay is caused by one or more KOS controllers being down.

This PR was heavily inspired by the idea outlined in the third paragraph of this comment (and some of the following messages in the same PR), but with some variations. As pointed out there, tracking every positive (last_controller_start - last_client_write) for every (last_controller_start, last_client_write) pair and for every "implementation action" requires 1x1 histograms for the subnet validator, 2x2 histograms at the IPAM controller, 2x3 histograms for the creation of local network interfaces by a connection agent, and 4x4 histograms for the creation of remote network interfaces by a connection agent => 2x3 + 4x4 = 22 new histograms for each connection agent, which is a lot.

Normally the histograms in a HistogramVec can be materialized lazily; we would like that, because most of the aforementioned histograms would probably never materialize, since they only do when failures occur (actually I'm not sure about this assumption --- if KOS keeps running long enough, sooner or later a good portion of them would materialize). But we cannot do that if we want to display the histograms with PromQL's rate(): for some of the histograms all observations happen during a very short transient right after the relevant controller starts, so it is likely that all observations happen before the first scrape, rate() detects no change, and the plots spuriously show no value or a zero value. Bottom line: a LOT of metrics must be materialized from the beginning even though most of them would never see an observation.
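For concreteness, a minimal hypothetical sketch of one such histogram vector (the metric name, label names, and label values below are illustrative, not the ones in this PR), showing the eager materialization that rate() would require and the recording of a positive (last_controller_start - last_client_write) delay:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var implementationDelay = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "kos",
		Subsystem: "example",
		Name:      "implementation_downtime_delay_seconds",
		Help:      "Delay caused by controller downtime, by last client write and last controller start.",
	},
	[]string{"last_client_write", "last_controller_start"},
)

func main() {
	prometheus.MustRegister(implementationDelay)

	// Eager materialization: touching every label combination at startup
	// creates all child histograms immediately, so they are exported (with
	// zero counts) before the first scrape and rate() can later see a change.
	for _, w := range []string{"subnet", "na"} {
		for _, s := range []string{"subnet_validator", "ipam"} {
			implementationDelay.WithLabelValues(w, s)
		}
	}

	// Only positive (last controller start - last client write) differences
	// are observed.
	lastClientWrite := time.Now().Add(-30 * time.Second)
	lastControllerStart := time.Now()
	if delay := lastControllerStart.Sub(lastClientWrite); delay > 0 {
		implementationDelay.WithLabelValues("na", "ipam").Observe(delay.Seconds())
	}
}
```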

One hack that mitigates the problem is to avoid using PromQL's rate() when plotting the histograms (i.e., to plot only quantities that do not strictly need rate()). This way there is no need to materialize the histograms eagerly. This is what this PR and the corresponding PR in kos-test do.

Another strategy that this PR adopts to reduce the number of metrics is the following. If controller failures are rare, two controllers failing at almost the same time is rarer still. Under this assumption, most observations would be repeated across histograms for different controllers. Consider for instance the case where the IPAM (and only the IPAM) crashes and restarts at t2, and a NetworkAttachment X (whose subnet already exists) is created during the downtime at t1 (obviously t1 < t2). At the new IPAM, upon assigning an address to X, an observation t2 - t1 is added to the histogram recording delays in address assignments, with labels (last_client_write=NA_creation, last_controller_start=IPAM). Now consider what happens when X is processed by its local CA and a local network interface is created: the observation t2 - t1 is added to the histogram recording delays in the creation of local network interfaces, with the labels (last_client_write=NA_creation, last_controller_start=IPAM). But that delay/observation was already recorded in the histogram for address assignments. So the idea of this PR is to suppress altogether the histogram recording delays in local network interface creation with labels (last_client_write=NA_creation, last_controller_start=IPAM).

Beyond this specific example, there are duplicated delays/observations between the IPAM controller and the CA (both for local and for remote interface creation delays) and, within the CA, between local and remote interface creation delays; this PR uses that fact to reduce the metrics. More precisely, there are still 1x1 and 2x2 histograms in the subnet validator and the IPAM respectively, but the CA histograms for the delay in creating local network interfaces are reduced to 2x1, and the CA histograms for the delay in creating remote network interfaces are reduced to 2x1 + 2x1 => from 22 new histograms per CA down to 6.

Of course, it would be better to keep all the histograms, so that we could plot, for instance, the delay in creating local network interfaces for every (last_client_write, last_controller_start) pair right below the plot of the time from last_client_write to the creation of the local network interface. With the approach of this PR that is not possible: some (last_client_write, last_controller_start) pairs do not exist in that histogram but only in the histogram from NetAtt creation to address assignment, which lumps together observations for NetAtts on different nodes (unlike the histograms for delays affecting the creation of local network interfaces). IMO it is worth it: I see the metrics recording delays as something from which one can get a broad idea of what's happening (which is still possible by looking at all the delay histograms from all the controllers at the same time), rather than a tool for understanding what's happening with great precision.

@@ -194,16 +194,28 @@ func (ca *ConnectionAgent) createLocalNetworkInterface(att *netv1a1.NetworkAttac
}
statusErrs = ca.launchCommand(parse.AttNSN(att), ifc.LocalNetIfc, att.Spec.PostCreateExec, ifc.postCreateExecReport.Store, "postCreate", true)
if att.Status.IfcName == "" {
lastClientWrName := att.LastClientWrite.Name
lastClientWrTime := att.LastClientWrite.Time.Time
matte21 (Collaborator, Author):

After this PR is merged or closed I propose a PR to make all controllers' code switch from time.Time to k8smetav1.MicroTime to avoid the .Time.Time goofiness.
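A minimal sketch of what that switch could look like, assuming a hypothetical ClientWrite type with name and time fields (the real types and names in the PR may differ):

```go
package main

import (
	"fmt"
	"time"

	k8smetav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClientWrite is a hypothetical stand-in for the write record carried in
// ExtendedObjectMeta.
type ClientWrite struct {
	Name string
	Time k8smetav1.MicroTime
}

func main() {
	lastClientWrite := ClientWrite{
		Name: "na",
		Time: k8smetav1.NewMicroTime(time.Now().Add(-30 * time.Second)),
	}
	controllerStart := k8smetav1.NowMicro()

	// MicroTime defines Before/Equal directly, so controllers can compare
	// without unwrapping; the embedded time.Time is touched only where
	// duration arithmetic is needed.
	if lastClientWrite.Time.Before(&controllerStart) {
		delay := controllerStart.Sub(lastClientWrite.Time.Time)
		fmt.Println("delay caused by downtime:", delay)
	}
}
```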

MikeSpreitzer (Owner):

It is worth considering.

(resolved review thread on staging/kos/pkg/apis/network/types.go)
@MikeSpreitzer (Owner) left a comment:

I do not see why we get to assume that controller failures are independent. Particularly when we bundle some together in one pod.

But I do not think your main idea requires assuming that controller failures are independent. The point is that delays recorded by the subnet validator and the IPAM controller do not need to be recorded again by CAs.

// Validate returns a list of errors carrying information on why
// the ExtendedObjectMeta is invalid, or an empty list if the ExtendedObjectMeta
// is valid.
func (eom ExtendedObjectMeta) Validate() field.ErrorList {
MikeSpreitzer (Owner):

I have learned that it is considered bad form to put helper functions in the type definition files, even though it makes these helper functions more convenient to use. The theory is that we do not want to force every change (e.g., a bug fix) in the definition of a helper function to be an API change.

matte21 (Collaborator, Author):

Makes sense. Is there a standard alternative? Dedicated files in the same pkg? Dedicated pkg under the pkg where the custom types are defined?

MikeSpreitzer (Owner):

Keeping them in the same package makes for the same problem as far as much tooling is concerned, since Go is much more concerned with packages than with files. In Kubernetes the helpers are published in a different repo under k8s.io. Here we do not have multiple repos, so a sufficiently distinct package seems like the right answer to me.
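A sketch of what that separation could look like. In the real tree, ExtendedObjectMeta would stay in the API types package (e.g. staging/kos/pkg/apis/network) and the helper would live in a distinct package that imports it; the types are stubbed here, with hypothetical names, only to keep the sketch self-contained.

```go
package helpers

import (
	k8smetav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// Hypothetical, trimmed stand-ins for the API types, which would really be
// imported from the types package.
type ClientWrite struct {
	Name string
	Time k8smetav1.MicroTime
}

type ExtendedObjectMeta struct {
	LastClientWrite ClientWrite
}

// ValidateExtendedObjectMeta plays the role of the former
// ExtendedObjectMeta.Validate method; as a plain function outside the types
// package, a bug fix in it no longer counts as an API change.
func ValidateExtendedObjectMeta(eom ExtendedObjectMeta, fldPath *field.Path) field.ErrorList {
	errs := field.ErrorList{}
	nameSet, timeSet := eom.LastClientWrite.Name != "", !eom.LastClientWrite.Time.IsZero()
	if nameSet != timeSet {
		errs = append(errs, field.Invalid(fldPath.Child("lastClientWrite"), eom.LastClientWrite,
			"name and time must be both set or both unset"))
	}
	return errs
}
```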

@matte21 (Collaborator, Author) commented Sep 6, 2020:

Done in 9e5ad76. There are also other helper functions that are unrelated to this PR and live in the same pkg as the type definitions; those are not moved, to keep this PR focused. After this PR I plan one housekeeping PR to do some refactoring and make some parts of the code more concise; the reorganization we're discussing will be done on that occasion.

func (eom ExtendedObjectMeta) Validate() field.ErrorList {
errs := field.ErrorList{}
if !eom.LastClientWrite.IsValid() {
errs = append(errs, field.Invalid(field.NewPath("extendedObjectMeta.lastClientWrite"), eom.LastClientWrite, "only one of \"name\" and \"time\" is set, either none or both must be set."))
MikeSpreitzer (Owner):

It is odd to separate the test from the error message that describes the violation. And this only works if the test function tests only one condition. I think it would be better to delegate the generation of the error(s) along with the testing. For example:

   errs = append(errs, eom.LastClientWrite.Validate(field.NewPath("extendedObjectMeta").Child("lastClientWrite"))...)

Also, it is better to use the existing field constructors one step at a time.
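As an illustration of that suggestion, a minimal sketch of a delegated validator, assuming a hypothetical ClientWrite type with name and time fields (the names and the error message are illustrative):

```go
package network

import (
	k8smetav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// ClientWrite is a hypothetical stand-in for the real type.
type ClientWrite struct {
	Name string
	Time k8smetav1.MicroTime
}

// Validate performs the test and builds the matching error in the same place,
// so callers only splice the returned list into their own.
func (cw ClientWrite) Validate(fldPath *field.Path) field.ErrorList {
	errs := field.ErrorList{}
	nameSet, timeSet := cw.Name != "", !cw.Time.IsZero()
	if nameSet != timeSet {
		errs = append(errs, field.Invalid(fldPath, cw, "name and time must be both set or both unset"))
	}
	return errs
}
```

The caller then reads as in the suggestion above, with the path built one step at a time via field.NewPath("extendedObjectMeta").Child("lastClientWrite").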

matte21 (Collaborator, Author):

Refactored that code in 5c55ad8, in the spirit of your suggestion.

SubnetClientWrite = "subnet"
NAClientWrite = "na"

SVControllerStart = "subnet_validator"
MikeSpreitzer (Owner):

We do not need "Start" in these names; these names are just about identifiers for controllers. The fact that they are used for recording start times is covered by the type name "ControllerStart".

matte21 (Collaborator, Author):

Fixed in 5c55ad8.

We probably want to do the same thing to the consts for ClientWrites (right above, lines 227-228), but we have to find a workaround since simply dropping "ClientWrite" causes naming collisions.
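One conceivable workaround, not what this PR does and with purely illustrative names, is to give the two kinds of label values distinct string types, so the constants can be shortened without colliding with existing identifiers:

```go
package main

// Distinct types make the role of each constant explicit at its use sites,
// while short but still unique names (e.g. a "Write" suffix) avoid clashing
// with other identifiers in the package.
type ClientWrite string
type ControllerStart string

const (
	SubnetWrite ClientWrite = "subnet"
	NAWrite     ClientWrite = "na"

	SV   ControllerStart = "subnet_validator"
	IPAM ControllerStart = "ipam"
	CA   ControllerStart = "connection_agent"
)

func main() {
	_ = []ClientWrite{SubnetWrite, NAWrite}
	_ = []ControllerStart{SV, IPAM, CA}
}
```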

@@ -1011,7 +1012,7 @@ func (ca *ConnectionAgent) touchStage1VNState(att k8stypes.NamespacedName, vni u

s1VNS := ca.s1VirtNetsState.vniToVNState[vni]

- if !newRelevanceTime.Before(s1VNS.relevanceTime) {
+ if newRelevanceTime.After(s1VNS.relevanceTime) {
MikeSpreitzer (Owner):

The change here is that previously the return would happen if the two times are equal, and now the return will not happen in that case. Why is this an improvement?
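(For reference, the difference only shows up when the two times are exactly equal; a tiny illustration with time.Time, assuming the relevanceTime type behaves the same way:)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	a := time.Now()
	b := a // equal times

	fmt.Println(!a.Before(b)) // true:  the old condition also fires in the equal case
	fmt.Println(a.After(b))   // false: the new condition does not
}
```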

matte21 (Collaborator, Author):

We get rid of a negation; what you wrote is true, but the "equal" case should be rare in practice. I'm OK with restoring the previous version though.

MikeSpreitzer (Owner):

I would prefer to keep the original treatment of the equal case. The negation does not bother me as much as doing unnecessary logic.

matte21 (Collaborator, Author):

Fixed in 5c55ad8.


@@ -107,14 +110,24 @@ func setupPrometheusMetrics() {
Namespace: metricsNamespace,
Subsystem: metricsSubsystem,
Name: "subnet_create_to_validated_latency_seconds",
Help: "Latency from subnet CreationTimestamp to return from update writing validation outcome in status per outcome, in seconds.",
Help: "Latency from subnet NASectionSpec to return from update writing validation outcome in status per outcome, in seconds.",
MikeSpreitzer (Owner):

s/NA/Subnet/ here

matte21 (Collaborator, Author):

Fixed in 5c55ad8.

(resolved review thread on staging/kos/pkg/apis/network/types.go)
@MikeSpreitzer (Owner) commented Jun 25, 2020

I have two comments regarding cutting down on the number of histograms of ((last controller start) - (last client write)).

  1. Can't we apply the same idea to the IPAM controller, suppressing its recording of ((SV start) - (Subnet create))?

  2. Regarding the delays for a remote NA, I have to take a step back. Upon reconsidering, I still see 6 possible values for (last controller start): there are two IP address assignments involved, and possibly two subnet validations involved, so four rather than two possible controller starts there, plus the two CA starts for a total of 6. From the 6x4 histograms, I can see suppressing 6+6: when last controller start and last client write are both about the same NA, the delay will be recorded as a delay in creating the local implementation for that NA. Consider the following example from the remaining 12 histograms. Suppose NA A is placed on Node X and NA B is placed on Node Y. Suppose the last controller start is the IPAM controller run that assigned A's IP address. Suppose last client write is creation of NA B. Y's CA will not record this delay as part of creating its local implementation of B, nor will X's CA record this delay as part of creating its local implementation of A (although something will record a larger delay: (IPAM A start) - (NA A creation)). If we are willing to accept recording only the larger delay, then we can subtract 12 more histograms, getting down to zero.

@matte21 (Collaborator, Author) commented Jun 25, 2020

> I do not see why we get to assume that controller failures are independent. Particularly when we bundle some together in one pod.
>
> But I do not think your main idea requires assuming that controller failures are independent.

Yes, forget about that claim.

> 1. Can't we apply the same idea to the IPAM controller, suppressing its recording of ((SV start) - (Subnet create))?

If a subnet is validated with a delay, that delay affects a potentially high number of NAs, and it's useful to know that number. Discarding that histogram from the IPAM makes that impossible. One might correctly argue that we're still suppressing histograms without which we cannot count the number of delayed NAs. For instance, there's no histogram to track the delay for remote network interfaces on a node N given by (IPAM Start) - (1st local NA creation time), on the grounds that such a delay is already tracked by a histogram on the delay in IP assignments at the IPAM controller. But that delay does not delay only the address assignment of the 1st local NA on N (say X); it also delays the creation of the remote network interfaces of some of the NAs with X's VNI that were created before X, and potentially there are many of them, and it would be nice to know that number. But knowing that number requires too many histograms, because we can have a lot of CAs, so we choose to lose information. In the case of the IPAM controller we need not worry about reducing the number of metrics, because it's a singleton. I do see the counterargument that this makes the design inconsistent, though.

> 2. Regarding the delays for a remote NA, I have to take a step back. Upon reconsidering, I still see 6 possible values for (last controller start): ...

I do not see the advantage/necessity of discerning the IPAM/SV starts for the two NAs. I see no practical benefit (e.g., reduction of the number of metrics, more precise recording of delays, etc.). Conceptually, ignoring the distinction is not unsound: what matters is that, if one of the controllers was not running when the last client write happened, there will be a delay and we record it; it doesn't really matter which implementation step was delayed (e.g., assignment of NA A's IP vs assignment of NA B's IP).
A nit: in the example you described, delays are recorded by the IPAM controller, not by the CAs as written.

@matte21 (Collaborator, Author) left a comment:

Not done yet answering all your comments... but some answers are already there.

@MikeSpreitzer (Owner) commented:
I have been trying to cut down on the number of histograms reported. I recognize that this loses information about counts. As you noted, we already are losing information about counts. I agree we can keep a high level of ambition when it adds few histograms (as in the case of the IPAM controller noting delays due to subnet validator start), even though this is not totally consistent with giving up on some other precision regarding counts.

The value of distinguishing all six relevant controller starts in the remote NA case is added precision in explanations and in thinking about the problem. Yes, we can lose information as we put the observations into histograms, as long as we clearly keep the distinction in the documentation and discussion.

Regarding the nit noted at the end of #119 (comment), I updated the original text to more clearly express the imprecision that I meant.

@matte21 (Collaborator, Author) commented Jul 9, 2020

Let me recap the discussion before we go ahead and code.

What we agreed upon so far is that there will be no suppression of IPAM histograms, not even those which track a delay already measured by the subnet validator. We're accepting the inconsistency/ugliness that this entails.
Another thing we've implicitly agreed upon is that there will be no changes in the metrics produced by the subnet validator, the IPAM controller and the processing of local NAs.

Things are less defined for the delay affecting remote interfaces.
If we want to be thorough, we must track delays distinguishing all 6 controller starts. This makes for 6x4 histograms per CA, which hinders scalability too much.
So we want to suppress histograms that contain delays that are already tracked: this allows removing 12 histograms (those for which the last client write and last controller start are about the same NA). The drawback of this suppression is that in some cases a delay that is counted only once could affect multiple NAs; that's not ideal, but we're willing to pay the price.
The next thing we can do to reduce the number of metrics is suppressing histograms that record delays for which an upper bound is already observed; the drawback is the same as for suppressing histograms on delays that are already recorded, plus the fact that an upper bound is, by definition, less accurate than the bounded value.
These histograms are for sure the 3x2 histograms where the last client write is from the 1st local NA and the last controller start is from the remote NA.
Following the same logic, we can also suppress the 2x2 histograms where the last client write is from the remote NA and the last controller start is either the IPAM or the subnet validator that processed the 1st local NA (or its subnet).
We CANNOT suppress the 2x1 histograms where the last client write is from the remote NA and the last controller start is the CA where the remote interface is created, because that CA might be different from the one that implemented the 1st local NA (and recorded the corresponding delay, which would serve as the upper bound here), even if they run on the same node (see this comment for an example). This is another inconsistency/ugliness.

I'm afraid I don't have a clear-cut opinion.
I'm ok with suppressing as many metrics as possible if we are aware of what information we're losing.
OTOH, that clashes with the idea of distinguishing all 6 relevant controller starts to have metrics that give us a better understanding. One thing in favor of being precise to keep in mind is that, while the number of histograms tracking the delay is potentially extremely high, they are materialized lazily (this is possible because the Grafana dashboards never invoke PromQL's rate() on those histograms), i.e., only after a CA failure occurs and if NAs are created during the downtime. In a real-world cloud data center there would certainly be many CA failures after some time, but in percentage they would be a small subset of the nodes (unless an unexpected, severe event occurs, which is possible of course). For example, under the hypothesis of 26 metrics per CA tracking the delay (2x1 = 2 for local interfaces + 6x4 = 24 for remote interfaces) in a cluster with 100k CA nodes, if after some time 10k nodes have failed, there would be 260k additional active metrics tracking delays (this is not precise, but I think I remember reading somewhere that Prometheus can handle up to around 2 million active time series).
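A back-of-the-envelope check of those figures (illustrative only; it counts one metric per histogram, ignoring that each Prometheus histogram itself expands into several time series):

```go
package main

import "fmt"

func main() {
	perCA := 2*1 + 6*4    // 2 local-interface + 24 remote-interface histograms = 26
	failedNodes := 10_000 // e.g. 10% of a 100k-node cluster
	fmt.Println(perCA * failedNodes) // 260000 lazily materialized histograms
}
```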

@matte21 (Collaborator, Author) commented Jul 28, 2020

I just re-read the whole discussion.

I now fully agree that it's better to discern the starts of controllers that processed the 1st local NA (or its subnet) from the starts of the controllers that processed the remote NA (or its subnet) when recording latencies.
Without that distinction, there are cases where it is impossible to tell whether a latency that we skip observing, on the grounds that it was previously observed in a histogram at a controller earlier in the pipeline, was actually observed at that earlier controller.

I am also ok with suppressing observations for which an upper bound was already observed, even if this entails accepting an inconsistency in the design. This means that all histograms on the delay for remote interfaces can be suppressed, with two exceptions. More precisely, if a NA X is remote with respect to a node N and a remote network interface for X is created on N, the two histograms that cannot be suppressed are the ones where the last client write is for either X or X's subnet and the last controller start is the CA on N. An example of why is explained here. I am ok with keeping only those two histograms. However, notice that this inconsistency might look confusing to an operator looking at the dashboards; there would be only two dashboards while intuitively there should be 24.

@MikeSpreitzer (Owner) commented:
I do not follow the reduction from 100K to 10K in the last paragraph of #119 (comment). We have been planning to populate all the histograms from the start, so with a 6x4 vector of histograms there will be 24 populated histograms for each node. Even if nobody looks at a Grafana panel that queries them, their data will be scraped and stored.

I do not like scrunching down the published controller delay metrics to only 2x1 histograms for remote NAs. But I think it may be the most practical approach we can think of right now. I think it is good enough to proceed with.

@matte21 (Collaborator, Author) commented Jul 30, 2020

> I do not like scrunching down the published controller delay metrics to only 2x1 histograms for remote NAs. But I think it may be the most practical approach we can think of right now. I think it is good enough to proceed with.

I share both your dislike for the reduction (and the inconsistency it entails) and your conclusion that, as of now, it is an acceptable solution. I will proceed with it.

I am starting to get the feeling that Prometheus is inadequate for our needs and that we'll have a really hard time finding a precise yet scalable solution.
I may have come across an alternative for us: Cortex.
It retains PromQL and the Prometheus data model but claims to be scalable. IIUC it's just machinery on top of multiple Prometheus instances, which makes me think that the re-engineering effort to migrate KOS from Prometheus to Cortex would be limited.

> I do not follow the reduction from 100K to 10K in the last paragraph of #119 (comment). We have been planning to populate all the histograms from the start, so with a 6x4 vector of histograms there will be 24 populated histograms for each node. Even if nobody looks at a Grafana panel that queries them, their data will be scraped and stored.

Currently histograms are populated lazily, and I was not planning on changing that. If you assume histograms are populated lazily, the argument about the reduction from 100K to 10K in the last paragraph of #119 (comment) makes sense. However, I just thought about it and realized that, while lazy population is OK for the histograms we currently have and for those we will have after the reduction to only 2x1 histograms, there are many histograms in the 6x4 vector that we are currently suppressing that would not work well with lazy population.

@matte21 (Collaborator, Author) commented Aug 27, 2020

> We CANNOT suppress the 2x1 histograms where the last client write is from the remote NA and the last controller start is the CA where the remote interface is created, because that CA might be different from the one that implemented the 1st local NA (and recorded the corresponding delay, which would serve as the upper bound here), even if they run on the same node (see this comment for an example). This is another inconsistency/ugliness.

On further reflection, there are 4x1 histograms we cannot suppress. Regardless of what the last client write is, if the last controller to start is the CA that creates the remote interface, there is no guarantee that an upper-bound delay has already been recorded; the argument is the same as the one in the quote above.

Tweak the way we record the delay due to downtime for remote
interfaces: only the delays where the last controller start
is the remote connection agent are tracked because we want
to minimize the number of metrics (for scalability) and those
are the only delays which are not guaranteed to have been already
tracked (either directly or indirectly via an upper-bound).
Move the helper functions for new API types (ExtendedObjectMeta and
children) to a dedicated package to avoid changing the API every time
a helper implementation changes.

Helper functions dealing with API types that were already there
are not moved, to keep the PR focused; that is left for the future.
@matte21 force-pushed the track-delay-caused-by-controllers-downtime branch from 8a483a4 to 6a67d3b on September 6, 2020 at 15:17
@matte21 (Collaborator, Author) commented Sep 6, 2020

6fba2d2 reduces the metrics on remote interface creation delay to the 4x1 histograms described in my previous comment.
