Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Align connects metrics to prevent no data scenarios #670

Open
endorama opened this issue Jan 31, 2024 · 3 comments
Open

Align connects metrics to prevent no data scenarios #670

endorama opened this issue Jan 31, 2024 · 3 comments
Labels
enhancement New feature or request has pr plugin PRs related to plugins

Comments

@endorama
Copy link

Hello, while using this library and the kotel plugin for monitoring some consumers of Kafka clusters we noticed a pattern in the connection metrics that is creating some minor issues that we'd like to address.

We have an alert that monitors the value of messaging.kafka.connect_errors.count to inform us in case the number is out of the ordinary. We also want to alert in case this metric stop being sent by the consumers, as that means we have a blind spot in our monitoring that we need to address.

At the moment this metric is emitted only when connection errors happens, while a sibling metric in only emitted when the connection succeed.

func (m *Meter) OnBrokerConnect(meta kgo.BrokerMetadata, _ time.Duration, _ net.Conn, err error) {
node := strnode(meta.NodeID)
attributes := attribute.NewSet(attribute.String("node_id", node))
if err != nil {
m.instruments.connectErrs.Add(
context.Background(),
1,
metric.WithAttributeSet(attributes),
)
return
}
m.instruments.connects.Add(
context.Background(),
1,
metric.WithAttributeSet(attributes),
)
}

Having 2 metrics means that both stop sending metrics at some point for an undefined period of time.

This is mostly not an issues for connects.count, as for working clusters it is emitted by kotel most of the time, but leaves wide gaps for connect_errors.count. From the monitoring tool, is impossible to establish during this gap if the metric stopped being received because there are no errors or because the instrumentation stopped working.

A possible solution for this issue but maintain the semantic, is to rely on a single metric with different attribute. For example, using messaging.kafka.connects.count with an outcome attribute with 2 values, success or failure.

Would you be open to accept a PR that changes the monitor to rely on messaging.kafka.connects.count as single metric with different attributes?

Thank you for reading this issue and for this library!

@twmb
Copy link
Owner

twmb commented Feb 7, 2024

I think adding attributes technically changes things? I don't know how monitoring otel works.

Does this same problem exist for write_errs and read_errs?

I think rather than change anything by default, another option could be introduced,

// WithMergedConnectsMeter merges the `messaging.kafka.connect_errors.count`
// meter into the `messaging.kafka.connect_errors.count` meter, adding an attribute
// "outcome" with the values "success" or "failure". This option may be used because
// ...
func WithMergedConnectsMeter() MeterOpt

@twmb twmb added enhancement New feature or request plugin PRs related to plugins labels Feb 7, 2024
@endorama
Copy link
Author

endorama commented Feb 9, 2024

I think adding attributes technically changes things? I don't know how monitoring otel works.

I'm not sure I understand but let me try to clarify: a metric can have multiple attribute but would still be considered a single metric (source: OpenTelemetry Protocol data model).

Does this same problem exist for write_errs and read_errs?

I would say no because:

  • there are no "write" or "read" metrics to count successes.
  • write and read are 2 separate operations, and in the spirit of controlling metric cardinality it make sense to have them as 2 separate counters. (This opposed to connect and connectErrs where the operation is 1, connecting to the broker but represented by 2 metrics)

I think rather than change anything by default, another option could be introduced,

That would be fine for me and I can submit a PR so we can discuss the implementation.
An alternative could be to consider this a breaking change and create a kotel/v2, duplicating the current plugin and applying the mentioned changes.

Let me know how would you prefer to move forward and thank you for looking into this.

@twmb
Copy link
Owner

twmb commented Feb 22, 2024

I'm not a big fan of v2, especially for only one small change. +1 to a PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request has pr plugin PRs related to plugins
Projects
None yet
Development

No branches or pull requests

2 participants