Align connects metrics to prevent no data scenarios #670

endorama · 2024-01-31T16:21:20Z

Hello, while using this library and the kotel plugin for monitoring some consumers of Kafka clusters we noticed a pattern in the connection metrics that is creating some minor issues that we'd like to address.

We have an alert that monitors the value of messaging.kafka.connect_errors.count to inform us in case the number is out of the ordinary. We also want to alert in case this metric stop being sent by the consumers, as that means we have a blind spot in our monitoring that we need to address.

At the moment this metric is emitted only when connection errors happens, while a sibling metric in only emitted when the connection succeed.

franz-go/plugin/kotel/meter.go

Lines 233 to 249 in a2d69ce

    
           func (m *Meter) OnBrokerConnect(meta kgo.BrokerMetadata, _ time.Duration, _ net.Conn, err error) { 
        
           	node := strnode(meta.NodeID) 
        
           	attributes := attribute.NewSet(attribute.String("node_id", node)) 
        
           	if err != nil { 
        
           		m.instruments.connectErrs.Add( 
        
           			context.Background(), 
        
           			1, 
        
           			metric.WithAttributeSet(attributes), 
        
           		) 
        
           		return 
        
           	} 
        
           	m.instruments.connects.Add( 
        
           		context.Background(), 
        
           		1, 
        
           		metric.WithAttributeSet(attributes), 
        
           	) 
        
           }

Having 2 metrics means that both stop sending metrics at some point for an undefined period of time.

This is mostly not an issues for connects.count, as for working clusters it is emitted by kotel most of the time, but leaves wide gaps for connect_errors.count. From the monitoring tool, is impossible to establish during this gap if the metric stopped being received because there are no errors or because the instrumentation stopped working.

A possible solution for this issue but maintain the semantic, is to rely on a single metric with different attribute. For example, using messaging.kafka.connects.count with an outcome attribute with 2 values, success or failure.

Would you be open to accept a PR that changes the monitor to rely on messaging.kafka.connects.count as single metric with different attributes?

Thank you for reading this issue and for this library!

The text was updated successfully, but these errors were encountered:

twmb · 2024-02-07T00:23:01Z

I think adding attributes technically changes things? I don't know how monitoring otel works.

Does this same problem exist for write_errs and read_errs?

I think rather than change anything by default, another option could be introduced,

// WithMergedConnectsMeter merges the `messaging.kafka.connect_errors.count`
// meter into the `messaging.kafka.connect_errors.count` meter, adding an attribute
// "outcome" with the values "success" or "failure". This option may be used because
// ...
func WithMergedConnectsMeter() MeterOpt

endorama · 2024-02-09T10:55:42Z

I think adding attributes technically changes things? I don't know how monitoring otel works.

I'm not sure I understand but let me try to clarify: a metric can have multiple attribute but would still be considered a single metric (source: OpenTelemetry Protocol data model).

Does this same problem exist for write_errs and read_errs?

I would say no because:

there are no "write" or "read" metrics to count successes.
write and read are 2 separate operations, and in the spirit of controlling metric cardinality it make sense to have them as 2 separate counters. (This opposed to connect and connectErrs where the operation is 1, connecting to the broker but represented by 2 metrics)

I think rather than change anything by default, another option could be introduced,

That would be fine for me and I can submit a PR so we can discuss the implementation.
An alternative could be to consider this a breaking change and create a kotel/v2, duplicating the current plugin and applying the mentioned changes.

Let me know how would you prefer to move forward and thank you for looking into this.

twmb · 2024-02-22T20:38:43Z

I'm not a big fan of v2, especially for only one small change. +1 to a PR

twmb added enhancement New feature or request plugin PRs related to plugins labels Feb 7, 2024

endorama mentioned this issue Mar 11, 2024

kotel: optionally only use messaging.kafka.connects.count for connection metrics #691

Open

twmb added the has pr label Mar 26, 2024

endorama mentioned this issue May 23, 2024

Leverage connects metrics for success and failure elastic/apm-queue#479

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Align connects metrics to prevent no data scenarios #670

Align connects metrics to prevent no data scenarios #670

endorama commented Jan 31, 2024

twmb commented Feb 7, 2024

endorama commented Feb 9, 2024

twmb commented Feb 22, 2024

Align connects metrics to prevent no data scenarios #670

Align connects metrics to prevent no data scenarios #670

Comments

endorama commented Jan 31, 2024

twmb commented Feb 7, 2024

endorama commented Feb 9, 2024

twmb commented Feb 22, 2024