Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differentiate back-off exceptions from 'real' application errors in Listener Micrometer timer metrics for retry topics #2237

Open
theopigott opened this issue Apr 19, 2022 · 6 comments

Comments

@theopigott
Copy link

Expected Behavior

As per the docs on 'Monitoring Listener Performance', there are Micrometer timers called spring.kafka.listener which are tagged with a result (success or failure) and exception. I would expect the metrics generated with the failure tag to capture true failures (e.g. an IOException from some resource that is used to process records). Any back-off exceptions, which are expected to occur for topics with a delay configured, should be treated separately, e.g. with a different tag value for result or exception.

Current Behavior

A failure timer is recorded whenever a RuntimeException occurs while processing a record. When dealing with retry topics, this includes a KafkaBackoffException which may be thrown inside invokeOnMessage (or the batch equivalent) when the listener determines that the timestamp of the latest record is not ready to be processed yet. The exception is always recorded as ListenerExecutionFailedException so there is no way to differentiate back-off exceptions from other exceptions.

Context

I would like to analyze the listener metrics to gain insight into failures (how often they happen, the performance impact, etc.), but I'm interested in application logic failures (e.g. database is unavailable) rather than expected framework level failures (back-off exceptions). I was surprised to see my metrics indicating many failures despite the application logs showing that all records were successfully processed until I realized that the failures must actually be due to these KafkaBackoffExceptions.

I could implement my own timers/metrics inside my KafkaListener, but I would prefer to be able to use the existing timers that are provided by the framework.

@tomazfernandes
Copy link
Contributor

Hi @theopigott, thanks for bringing this up. It should be easy enough to simply ignore KafkaBackOffExceptions - we do have a SeekUtils.isBackOffException method.

I'm wondering though if it'd be interesting to have separate metrics of the KafkaBackOffException - that would enable us to have metrics such as the number of back offs that happened, or maybe even have more granular information such as the partition that was backed off and the time the message should be processed, or maybe the back off time for each occurrence.

What do you think? Thanks.

@theopigott
Copy link
Author

Thanks @tomazfernandes - yes that makes sense for an easy fix.

It could indeed also be interesting to record metrics based on the KafkaBackOffException too. I think that the number of backoffs and the back off time/sleep duration might be of interest for tuning the retry topic configuration - they would give some indication of how idle the listener is due to delayed topics. I wouldn't generate metrics of the absolute time that the message should be processed as that's difficult to normalize and aggregate.

@tomazfernandes
Copy link
Contributor

Makes sense @theopigott, thanks. Is that something you'd be interested in contributing a PR to? If that's ok with @garyrussell of course.

@garyrussell
Copy link
Contributor

Given that it's a behavior change; it would have to be optional (e.g. ContainerProperties new property), the new behavior can be the default but we need to be able to switch back if users want to.

Adding highly variable tags to meters is not recommended (e.g. actual back off time); some metrics back-ends don't handle such variability well.

@theopigott
Copy link
Author

I think that the back off time would be the value of a metric rather than a tag. Then the 'count' would be the number of back-offs that happened, and the 'sum' (for example) would be the total time sleeping due to back-offs.

But would it make sense to first skip treating KafkaBackoffException as failures (with an option to switch back if needed, as you rightly suggest), and then potentially add further metrics about back-offs later in a separate change/PR?

@garyrussell
Copy link
Contributor

Makes sense; thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants