New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Response Ops][Task Manager] Only resetting TM metrics when metrics_reset_interval
has passed since last reset
#182027
Conversation
bbf65aa
to
70b962e
Compare
metrics_reset_interval
has passed since last reset
Pinging @elastic/response-ops (Team:ResponseOps) |
I instrumented the new code to see when it was calculating whether to reset or not, and saw some interesting results:
The two calls are I assume from the double polls (everything-but-reports, then reports). Curious that most of the time the first one resets, the second one doesn't. \o/. But sometimes not, like at 16:35:01. I don't think this is an issue with this code, but is interesting! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Per other comment in the PR, the "double path" through here because of the double polls pretty much works, though I guess there's a bit of a race condition there when it resets on both. I don't think that's a problem - resetting twice so close together should pretty much be a no-op, right? Or would we drop something the first one might have sent?
Beyond that, I let it run, and then did some metric?result=true
calls, which I could force at various times by watching the logs. Worked fine!
I'm not sure what you mean by the double polls...where did you add the log messages? Wondering if there's another bug in this code? 🙈 |
@elasticmachine merge upstream |
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]
History
To update your PR or re-run it, just comment with: cc @ymao1 |
Did some more testing. The double logs are due to the two different aggregators we create. Each one maintains its own |
…reset_interval` has passed since last reset (elastic#182027) The task manager metrics are collected in a monotonically increasing manner and they can be reset one of two ways: (1) When the metrics are collected via the HTTP API (using an external collection mechanism like metricbeat) and (2) when the configured `metrics_reset_interval` has passed. The second way is there as a fail-safe in case the external collection mechanism fails, however there was a bug where the metrics would always get reset when the `metrics_reset_interval` had passed, instead of only if the `metrics_reset_interval` had passed **since the last collection**. This bug could be causing metrics to be unnecessarily reset even if they're regularly being collected (and reset via that mechanism). Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
The task manager metrics are collected in a monotonically increasing manner and they can be reset one of two ways: (1) When the metrics are collected via the HTTP API (using an external collection mechanism like metricbeat) and (2) when the configured
metrics_reset_interval
has passed. The second way is there as a fail-safe in case the external collection mechanism fails, however there was a bug where the metrics would always get reset when themetrics_reset_interval
had passed, instead of only if themetrics_reset_interval
had passed since the last collection. This bug could be causing metrics to be unnecessarily reset even if they're regularly being collected (and reset via that mechanism).