[Response Ops][Task Manager] Only resetting TM metrics when `metrics_reset_interval` has passed since last reset #182027

ymao1 · 2024-04-29T19:33:01Z

The task manager metrics are collected in a monotonically increasing manner and they can be reset one of two ways: (1) When the metrics are collected via the HTTP API (using an external collection mechanism like metricbeat) and (2) when the configured metrics_reset_interval has passed. The second way is there as a fail-safe in case the external collection mechanism fails, however there was a bug where the metrics would always get reset when the metrics_reset_interval had passed, instead of only if the metrics_reset_interval had passed since the last collection. This bug could be causing metrics to be unnecessarily reset even if they're regularly being collected (and reset via that mechanism).

…_reset_interval

elasticmachine · 2024-04-29T22:14:29Z

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr · 2024-05-01T16:47:20Z

I instrumented the new code to see when it was calculating whether to reset or not, and saw some interesting results:

2024-05-01T16:34:01.765Z tm metrics: resetting because interval has passed
2024-05-01T16:34:01.769Z tm metrics: not resetting because interval has not passed

2024-05-01T16:34:31.765Z tm metrics: not resetting because interval has not passed
2024-05-01T16:34:31.769Z tm metrics: resetting because interval has passed

2024-05-01T16:35:01.764Z tm metrics: resetting because interval has passed
2024-05-01T16:35:01.770Z tm metrics: resetting because interval has passed

2024-05-01T16:35:31.765Z tm metrics: resetting because interval has passed
2024-05-01T16:35:31.769Z tm metrics: not resetting because interval has not passed

The two calls are I assume from the double polls (everything-but-reports, then reports). Curious that most of the time the first one resets, the second one doesn't. \o/. But sometimes not, like at 16:35:01.

I don't think this is an issue with this code, but is interesting!

pmuellr

LGTM

Per other comment in the PR, the "double path" through here because of the double polls pretty much works, though I guess there's a bit of a race condition there when it resets on both. I don't think that's a problem - resetting twice so close together should pretty much be a no-op, right? Or would we drop something the first one might have sent?

Beyond that, I let it run, and then did some metric?result=true calls, which I could force at various times by watching the logs. Worked fine!

ymao1 · 2024-05-01T20:22:02Z

double polls

I'm not sure what you mean by the double polls...where did you add the log messages? Wondering if there's another bug in this code? 🙈

ymao1 · 2024-05-02T13:43:48Z

@elasticmachine merge upstream

kibana-ci · 2024-05-02T15:26:22Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: fcf7db1

Failed CI Steps

FTR Configs #64

Test Failures

[job] [logs] FTR Configs #64 / Action Task Params migrations "before all" hook for "7.16.0 migrates action_task_params to use references array"

Metrics [docs]

✅ unchanged

History

💛 Build #206738 was flaky 70b962e
💔 Build #206730 failed bbf65aa

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

ymao1 · 2024-05-02T18:00:37Z

Did some more testing. The double logs are due to the two different aggregators we create. Each one maintains its own lastResetTime. There are times when the observable interval events occur at slightly less than the interval (for a 60,000 millisecond interval, sometimes the observable will fire at 59,994 milliseconds) which is why without metrics collection, sometimes it won't reset the values but worse case, it'll go 2x the metrics reset interval without resetting.

…reset_interval` has passed since last reset (elastic#182027) The task manager metrics are collected in a monotonically increasing manner and they can be reset one of two ways: (1) When the metrics are collected via the HTTP API (using an external collection mechanism like metricbeat) and (2) when the configured `metrics_reset_interval` has passed. The second way is there as a fail-safe in case the external collection mechanism fails, however there was a bug where the metrics would always get reset when the `metrics_reset_interval` had passed, instead of only if the `metrics_reset_interval` had passed **since the last collection**. This bug could be causing metrics to be unnecessarily reset even if they're regularly being collected (and reset via that mechanism). Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Only resetting metrics if metrics have not been collected for metrics…

70b962e

…_reset_interval

ymao1 force-pushed the task-manager-metrics-reset-bug branch from bbf65aa to 70b962e Compare April 29, 2024 20:01

ymao1 self-assigned this Apr 29, 2024

ymao1 added release_note:skip Skip the PR/issue when compiling release notes Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.15.0 labels Apr 29, 2024

ymao1 changed the title ~~Only resetting metrics if metrics have not been collected for metrics…~~ [Response Ops][Task Manager] Only resetting TM metrics when metrics_reset_interval has passed since last reset Apr 29, 2024

ymao1 marked this pull request as ready for review April 29, 2024 22:14

ymao1 requested a review from a team as a code owner April 29, 2024 22:14

ymao1 requested review from pmuellr and adcoelho April 29, 2024 22:14

pmuellr approved these changes May 1, 2024

View reviewed changes

Merge branch 'main' into task-manager-metrics-reset-bug

fcf7db1

ymao1 merged commit 8966bee into elastic:main May 2, 2024
37 checks passed

kibanamachine added the backport:skip This commit does not require backporting label May 2, 2024

ymao1 deleted the task-manager-metrics-reset-bug branch May 2, 2024 18:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Response Ops][Task Manager] Only resetting TM metrics when `metrics_reset_interval` has passed since last reset #182027

[Response Ops][Task Manager] Only resetting TM metrics when `metrics_reset_interval` has passed since last reset #182027

ymao1 commented Apr 29, 2024 •

edited

elasticmachine commented Apr 29, 2024

pmuellr commented May 1, 2024

pmuellr left a comment

ymao1 commented May 1, 2024

ymao1 commented May 2, 2024

kibana-ci commented May 2, 2024

ymao1 commented May 2, 2024

[Response Ops][Task Manager] Only resetting TM metrics when metrics_reset_interval has passed since last reset #182027

[Response Ops][Task Manager] Only resetting TM metrics when metrics_reset_interval has passed since last reset #182027

Conversation

ymao1 commented Apr 29, 2024 • edited

elasticmachine commented Apr 29, 2024

pmuellr commented May 1, 2024

pmuellr left a comment

Choose a reason for hiding this comment

ymao1 commented May 1, 2024

ymao1 commented May 2, 2024

kibana-ci commented May 2, 2024

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

Metrics [docs]

History

ymao1 commented May 2, 2024

[Response Ops][Task Manager] Only resetting TM metrics when `metrics_reset_interval` has passed since last reset #182027

[Response Ops][Task Manager] Only resetting TM metrics when `metrics_reset_interval` has passed since last reset #182027

ymao1 commented Apr 29, 2024 •

edited