Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Response Ops][Task Manager] Only resetting TM metrics when metrics_reset_interval has passed since last reset #182027

Merged
merged 2 commits into from May 2, 2024

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Apr 29, 2024

The task manager metrics are collected in a monotonically increasing manner and they can be reset one of two ways: (1) When the metrics are collected via the HTTP API (using an external collection mechanism like metricbeat) and (2) when the configured metrics_reset_interval has passed. The second way is there as a fail-safe in case the external collection mechanism fails, however there was a bug where the metrics would always get reset when the metrics_reset_interval had passed, instead of only if the metrics_reset_interval had passed since the last collection. This bug could be causing metrics to be unnecessarily reset even if they're regularly being collected (and reset via that mechanism).

@ymao1 ymao1 force-pushed the task-manager-metrics-reset-bug branch from bbf65aa to 70b962e Compare April 29, 2024 20:01
@ymao1 ymao1 self-assigned this Apr 29, 2024
@ymao1 ymao1 added release_note:skip Skip the PR/issue when compiling release notes Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.15.0 labels Apr 29, 2024
@ymao1 ymao1 changed the title Only resetting metrics if metrics have not been collected for metrics… [Response Ops][Task Manager] Only resetting TM metrics when metrics_reset_interval has passed since last reset Apr 29, 2024
@ymao1 ymao1 marked this pull request as ready for review April 29, 2024 22:14
@ymao1 ymao1 requested a review from a team as a code owner April 29, 2024 22:14
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1 ymao1 requested review from pmuellr and adcoelho April 29, 2024 22:14
@pmuellr
Copy link
Member

pmuellr commented May 1, 2024

I instrumented the new code to see when it was calculating whether to reset or not, and saw some interesting results:

2024-05-01T16:34:01.765Z tm metrics: resetting because interval has passed
2024-05-01T16:34:01.769Z tm metrics: not resetting because interval has not passed

2024-05-01T16:34:31.765Z tm metrics: not resetting because interval has not passed
2024-05-01T16:34:31.769Z tm metrics: resetting because interval has passed

2024-05-01T16:35:01.764Z tm metrics: resetting because interval has passed
2024-05-01T16:35:01.770Z tm metrics: resetting because interval has passed

2024-05-01T16:35:31.765Z tm metrics: resetting because interval has passed
2024-05-01T16:35:31.769Z tm metrics: not resetting because interval has not passed

The two calls are I assume from the double polls (everything-but-reports, then reports). Curious that most of the time the first one resets, the second one doesn't. \o/. But sometimes not, like at 16:35:01.

I don't think this is an issue with this code, but is interesting!

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Per other comment in the PR, the "double path" through here because of the double polls pretty much works, though I guess there's a bit of a race condition there when it resets on both. I don't think that's a problem - resetting twice so close together should pretty much be a no-op, right? Or would we drop something the first one might have sent?

Beyond that, I let it run, and then did some metric?result=true calls, which I could force at various times by watching the logs. Worked fine!

@ymao1
Copy link
Contributor Author

ymao1 commented May 1, 2024

double polls

I'm not sure what you mean by the double polls...where did you add the log messages? Wondering if there's another bug in this code? 🙈

@ymao1
Copy link
Contributor Author

ymao1 commented May 2, 2024

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #64 / Action Task Params migrations "before all" hook for "7.16.0 migrates action_task_params to use references array"

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@ymao1
Copy link
Contributor Author

ymao1 commented May 2, 2024

Did some more testing. The double logs are due to the two different aggregators we create. Each one maintains its own lastResetTime. There are times when the observable interval events occur at slightly less than the interval (for a 60,000 millisecond interval, sometimes the observable will fire at 59,994 milliseconds) which is why without metrics collection, sometimes it won't reset the values but worse case, it'll go 2x the metrics reset interval without resetting.

@ymao1 ymao1 merged commit 8966bee into elastic:main May 2, 2024
37 checks passed
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label May 2, 2024
@ymao1 ymao1 deleted the task-manager-metrics-reset-bug branch May 2, 2024 18:00
yuliacech pushed a commit to yuliacech/kibana that referenced this pull request May 3, 2024
…reset_interval` has passed since last reset (elastic#182027)

The task manager metrics are collected in a monotonically increasing
manner and they can be reset one of two ways: (1) When the metrics are
collected via the HTTP API (using an external collection mechanism like
metricbeat) and (2) when the configured `metrics_reset_interval` has
passed. The second way is there as a fail-safe in case the external
collection mechanism fails, however there was a bug where the metrics
would always get reset when the `metrics_reset_interval` had passed,
instead of only if the `metrics_reset_interval` had passed **since the
last collection**. This bug could be causing metrics to be unnecessarily
reset even if they're regularly being collected (and reset via that
mechanism).

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.15.0
Projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants