[Metricbeat] Improve the elasticsearch module when used for Stack Monitoring #39058

Open
consulthys opened this issue Apr 18, 2024 · 0 comments
Labels
Team:Infra Monitoring UI Infrastructure Monitoring UI team Team:Monitoring Stack Monitoring team

Comments

@consulthys
Contributor

consulthys commented Apr 18, 2024

While investigating the root cause of indexing failures (also reported here in the past), we discovered that when Metricbeat is used to feed Stack Monitoring, its elasticsearch module ships elasticsearch.shard documents with explicit IDs built from the current cluster state identifier (i.e., state_uuid) and some other constant data. Since the cluster state changes far less often than Metricbeat runs its collection rounds (every 10s by default), most rounds re-send documents whose IDs already exist, so version conflicts happen all the time.

Those version conflicts are probably a side effect of the switch to data streams in 8.0.0 (i.e., put-if-absent semantics with an explicit ID) and weren't apparent earlier, when the data was stored in regular indices. Since each elasticsearch.shard document describes a shard placement in the cluster, the logic makes sense: there's no point re-indexing a document whose content hasn't changed since the last collection round.

However, we could/should go one step further and detect whether the cluster state has changed between two collection rounds. I'm naively thinking about "simply" comparing the old and new state_uuid, but it might be more involved than that. Either way, if nothing has changed, there's no point in even rebuilding those documents and sending them again: we know they will bounce, generate a version conflict, and increase the indexing failure counter for no reason. It also wastes network bandwidth, as well as CPU/RAM resources on the ES side. For big clusters with many thousands of shards, that can make a big difference.
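The naive comparison could be sketched in Go roughly as follows. This is not the module's actual code: `stateTracker` and `Changed` are hypothetical names, and the sketch assumes the state_uuid has already been fetched for the current round.

```go
package main

import "fmt"

// stateTracker is a hypothetical helper that remembers the state_uuid
// observed in the previous collection round.
type stateTracker struct {
	lastUUID string
	seen     bool // false until the first round has been recorded
}

// Changed reports whether stateUUID differs from the one recorded in the
// previous round, and records stateUUID for the next call. The first call
// always reports a change so that the initial documents are sent.
func (t *stateTracker) Changed(stateUUID string) bool {
	if t.seen && stateUUID == t.lastUUID {
		return false
	}
	t.lastUUID = stateUUID
	t.seen = true
	return true
}

func main() {
	tracker := &stateTracker{}
	// Simulated state_uuid values over four collection rounds.
	rounds := []string{"state-A", "state-A", "state-B", "state-B"}
	for i, uuid := range rounds {
		if tracker.Changed(uuid) {
			fmt.Printf("round %d: state changed, build and send shard documents\n", i+1)
		} else {
			fmt.Printf("round %d: state unchanged, skip shard documents\n", i+1)
		}
	}
}
```

One thing this sketch glosses over: it records the new state_uuid before anything is sent, so a failed publish would never be retried. A real implementation would probably record the UUID only after the documents were accepted, which is one way the check could be "more involved" than a plain comparison.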

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Apr 18, 2024
@cmacknz cmacknz added Team:Monitoring Stack Monitoring team Team:Infra Monitoring UI Infrastructure Monitoring UI team labels Apr 23, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Apr 23, 2024