Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TraceQL] quantile_over_time 2 of 2 - engine #3633

Merged
merged 18 commits into from May 7, 2024

Conversation

mdisibio
Copy link
Contributor

@mdisibio mdisibio commented Apr 30, 2024

What this PR does:
Adds remaining engine and plumbing for quantile_over_time. To dig into a few things:

(1) Calculations and accuracy: This reuses the same approach from the metrics summary api, which creates a histogram based on powers-of-2 buckets, 2,4,8,16, etc for any int64 attribute (duration is considered in nanos). I think this is a good 90%-useful first approach for simplicity. Fairly good for continuous values like Duration, but less so for discrete values like http status code. Long-term this should move towards a "native histogram" approach, but didn't want to introduce that amount of complexity yet.

(2) This PR introduces different variations of a query for each module. I.e. quantile_over_time has to execute 3 different ways between the generator, queriers, and frontend. Previously rate() and count_over_time() were always basic sums, so the combiner in the frontend assumed addition. Not true anymore. The data flow is like this Generator does Spans -> partial histograms -> Querier sums partials across generators -> Frontend sums partials across jobs, and then -> computes final quantiles. Mimir and Loki achieve this through query rewrites example example. For now, this is taking a simpler approach where you pass the AggregateMode to CompileMetricsQuery and it returns a different implementation as needed.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

return
}

// There is an issue where multiple conditions &&'ed on the same
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a pre-existing unrelated bug. Queries asserting multiple conditions on the same attribute like span.http.status_code>=500 && span.http.status_code<600, which should only fetch spans between 500 and 599, actually fetches them all since the conditions are checked independently, and the second pass engine logic filters them to the correct output. It means in this case we can't optimize away the second pass like for other metrics queries. Ideally this is fixed in the Fetch layer and we bring the optimization back, but it wasn't trivial, so better to incur overhead instead of wrong results.

@mdisibio mdisibio marked this pull request as ready for review May 2, 2024 16:22
…log2, for several reasons: support non-log2 or native histogram buckets in the future, and where queriers on different versions may have different buckets during rollout
Start: 1,
End: uint64(10000 * time.Second),
Step: uint64(1 * time.Second),
Start: uint64(1100 * time.Second),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test changes here are due to a change in response combining. Previously the combiner did simple addition by timestamp, so only the timestamps populated by jobs were returned. Now the new SimpleAdditionAggregator and HistogramAggregator start with a zero-slice for all expected time slots in the final request range. The tests had to be restricted to match the test blocks in the test module here, so the output was still readable (not 10,000 zeros). I think this change overall is good but open to discuss. The question boils down to: If you rate() over a single span in the last hour, should it return 1 data point, or fill in the zeros for the whole response?

@mdisibio mdisibio merged commit 8df9670 into grafana:main May 7, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants