[TraceQL] quantile_over_time 2 of 2 - engine #3633

mdisibio · 2024-04-30T19:31:42Z

What this PR does:
Adds remaining engine and plumbing for quantile_over_time. To dig into a few things:

(1) Calculations and accuracy: This reuses the approach from the metrics summary API, which builds a histogram from power-of-2 buckets (2, 4, 8, 16, etc.) for any int64 attribute (duration is measured in nanoseconds). I think this is a good 90%-useful first approach for its simplicity. It works fairly well for continuous values like duration, but less well for discrete values like HTTP status code. Long-term this should move towards a "native histogram" approach, but I didn't want to introduce that amount of complexity yet.
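
A minimal sketch of the power-of-2 bucketing idea, for illustration only. The names `bucketFor`, `histogram`, `record`, and `quantile` are hypothetical and are not taken from the Tempo code base, and the linear interpolation inside a bucket is an assumption, not necessarily what this PR does.

```go
package main

import (
	"fmt"
	"math/bits"
	"sort"
)

// bucketFor returns the smallest power of two >= v. Values of 2 or less all
// land in bucket 2. Hypothetical helper; assumes v fits in a signed int64.
func bucketFor(v uint64) uint64 {
	if v <= 2 {
		return 2
	}
	return uint64(1) << uint(bits.Len64(v-1))
}

// histogram maps a bucket upper bound to the number of observed values.
type histogram map[uint64]int

// record adds one observation to the bucket that contains v.
func (h histogram) record(v uint64) { h[bucketFor(v)]++ }

// quantile estimates the q-th quantile (0 <= q <= 1) by walking the buckets
// in ascending order and interpolating linearly (an assumption) inside the
// bucket that contains the target rank.
func (h histogram) quantile(q float64) float64 {
	bounds := make([]uint64, 0, len(h))
	total := 0
	for b, c := range h {
		bounds = append(bounds, b)
		total += c
	}
	if total == 0 {
		return 0
	}
	sort.Slice(bounds, func(i, j int) bool { return bounds[i] < bounds[j] })

	target := q * float64(total)
	seen := 0.0
	for _, b := range bounds {
		c := float64(h[b])
		if seen+c >= target {
			lower := float64(b) / 2 // previous power of two
			return lower + (float64(b)-lower)*(target-seen)/c
		}
		seen += c
	}
	return float64(bounds[len(bounds)-1])
}

func main() {
	h := histogram{}
	for _, v := range []uint64{1, 3, 5, 9} {
		h.record(v)
	}
	fmt.Println(h.quantile(0.5)) // estimated median
}
```

Because every value in a bucket is only known to lie between two powers of two, the estimate can be off by up to a factor of 2. That is tolerable for a continuous value like duration but coarse for discrete values like status codes.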

(2) This PR introduces a different variation of a query for each module, i.e. quantile_over_time has to execute 3 different ways across the generators, queriers, and frontend. Previously rate() and count_over_time() were always basic sums, so the combiner in the frontend assumed addition. That is no longer true. The data flow is: the generator turns spans into partial histograms -> the querier sums partials across generators -> the frontend sums partials across jobs and then computes the final quantiles. Mimir and Loki achieve this through query rewrites (example, example). For now, this takes a simpler approach where you pass the AggregateMode to CompileMetricsQuery and it returns a different implementation as needed.
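
The staged data flow can be sketched roughly as follows. This is a toy model under stated assumptions: the types `mode`, `partial`, and `result` and the functions `sum`, `bucketQuantile`, and `combine` are hypothetical and do not mirror the real `AggregateMode` or `CompileMetricsQuery` signatures. The final stage simply returns the upper bound of the bucket holding the target rank.

```go
package main

import (
	"fmt"
	"sort"
)

// mode is a hypothetical stand-in for the aggregate mode described above.
type mode int

const (
	modeSum   mode = iota // querier: sum partial histograms across generators
	modeFinal             // frontend: sum partials across jobs, then compute quantiles
)

// partial maps a bucket upper bound to a span count. In this sketch the
// generators are assumed to have already produced these from raw spans.
type partial map[uint64]int

// sum merges partial histograms bucket by bucket.
func sum(parts ...partial) partial {
	out := partial{}
	for _, p := range parts {
		for b, c := range p {
			out[b] += c
		}
	}
	return out
}

// bucketQuantile returns the upper bound of the bucket that contains the
// q-th ranked value, or 0 for an empty histogram.
func bucketQuantile(p partial, q float64) uint64 {
	bounds := make([]uint64, 0, len(p))
	total := 0
	for b, c := range p {
		bounds = append(bounds, b)
		total += c
	}
	sort.Slice(bounds, func(i, j int) bool { return bounds[i] < bounds[j] })

	target := q * float64(total)
	seen := 0
	for _, b := range bounds {
		seen += p[b]
		if float64(seen) >= target {
			return b
		}
	}
	return 0
}

// result carries either a merged histogram (intermediate stage) or a
// final quantile value (last stage).
type result struct {
	hist     partial
	quantile uint64
}

// combine picks a different behavior depending on the stage it runs in.
func combine(m mode, q float64, parts ...partial) result {
	merged := sum(parts...)
	if m == modeFinal {
		return result{quantile: bucketQuantile(merged, q)}
	}
	return result{hist: merged}
}

func main() {
	genA := partial{4: 3, 8: 1}
	genB := partial{8: 2, 16: 2}
	job1 := combine(modeSum, 0.5, genA, genB).hist // querier stage
	job2 := partial{4: 1, 16: 1}                   // another job's partials
	fmt.Println(combine(modeFinal, 0.5, job1, job2).quantile)
}
```

The point of the sketch is that quantiles cannot be added together, so every stage before the last one has to pass histograms along and only the final stage collapses them into a quantile.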

Which issue(s) this PR fixes:
Fixes #

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

mdisibio · 2024-05-02T14:23:10Z

pkg/traceql/engine_metrics.go

+		return
+	}
+
+	// There is an issue where multiple conditions &&'ed on the same


This fixes a pre-existing, unrelated bug. Queries asserting multiple conditions on the same attribute, like span.http.status_code>=500 && span.http.status_code<600, should only fetch spans between 500 and 599. They actually fetch all spans, since the conditions are checked independently, and it is the second-pass engine logic that filters them down to the correct output. This means that in this case we can't optimize away the second pass as we do for other metrics queries. Ideally this is fixed in the Fetch layer and we bring the optimization back, but it wasn't trivial, so it's better to incur the overhead than to return wrong results.
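
A toy illustration of why the second pass is still needed in this case. The exact fetch-layer behavior is not shown in this thread, so the model below is an assumption: `fetchLoose` keeps a span when any single condition on the attribute matches, and `filterStrict` plays the role of the second-pass engine logic. Both names are hypothetical.

```go
package main

import "fmt"

// cond is a predicate over an HTTP status code.
type cond func(status int) bool

// fetchLoose models a first pass that checks conditions independently
// (assumption): a span is kept if ANY condition matches.
func fetchLoose(spans []int, conds ...cond) []int {
	var out []int
	for _, s := range spans {
		for _, c := range conds {
			if c(s) {
				out = append(out, s)
				break
			}
		}
	}
	return out
}

// filterStrict models the second pass: a span is kept only if ALL
// conditions hold.
func filterStrict(spans []int, conds ...cond) []int {
	var out []int
	for _, s := range spans {
		ok := true
		for _, c := range conds {
			if !c(s) {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	spans := []int{200, 404, 500, 503, 599, 600}
	ge500 := func(s int) bool { return s >= 500 }
	lt600 := func(s int) bool { return s < 600 }

	fetched := fetchLoose(spans, ge500, lt600)
	fmt.Println(len(fetched))                        // everything passes the loose check
	fmt.Println(filterStrict(fetched, ge500, lt600)) // only 5xx remain
}
```

Skipping the strict pass here would count every fetched span, which is the wrong-results case the comment describes.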

modules/frontend/metrics_query_range_handler.go

pkg/traceql/ast.go

mdisibio · 2024-05-06T12:35:13Z

modules/frontend/metrics_query_range_handler_test.go

-		Start: 1,
-		End:   uint64(10000 * time.Second),
-		Step:  uint64(1 * time.Second),
+		Start: uint64(1100 * time.Second),


The test changes here are due to a change in response combining. Previously the combiner did simple addition by timestamp, so only the timestamps populated by jobs were returned. Now the new SimpleAdditionAggregator and HistogramAggregator start with a zero-filled slice covering every expected time slot in the final request range. The test ranges had to be restricted to match the test blocks in the test module so that the output stays readable (not 10,000 zeros). I think this change is good overall, but I'm open to discussion. The question boils down to: if you rate() over a single span in the last hour, should it return 1 data point, or fill in zeros for the whole response?
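
A rough sketch of the zero-filled combining behavior described above. The names `intervals`, `sample`, and `sumZeroFilled` are hypothetical, and the half-open `[start, end)` slot calculation is an assumption, not the aggregator's actual interval logic.

```go
package main

import "fmt"

// intervals returns the number of time slots covering [start, end) at the
// given step, rounding up. Hypothetical helper.
func intervals(start, end, step uint64) int {
	if step == 0 || end <= start {
		return 0
	}
	return int((end - start + step - 1) / step)
}

// sample is one data point returned by a job.
type sample struct {
	ts    uint64
	value float64
}

// sumZeroFilled starts from a zero slice for every expected slot and adds
// each job's samples into the slot for their timestamp. Slots no job
// touched stay at zero instead of being omitted.
func sumZeroFilled(start, end, step uint64, jobs ...[]sample) []float64 {
	out := make([]float64, intervals(start, end, step))
	for _, job := range jobs {
		for _, s := range job {
			if s.ts < start || s.ts >= end {
				continue // outside the requested range
			}
			out[(s.ts-start)/step] += s.value
		}
	}
	return out
}

func main() {
	jobA := []sample{{ts: 2, value: 1}}
	jobB := []sample{{ts: 2, value: 2}, {ts: 4, value: 1}}
	fmt.Println(sumZeroFilled(0, 5, 1, jobA, jobB))
}
```

With the previous addition-by-timestamp behavior only the slots at timestamps 2 and 4 would have been returned. With zero filling, the response length depends only on the request range and step, which is why a 10,000-second test range would have produced 10,000 points.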

mdisibio added 16 commits April 24, 2024 09:26

Add lang/parser support for quantile_over_time, fix missing stringify of metrics and hints

50a6984

First working draft of quantile_over_time implementation

7307d4c

Validate the query in the frontend

589d83f

Histogram accumulate jobs by bucket as they come in instead of at the end. Refactor label handling to be shared. Validate quantiles, support quantile_over_time by any integer attribute

6abee43

Fix language definition to allow both floats or ints for quantiles

923d7d8

Remove layer of proto->seriesset conversion. Fix roundtrip of __bucket label

2ccc75f

Rename to SimpleAdditionCombiner, slight interval calc cleanup

e320bbb

Fix p0 returning 1 instead of minimum value, comments cleanup

b845743

Rename frontend param and fix handling of ints

ee0e5bc

Fix pre-existing bug in metrics optimization when asserting multiple conditions on the same attribute

4c1db95

Fix to support 3 flavors of the metrics pipeline: query-frontend, across blocks, and across generators. Error handling, move code around

656f525

Merge branch 'main' into quantile-engine

81deb1b

Update query_range frontend test for new behavior

cb31ed0

Consolidate histogram code between traceql and traceqlmetrics. quantile_over_time test, code cleanup

b49d994

lint

05c3af4

changelog

dbe0a18

mdisibio marked this pull request as ready for review May 2, 2024 16:22

mdisibio requested review from joe-elliott, annanay25, mapno, kvrhdn, zalegrala, electron0zero, ie-pham and stoewer as code owners May 2, 2024 16:22

mdisibio mentioned this pull request May 2, 2024

[TraceQL Metrics] histogram_over_time #3644

Merged

3 tasks

joe-elliott reviewed May 2, 2024

modules/frontend/metrics_query_range_handler.go (resolved)

pkg/traceql/ast.go (resolved)

mdisibio added 2 commits May 3, 2024 12:31

Redo histograms to set __bucket label to the actual value instead of log2, for several reasons: support non-log2 or native histogram buckets in the future, and where queriers on different versions may have different buckets during rollout

5c79dc2

Revert all changes to traceqlmetrics package, was getting too noisy

3c8507f

joe-elliott approved these changes May 6, 2024

mapno approved these changes May 7, 2024

mdisibio merged commit 8df9670 into grafana:main May 7, 2024
14 checks passed
