Enhancement: Sketch value aggregator performance #13020

davecromberge · 2024-04-29T10:24:18Z

This addresses excessive resource consumption and performance problems with the following Apache Datasketches value aggregators:

Theta
CPC
Tuple (Integer)

When merging sketches via union, it is often a costly operation that creates intermediate book-keeping structures. In addition, the operation performs better when the cost is amortized over many sketch inputs instead of pairwise merges on two sketches at a time.

When tested in a production environment, I measured a median reduction of 10x in the time spent creating StarTrees.

Related to issue:
#13019

release-notes

Signature changes to Theta, Tuple and CPC value aggregators; now use Object as the aggregated value type

Use union as intermediate state instead of Sketch.

codecov-commenter · 2024-04-29T10:59:34Z

Codecov Report

Attention: Patch coverage is 0% with 129 lines in your changes are missing coverage. Please review.

Project coverage is 35.05%. Comparing base (59551e4) to head (a504757).
Report is 375 commits behind head on master.

Files	Patch %	Lines
...gator/DistinctCountThetaSketchValueAggregator.java	0.00%	52 Missing ⚠️
...regator/DistinctCountCPCSketchValueAggregator.java	0.00%	40 Missing ⚠️
.../aggregator/IntegerTupleSketchValueAggregator.java	0.00%	37 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##             master   #13020       +/-   ##
=============================================
- Coverage     61.75%   35.05%   -26.70%     
+ Complexity      207        6      -201     
=============================================
  Files          2436     2427        -9     
  Lines        133233   133089      -144     
  Branches      20636    20631        -5     
=============================================
- Hits          82274    46653    -35621     
- Misses        44911    83044    +38133     
+ Partials       6048     3392     -2656

Flag	Coverage Δ
custom-integration1	`?`
integration	`<0.01% <0.00%> (-0.01%)`	⬇️
integration1	`<0.01% <0.00%> (-0.01%)`	⬇️
integration2	`0.00% <0.00%> (ø)`
java-11	`0.00% <0.00%> (-61.71%)`	⬇️
java-21	`35.05% <0.00%> (-26.58%)`	⬇️
skip-bytebuffers-false	`35.04% <0.00%> (-26.71%)`	⬇️
skip-bytebuffers-true	`35.01% <0.00%> (+7.28%)`	⬆️
temurin	`35.05% <0.00%> (-26.70%)`	⬇️
unittests	`46.57% <0.00%> (-15.18%)`	⬇️
unittests1	`46.57% <0.00%> (-0.32%)`	⬇️
unittests2	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…tches

gortiz · 2024-04-30T09:25:09Z

...ava/org/apache/pinot/segment/local/aggregator/DistinctCountCPCSketchValueAggregatorTest.java

-    CpcSketch input = new CpcSketch();
-    IntStream.range(0, 1000).forEach(input::update);
+    CpcSketch input = new CpcSketch(12);
+    IntStream.range(0, 100).forEach(input::update);


What is the impact of this change? decreasing test initialization time?

The sketch estimate was off by one on the previous configuration. Rather than introducing a delta for the variable error, I decreased the number of inputs making the test more deterministic.

gortiz · 2024-04-30T09:27:30Z

The PR looks fine to me, but it is a bit strange to approve a performance wise PR without an actual benchmark we can reproduce. Could you include a JMH benchmark as part of the PR?

davecromberge · 2024-05-01T07:50:05Z

The PR looks fine to me, but it is a bit strange to approve a performance wise PR without an actual benchmark we can reproduce. Could you include a JMH benchmark as part of the PR?

Sure, that is a fair request. I would propose running the StarTree test under a JMH benchmark under previous and current configuration and including the results in this PR. How does that sound? Or, are you looking for a JMH test as part of the repo please could you point me to an existing test that would serve as a good example.

davecromberge · 2024-05-01T07:54:58Z

@gortiz as a starting point, here are some empirical measurements that I took from a run in our production environment. The build time refers to the StarTree build time for a segment and the number of records refers to the number of records to aggregate.

The median differs by a factor of 10.

davecromberge added 4 commits April 29, 2024 11:18

Performance improvements: Theta sketch value aggregator

4ee591e

Use union as intermediate state instead of Sketch.

Performance improvements: Integer Tuple sketch value aggregator

d494533

Use union as intermediate state instead of Sketch.

Performance improvements: CPC sketch value aggregator

3b89eeb

Use union as intermediate state instead of Sketch.

Fix checkstyle violations

5c0cf24

davecromberge marked this pull request as draft April 29, 2024 11:10

davecromberge added 2 commits April 29, 2024 12:44

Update test max capacity values

191b5ad

Incorporate theta error rate into startree tests for Theta family ske…

0b24c22

…tches

davecromberge marked this pull request as ready for review April 29, 2024 13:02

davecromberge added 2 commits April 29, 2024 16:40

Use estimated byte size capacity at 2^14

3e13303

Adjust for CPC estimation error

a504757

Jackie-Jiang added release-notes Referenced by PRs that need attention when compiling the next release notes performance labels Apr 29, 2024

gortiz reviewed Apr 30, 2024

View reviewed changes

Jackie-Jiang approved these changes May 1, 2024

View reviewed changes

Jackie-Jiang merged commit ad70686 into apache:master May 1, 2024
20 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancement: Sketch value aggregator performance #13020

Enhancement: Sketch value aggregator performance #13020

davecromberge commented Apr 29, 2024 •

edited

codecov-commenter commented Apr 29, 2024 •

edited

gortiz Apr 30, 2024

davecromberge May 1, 2024

gortiz commented Apr 30, 2024

davecromberge commented May 1, 2024 •

edited

davecromberge commented May 1, 2024

Enhancement: Sketch value aggregator performance #13020

Enhancement: Sketch value aggregator performance #13020

Conversation

davecromberge commented Apr 29, 2024 • edited

codecov-commenter commented Apr 29, 2024 • edited

Codecov Report

gortiz Apr 30, 2024

Choose a reason for hiding this comment

davecromberge May 1, 2024

Choose a reason for hiding this comment

gortiz commented Apr 30, 2024

davecromberge commented May 1, 2024 • edited

davecromberge commented May 1, 2024

davecromberge commented Apr 29, 2024 •

edited

codecov-commenter commented Apr 29, 2024 •

edited

davecromberge commented May 1, 2024 •

edited