Optimize unnecessary column copy for HashAgg #8985

guo-shaoge · 2024-04-25T06:09:22Z

What problem does this PR solve?

Issue Number: close #8891

Problem Summary:
When a group-by key also appears in the select list (a.k.a. first_row), TiFlash adds an extra agg func, which causes an unnecessary copy from the HashMap to the Column.

What is changed and how it works?

Basic idea:

Optimization-1 (with collation):
How:

Detect whether the select list already contains a first_row agg func for the group-by key.
If so, skip adding the any agg func. If not, add the any agg func.
Also, set key_from_agg_func to indicate that this key is equivalent to a first_row/any agg func, so later stages can avoid copying this key out of the HashMap (see DAGExpressionAnalyzer::buildAggGroupBy).
If all keys are covered by first_row/any (rare, but it can happen), copying keys is skipped entirely (the template argument skip_serialize_key is true).
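
The detection steps above can be sketched roughly as follows. This is a minimal illustration, not the code in DAGExpressionAnalyzer::buildAggGroupBy: the KeyDesc and AggDesc structs and both helper functions are hypothetical, and only key_from_agg_func and skip_serialize_key echo names used in this description.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified group-by key: column name plus a collation flag.
struct KeyDesc
{
    std::string name;
    bool has_collation;
};

// Hypothetical, simplified aggregate: function name, argument column, result column.
struct AggDesc
{
    std::string func;
    std::string arg;
    std::string result;
};

// For each key with a collation, reuse an existing first_row/any aggregate on
// the same column, or append an `any` aggregate if none exists. The returned
// map (key -> aggregate result column) marks keys that need not be copied out
// of the hash map.
std::unordered_map<std::string, std::string> mapKeysToAggFuncs(
    const std::vector<KeyDesc> & keys,
    std::vector<AggDesc> & aggs)
{
    std::unordered_map<std::string, std::string> key_from_agg_func;
    for (const auto & key : keys)
    {
        if (!key.has_collation)
            continue;
        std::string result;
        for (const auto & agg : aggs)
        {
            if ((agg.func == "first_row" || agg.func == "any") && agg.arg == key.name)
            {
                result = agg.result;
                break;
            }
        }
        if (result.empty())
        {
            result = "any(" + key.name + ")";
            aggs.push_back({"any", key.name, result});
        }
        key_from_agg_func.emplace(key.name, result);
    }
    return key_from_agg_func;
}

// True only when every key is covered by an aggregate, which corresponds to
// the rare case where key serialization can be skipped altogether.
bool canSkipSerializeKey(
    const std::vector<KeyDesc> & keys,
    const std::unordered_map<std::string, std::string> & key_from_agg_func)
{
    for (const auto & key : keys)
        if (key_from_agg_func.find(key.name) == key_from_agg_func.end())
            return false;
    return true;
}
```

With keys c1 and c2 having a collation, c3 having none, and an existing first_row on c1, this sketch reuses first_row for c1, appends any for c2, and leaves c3 to be copied from the HashMap as before.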

Optimization-2 (no collation):
The optimization above applies when c2 has a collation. If c2 has no collation (its type is not string), the process is as follows, and the first_row agg func is deleted.
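
One way to picture the no-collation case is the sketch below. It is an assumption-laden illustration rather than the PR's implementation: the AggFunc struct and dropFirstRowOnPlainKeys helper are hypothetical, and the name agg_func_ref_key is borrowed from a commit message in this PR without claiming its real type or usage.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical, simplified aggregate: function name, argument column, result column.
struct AggFunc
{
    std::string func;
    std::string arg;
    std::string result;
};

// Remove every first_row aggregate whose argument is a group-by key without a
// collation, and record "aggregate result column -> key column" so the output
// can be produced from the key column instead of a separate aggregate state.
std::unordered_map<std::string, std::string> dropFirstRowOnPlainKeys(
    const std::unordered_set<std::string> & keys_without_collation,
    std::vector<AggFunc> & aggs)
{
    std::unordered_map<std::string, std::string> agg_func_ref_key;
    std::vector<AggFunc> kept;
    for (const auto & agg : aggs)
    {
        if (agg.func == "first_row" && keys_without_collation.count(agg.arg) > 0)
            agg_func_ref_key.emplace(agg.result, agg.arg);
        else
            kept.push_back(agg);
    }
    aggs = std::move(kept);
    return agg_func_ref_key;
}
```

The key value stored in the HashMap is already the original value when there is no collation, so in this sketch the first_row result can simply point at the key column.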

Results:

25% improvement
workload:
1. 20M rows. Most rows are different, which means the NDV is very high.
2. 3 varchar columns, 3 decimal columns, 3 int columns (which means the Aggregator uses HashMethodSerialized)

before:

after:

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

Signed-off-by: guo-shaoge <shaoge1994@163.com>

ti-chi-bot · 2024-04-25T06:09:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from guo-shaoge, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

guo-shaoge · 2024-04-28T03:52:26Z

/run-all-tests

guo-shaoge · 2024-04-28T04:09:43Z

/test all

guo-shaoge · 2024-04-29T03:34:46Z

/test all

guo-shaoge · 2024-04-29T03:48:44Z

/test all

guo-shaoge · 2024-04-29T08:16:11Z

/test all

guo-shaoge · 2024-04-29T11:53:24Z

/test pull-integration-tes

ti-chi-bot · 2024-04-29T11:53:27Z

@guo-shaoge: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-integration-test
/test pull-unit-test

Use /test all to run all jobs.

In response to this:

/test pull-integration-tes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

guo-shaoge · 2024-04-29T11:53:29Z

/test pull-integration-test

ti-chi-bot · 2024-04-29T11:53:31Z

@guo-shaoge: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test pull-integration-test
/test pull-unit-test

Use /test all to run all jobs.

In response to this:

/test pull-integration-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SeaRise · 2024-05-06T06:28:23Z

dbms/src/Flash/Coprocessor/DAGExpressionAnalyzer.cpp

@@ -575,6 +588,7 @@ void DAGExpressionAnalyzer::buildAggGroupBy(
            /// need double check this assumption when we support agg with collation
            aggregation_keys.push_back(name);
            agg_key_set.emplace(name);
+            collators.push_back(nullptr);


It seems L591 and L605 have changed the original logic. Was this intended?

Yes, it's expected

So, there was a bug in the previous code?

SeaRise · 2024-05-06T06:42:02Z

dbms/src/Flash/Coprocessor/DAGExpressionAnalyzer.cpp

-                    aggregated_columns,
-                    false,
-                    context);
+                auto [first_row_name, first_row_type] = findFirstRow(aggregate_descriptions, name);


Is first_row_type necessary here? Is the type at L600 always the same as first_row_type? If so, can we just use type?

guo-shaoge · 2024-05-08T07:00:47Z

dbms/src/Flash/Coprocessor/DAGExpressionAnalyzer.cpp

@@ -575,6 +588,7 @@ void DAGExpressionAnalyzer::buildAggGroupBy(
            /// need double check this assumption when we support agg with collation
            aggregation_keys.push_back(name);
            agg_key_set.emplace(name);
+            collators.push_back(nullptr);


Yes, it's expected

guo-shaoge · 2024-05-15T06:34:04Z

dbms/src/Flash/Coprocessor/AggregationInterpreterHelper.cpp

    const Context & context,
    const Block & before_agg_header,
    size_t before_agg_streams_size,
    size_t agg_streams_size,
    const Names & key_names,
+    const KeyRefAggFuncMap & key_ref_agg_func,


Add comments explaining the meaning of the two maps above.

guo-shaoge · 2024-05-15T06:36:31Z

dbms/src/Flash/Coprocessor/AggregationInterpreterHelper.cpp

+    // Before: keys: c1 | c2 | c3
+    // After:  keys: c2 | c1 | c3
+    // By doing this, when deserialize group by keys from HashMap to columns,
+    // we only need to handle c2(convert_key_size == 1) and ignore c1/c3.


Update the comment to clarify that only part of the group-by keys are extracted; the others are skipped and reference the first_row result column instead.
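
The reordering described in the diff comment above can be sketched as follows. The helper name reorderKeys and the keys_covered_by_agg set are hypothetical; only the c1/c2/c3 example and the idea of a convert_key_size count come from the comment itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Move the keys that still have to be deserialized from the hash map to the
// front, preserving relative order within each group. Returns the number of
// keys that need conversion (the analogue of convert_key_size).
std::size_t reorderKeys(
    std::vector<std::string> & keys,
    const std::unordered_set<std::string> & keys_covered_by_agg)
{
    auto first_covered = std::stable_partition(
        keys.begin(),
        keys.end(),
        [&](const std::string & key) { return keys_covered_by_agg.count(key) == 0; });
    return static_cast<std::size_t>(first_covered - keys.begin());
}
```

With keys c1 | c2 | c3 and c1/c3 covered by first_row results, the keys become c2 | c1 | c3 and the returned count is 1, so only the first key needs to be extracted.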

tmp save

60bd8ea

Signed-off-by: guo-shaoge <shaoge1994@163.com>

ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none labels Apr 25, 2024

ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 25, 2024

guo-shaoge force-pushed the optimize_duplicated_agg_func branch from 15ff512 to 60bd8ea Compare April 25, 2024 06:18

guo-shaoge added 5 commits April 25, 2024 18:14

refine for spill(enable_skip_serialize_key arg)

a857c53

Signed-off-by: guo-shaoge <shaoge1994@163.com>

Merge branch 'master' of github.com:pingcap/tiflash into optimize_duplicated_agg_func

ee3d867

fmt

fac85e5

Signed-off-by: guo-shaoge <shaoge1994@163.com>

map -> set

f78ddbf

Signed-off-by: guo-shaoge <shaoge1994@163.com>

tidy

b1c98fa

Signed-off-by: guo-shaoge <shaoge1994@163.com>

ti-chi-bot bot removed the do-not-merge/needs-linked-issue label Apr 28, 2024

guo-shaoge added 3 commits April 28, 2024 10:59

fix

2f6eb26

Signed-off-by: guo-shaoge <shaoge1994@163.com>

fix

b4d72ea

Signed-off-by: guo-shaoge <shaoge1994@163.com>

Merge branch 'master' of github.com:pingcap/tiflash into optimize_duplicated_agg_func

5b7a84c

guo-shaoge added 6 commits April 28, 2024 15:40

use unordered_map as key_from_agg_func

bd66dec

Signed-off-by: guo-shaoge <shaoge1994@163.com>

reorder collator && fix insert key helper crash

e78cbce

Signed-off-by: guo-shaoge <shaoge1994@163.com>

disable opt for spill process

6ef75b6

Signed-off-by: guo-shaoge <shaoge1994@163.com>

fix prepareBlockAndFill for spill

d009d19

Signed-off-by: guo-shaoge <shaoge1994@163.com>

fmt

881141f

Signed-off-by: guo-shaoge <shaoge1994@163.com>

fix case

9914d0b

Signed-off-by: guo-shaoge <shaoge1994@163.com>

guo-shaoge changed the title ~~Optimize duplicated agg func~~ Optimize unnecessary copy for HashAgg Apr 29, 2024

fix case

76e75fc

Signed-off-by: guo-shaoge <shaoge1994@163.com>

guo-shaoge mentioned this pull request Apr 29, 2024

add bench_aggregator #8941

Closed

12 tasks

guo-shaoge added 2 commits April 29, 2024 13:31

tidy

f0c7302

Signed-off-by: guo-shaoge <shaoge1994@163.com>

refine

d9e8880

Signed-off-by: guo-shaoge <shaoge1994@163.com>

fix case

addd8d3

Signed-off-by: guo-shaoge <shaoge1994@163.com>

fix case

6751837

Signed-off-by: guo-shaoge <shaoge1994@163.com>

guo-shaoge requested review from windtalker and SeaRise April 30, 2024 02:19

guo-shaoge changed the title ~~Optimize unnecessary copy for HashAgg~~ Optimize unnecessary column copy for HashAgg Apr 30, 2024

update comment

ffd5c4e

Signed-off-by: guo-shaoge <shaoge1994@163.com>

SeaRise reviewed May 6, 2024

guo-shaoge added 4 commits May 6, 2024 15:24

agg_func_ref_key optimization; gtest

cbf8de5

Signed-off-by: guo-shaoge <shaoge1994@163.com>

append copy column action after agg

c0cb939

Signed-off-by: guo-shaoge <shaoge1994@163.com>

case && fmt

ede3a80

Signed-off-by: guo-shaoge <shaoge1994@163.com>

Merge branch 'master' of github.com:pingcap/tiflash into optimize_duplicated_agg_func

b62ed01

guo-shaoge force-pushed the optimize_duplicated_agg_func branch from 700902d to b62ed01 Compare May 7, 2024 08:32

guo-shaoge added 4 commits May 7, 2024 18:38

fix case

0d0d74b

Signed-off-by: guo-shaoge <shaoge1994@163.com>

fix case by integrate enable_convert_key_optimization into fianl flag

65e33d5

Signed-off-by: guo-shaoge <shaoge1994@163.com>

refine

b0b1df7

Signed-off-by: guo-shaoge <shaoge1994@163.com>

Merge branch 'master' of github.com:pingcap/tiflash into optimize_duplicated_agg_func

7b1c0b8

guo-shaoge commented May 8, 2024

guo-shaoge requested a review from SeaRise May 8, 2024 08:02

fmt

2240c9b

Signed-off-by: guo-shaoge <shaoge1994@163.com>

guo-shaoge commented May 13, 2024

Update dbms/src/Flash/Coprocessor/AggregationInterpreterHelper.cpp

9a50f8b

guo-shaoge commented May 15, 2024
