Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize unnecessary column copy for HashAgg #8985

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

guo-shaoge
Copy link
Contributor

@guo-shaoge guo-shaoge commented Apr 25, 2024

What problem does this PR solve?

Issue Number: close #8891

Problem Summary:
When there are group by key in select item(a.k.a. first_row), tiflash have extra agg func, which cause unnecessary copy from HashMap to Column.

What is changed and how it works?

Basic idea:
image

Optimization-1 (with collation):
How:

  1. Detect if there is a first_row agg func in the select item.
  2. If so, ignore any agg func. If not, add the 'any' agg func.
  3. Also, set key_from_agg_func to indicate that this key is equivalent to first_row/any agg func, which can avoid copying this key from the HashMap in subsequent operations. (check DAGExpressionAnalyzer::buildAggGroupBy)
  4. If all keys are included in first_row/any (which is rare, but still can happens), will skip copy keys(template argument skip_serialize_key is true)

Optimization-2(no collation)
The above optimization happens when c2 has collation, if c2 has no collation(its type is not string), the process is as follows. So first_row is deleted.

Results:

  1. 25% improvement
  2. workload:
    1. 20M rows. More rows a different, it means we have very high NDV.
    2. 3 varchar columns, 3 decimal columns, 3 int columns (that means Aggregator will use HashMethodSerialized)

before:
image

after:
image

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Signed-off-by: guo-shaoge <shaoge1994@163.com>
Copy link
Contributor

ti-chi-bot bot commented Apr 25, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from guo-shaoge, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 25, 2024
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge
Copy link
Contributor Author

/run-all-tests

@guo-shaoge
Copy link
Contributor Author

/test all

Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge guo-shaoge changed the title Optimize duplicated agg func Optimize unnecessary copy for HashAgg Apr 29, 2024
@guo-shaoge
Copy link
Contributor Author

/test all

Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge
Copy link
Contributor Author

/test all

@guo-shaoge guo-shaoge mentioned this pull request Apr 29, 2024
12 tasks
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge
Copy link
Contributor Author

/test all

Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge
Copy link
Contributor Author

/test pull-integration-tes

Copy link
Contributor

ti-chi-bot bot commented Apr 29, 2024

@guo-shaoge: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-integration-test
  • /test pull-unit-test

Use /test all to run all jobs.

In response to this:

/test pull-integration-tes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@guo-shaoge
Copy link
Contributor Author

/test pull-integration-test

Copy link
Contributor

ti-chi-bot bot commented Apr 29, 2024

@guo-shaoge: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-integration-test
  • /test pull-unit-test

Use /test all to run all jobs.

In response to this:

/test pull-integration-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge guo-shaoge changed the title Optimize unnecessary copy for HashAgg Optimize unnecessary column copy for HashAgg Apr 30, 2024
Signed-off-by: guo-shaoge <shaoge1994@163.com>
dbms/src/Interpreters/Aggregator.cpp Outdated Show resolved Hide resolved
@@ -575,6 +588,7 @@ void DAGExpressionAnalyzer::buildAggGroupBy(
/// need double check this assumption when we support agg with collation
aggregation_keys.push_back(name);
agg_key_set.emplace(name);
collators.push_back(nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems L591 and L605 have changed the original logic. Was this intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's expected

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, there was a bug in the previous code?

dbms/src/Flash/Coprocessor/DAGExpressionAnalyzer.cpp Outdated Show resolved Hide resolved
aggregated_columns,
false,
context);
auto [first_row_name, first_row_type] = findFirstRow(aggregate_descriptions, name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is first_row_type necessary here? Are the type at L600 and first_row_type always the same? If so, can we just use type?

Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
@guo-shaoge guo-shaoge force-pushed the optimize_duplicated_agg_func branch from 700902d to b62ed01 Compare May 7, 2024 08:32
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
Signed-off-by: guo-shaoge <shaoge1994@163.com>
@@ -575,6 +588,7 @@ void DAGExpressionAnalyzer::buildAggGroupBy(
/// need double check this assumption when we support agg with collation
aggregation_keys.push_back(name);
agg_key_set.emplace(name);
collators.push_back(nullptr);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's expected

dbms/src/Flash/Coprocessor/DAGExpressionAnalyzer.cpp Outdated Show resolved Hide resolved
dbms/src/Interpreters/Aggregator.cpp Outdated Show resolved Hide resolved
@guo-shaoge guo-shaoge requested a review from SeaRise May 8, 2024 08:02
Signed-off-by: guo-shaoge <shaoge1994@163.com>
const Context & context,
const Block & before_agg_header,
size_t before_agg_streams_size,
size_t agg_streams_size,
const Names & key_names,
const KeyRefAggFuncMap & key_ref_agg_func,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comments about the meaning of above two map.

// Before: keys: c1 | c2 | c3
// After: keys: c2 | c1 | c3
// By doing this, when deserialize group by keys from HashMap to columns,
// we only need to handle c2(convert_key_size == 1) and ignore c1/c3.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update comment to clarify that only extract partial groupby keys, and others will be skipped and reference to first_row result column.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid duplicated first_row agg func
2 participants