optimize count(1) performance on hive/iceberg table #45242
imay pushed a commit that referenced this issue on May 10, 2024:
…count(1) in hdfs scanner by rewriting plan to `sum` (#43616)

Why I'm doing this:

Right now the hdfs scanner optimization for count(1) outputs a const column of the expected count. In an extreme case (a large dataset), the number of chunks flowing through the pipeline becomes extremely large, and the operator time and overhead time are not negligible. Here is a profile of `select count(*) from hive.hive_ssb100g_parquet.lineorder`; to reproduce the extreme case, I changed the code to scale morsels by 20x and repeat row groups by 10x. In the concurrency=1 case, the total time is 51s:

```
- OverheadTime: 25s37ms
  - __MAX_OF_OverheadTime: 25s111ms
  - __MIN_OF_OverheadTime: 24s962ms
- PullTotalTime: 12s376ms
  - __MAX_OF_PullTotalTime: 13s147ms
  - __MIN_OF_PullTotalTime: 11s885ms
```

What I'm doing:

Rewrite the count(1) query into a sum, so that each row group reader emits only one chunk (size = 1). With this change, the total time is 9s.

The original plan looks like this:

```
+----------------------------------+
| Explain String                   |
+----------------------------------+
| PLAN FRAGMENT 0                  |
|  OUTPUT EXPRS:18: count          |
|   PARTITION: UNPARTITIONED       |
|                                  |
|   RESULT SINK                    |
|                                  |
|   4:AGGREGATE (merge finalize)   |
|   |  output: count(18: count)    |
|   |  group by:                   |
|   |                              |
|   3:EXCHANGE                     |
|                                  |
| PLAN FRAGMENT 1                  |
|  OUTPUT EXPRS:                   |
|   PARTITION: RANDOM              |
|                                  |
|   STREAM DATA SINK               |
|     EXCHANGE ID: 03              |
|     UNPARTITIONED                |
|                                  |
|   2:AGGREGATE (update serialize) |
|   |  output: count(*)            |
|   |  group by:                   |
|   |                              |
|   1:Project                      |
|   |  <slot 20> : 1               |
|   |                              |
|   0:HdfsScanNode                 |
|      TABLE: lineorder            |
|      partitions=1/1              |
|      cardinality=600037902       |
|      avgRowSize=5.0              |
+----------------------------------+
```

And the rewritten plan looks like this:

```
+-----------------------------------+
| Explain String                    |
+-----------------------------------+
| PLAN FRAGMENT 0                   |
|  OUTPUT EXPRS:18: count           |
|   PARTITION: UNPARTITIONED        |
|                                   |
|   RESULT SINK                     |
|                                   |
|   3:AGGREGATE (merge finalize)    |
|   |  output: sum(18: count)       |
|   |  group by:                    |
|   |                               |
|   2:EXCHANGE                      |
|                                   |
| PLAN FRAGMENT 1                   |
|  OUTPUT EXPRS:                    |
|   PARTITION: RANDOM               |
|                                   |
|   STREAM DATA SINK                |
|     EXCHANGE ID: 02               |
|     UNPARTITIONED                 |
|                                   |
|   1:AGGREGATE (update serialize)  |
|   |  output: sum(19: ___count___) |
|   |  group by:                    |
|   |                               |
|   0:HdfsScanNode                  |
|      TABLE: lineorder             |
|      partitions=1/1               |
|      cardinality=1                |
|      avgRowSize=1.0               |
+-----------------------------------+
```

Fixes #45242

Signed-off-by: yanz <dirtysalt1987@gmail.com>
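To make the scanner-side idea concrete, here is a minimal, hypothetical sketch (the class, record, and method names are illustrative, not StarRocks' actual scanner API): for a count-only query, each row-group reader emits a single-row chunk carrying that row group's row count from file metadata, and the update aggregate sums those values instead of counting const-1 rows.

```java
import java.util.Iterator;
import java.util.List;

// Sketch only: contrasts the old const-column behavior with the new
// one-chunk-per-row-group behavior described in the commit message above.
public class CountOnlyScanSketch {
    /** One row group's metadata; only the row count matters here. */
    record RowGroupMeta(long rowCount) {}

    /**
     * New behavior: one single-row "chunk" per row group, holding its row
     * count. The old behavior materialized rowCount const-1 rows, so a
     * 600M-row table pushed on the order of 150k chunks (at a typical
     * 4096-row chunk size) through the pipeline just to be recounted.
     */
    static Iterator<long[]> countOnlyChunks(List<RowGroupMeta> rowGroups) {
        return rowGroups.stream()
                .map(rg -> new long[] { rg.rowCount() }) // chunk of size 1
                .iterator();
    }

    public static void main(String[] args) {
        // Two row groups with 4096 and 1234 rows: the scan emits exactly
        // two chunks, and sum(4096, 1234) = 5330 replaces counting rows.
        List<RowGroupMeta> metas =
                List.of(new RowGroupMeta(4096), new RowGroupMeta(1234));
        long total = 0;
        for (Iterator<long[]> it = countOnlyChunks(metas); it.hasNext(); ) {
            total += it.next()[0];
        }
        System.out.println("count(1) = " + total); // prints: count(1) = 5330
    }
}
```

This also explains the cardinality=1 and avgRowSize=1.0 on the rewritten HdfsScanNode: the scan no longer produces one row per table row, only one small count row per row group reader.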
mergify bot pushed a commit that referenced this issue on May 14, 2024:
(Same commit message as above, cherry-picked as a backport.)

(cherry picked from commit b6ca919)

Conflicts:
- java-extensions/hive-reader/src/main/java/com/starrocks/hive/reader/HiveScanner.java
- test/sql/test_iceberg/R/test_iceberg_catalog
- test/sql/test_iceberg/T/test_iceberg_catalog
dirtysalt added a commit to dirtysalt/starrocks that referenced this issue on May 14, 2024:
(Same commit message as above, referenced as StarRocks#43616. Fixes StarRocks#45242.)

Signed-off-by: yanz <dirtysalt1987@gmail.com>
Signed-off-by: RyanZ <dirtysalt1987@gmail.com>
node pushed a commit to vivo/starrocks that referenced this issue on May 17, 2024:
(Same commit message as above, referenced as StarRocks#43616. Fixes StarRocks#45242.)
Labels: Enhancement

Linked pull requests:
- count(1) in hdfs scanner by rewriting plan to `sum` (backport #43616) #45622
- count(1) in hdfs scanner by rewriting plan to `sum` (backport #43616) #45618
- count(1) in hdfs scanner by rewriting plan to `sum` #43616