Defer more expressions in vectorized groupBy. #16338

Open
wants to merge 7 commits into
base: master

Contributor

@gianm gianm commented Apr 25, 2024

This patch adds a way for columns to provide GroupByVectorColumnSelectors, which control how the groupBy engine operates on them. ExpressionVirtualColumn uses this mechanism to provide an ExpressionDeferredGroupByVectorColumnSelector that uses the inputs of an expression as the grouping key. The actual expression evaluation is deferred until the grouped ResultRow is created.
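To illustrate the idea only, here is a minimal sketch of deferred evaluation outside of Druid. It is not the code in this patch: the class and method names are hypothetical, the "aggregation" is a plain row count, and the final step that merges groups whose distinct inputs produce the same output is an assumption of the sketch rather than a description of how the engine handles that case.

```java
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical illustration; none of these names exist in Druid.
class DeferredGroupBySketch {
    // Eager: evaluate the expression for every row, then group by its output.
    static Map<String, Long> groupEager(
            List<Map.Entry<String, Long>> rows,
            BiFunction<String, Long, String> expr) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Long> row : rows) {
            counts.merge(expr.apply(row.getKey(), row.getValue()), 1L, Long::sum);
        }
        return counts;
    }

    // Deferred: group by the raw inputs, then evaluate the expression once
    // per distinct input combination when producing the result rows.
    static Map<String, Long> groupDeferred(
            List<Map.Entry<String, Long>> rows,
            BiFunction<String, Long, String> expr) {
        Map<Map.Entry<String, Long>, Long> byInputs = new LinkedHashMap<>();
        for (Map.Entry<String, Long> row : rows) {
            // Copy into an immutable entry so the key has stable equals/hashCode.
            byInputs.merge(
                new AbstractMap.SimpleImmutableEntry<>(row.getKey(), row.getValue()),
                1L,
                Long::sum);
        }
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Map.Entry<Map.Entry<String, Long>, Long> group : byInputs.entrySet()) {
            String output = expr.apply(group.getKey().getKey(), group.getKey().getValue());
            // Assumption of this sketch: merge groups that map to the same output.
            counts.merge(output, group.getValue(), Long::sum);
        }
        return counts;
    }
}
```

With an expression such as `(s, l) -> s + "-" + l`, both methods return the same counts; the deferred variant calls the expression once per distinct input pair instead of once per row.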

A new context parameter deferExpressionDimensions allows users to control when this deferred selector is used. The default is fixedWidthNonNumeric, which is a change from the prior behavior. Users can restore the prior behavior by setting this parameter to singleString.
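As a rough sketch of how a client might set this parameter, the helper below builds a SQL request payload with a `query` string and a `context` map. The class and method names are hypothetical, and the list of accepted values is taken only from the benchmark table in this description, so it may not be exhaustive.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Druid or of this patch.
class DeferExpressionContextSketch {
    static final String KEY = "deferExpressionDimensions";

    // Values that appear in the benchmark table of this PR (assumed complete here).
    static final List<String> KNOWN_VALUES =
        Arrays.asList("singleString", "fixedWidth", "fixedWidthNonNumeric", "always");

    // Builds a payload shaped like {"query": ..., "context": {...}}.
    static Map<String, Object> sqlPayload(String sql, String deferMode) {
        if (!KNOWN_VALUES.contains(deferMode)) {
            throw new IllegalArgumentException("Unknown " + KEY + " value: " + deferMode);
        }
        Map<String, Object> context = new LinkedHashMap<>();
        context.put(KEY, deferMode);

        Map<String, Object> payload = new LinkedHashMap<>();
        payload.put("query", sql);
        payload.put("context", context);
        return payload;
    }
}
```

For example, `sqlPayload("SELECT ...", "singleString")` would request the prior behavior for that one query.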

Benchmarks of a few selected queries from SqlExpressionBenchmark follow. Findings:

  • Query 26, GROUP BY CONCAT(string2, '-', long2), speeds up when the expression is deferred.
  • Queries 22, 24, 30, and 31 slow down when the expression is deferred. These are GROUP BY TIME_FLOOR(TIMESTAMPADD(DAY, -1, __time), GROUP BY long1 * long2, GROUP BY CAST(long1 AS BOOLEAN) AND CAST(long2 AS BOOLEAN), and GROUP BY long5 IS NULL, long3 IS NOT NULL. All are simple expressions with numeric inputs and outputs.

For these reasons, I think fixedWidthNonNumeric is a good default.

Benchmark                        (deferExpressionDimensions)  (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt     Score     Error  Units

SqlExpressionBenchmark.querySql                 singleString       22           5000000      auto        force  avgt    5   260.078 ±  14.858  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       22           5000000      auto        force  avgt    5  1970.522 ±  58.400  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       22           5000000      auto        force  avgt    5   263.535 ±   5.549  ms/op
SqlExpressionBenchmark.querySql                       always       22           5000000      auto        force  avgt    5  2021.229 ± 125.010  ms/op

SqlExpressionBenchmark.querySql                 singleString       24           5000000      auto        force  avgt    5   624.300 ±  36.616  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       24           5000000      auto        force  avgt    5   889.836 ±  31.123  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       24           5000000      auto        force  avgt    5   646.920 ±  24.566  ms/op
SqlExpressionBenchmark.querySql                       always       24           5000000      auto        force  avgt    5   890.384 ±  53.748  ms/op

SqlExpressionBenchmark.querySql                 singleString       26           5000000      auto        force  avgt    5   824.417 ±  21.941  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       26           5000000      auto        force  avgt    5   244.232 ±  15.514  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       26           5000000      auto        force  avgt    5   244.598 ±  14.268  ms/op
SqlExpressionBenchmark.querySql                       always       26           5000000      auto        force  avgt    5   248.505 ±   8.004  ms/op

SqlExpressionBenchmark.querySql                 singleString       30           5000000      auto        force  avgt    5   223.687 ±   9.362  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       30           5000000      auto        force  avgt    5   562.844 ±  42.288  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       30           5000000      auto        force  avgt    5   227.850 ±   3.374  ms/op
SqlExpressionBenchmark.querySql                       always       30           5000000      auto        force  avgt    5   562.631 ±  69.408  ms/op

SqlExpressionBenchmark.querySql                 singleString       31           5000000      auto        force  avgt    5   324.208 ±   9.420  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       31           5000000      auto        force  avgt    5  1271.630 ±  87.264  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       31           5000000      auto        force  avgt    5   323.169 ±   6.383  ms/op
SqlExpressionBenchmark.querySql                       always       31           5000000      auto        force  avgt    5  1185.118 ±  34.146  ms/op
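For a quick read of the scores above, the sketch below divides each fixedWidth score by the matching singleString score. The class name is hypothetical and the inputs are copied from the table; error margins are ignored, so the ratios are only rough indicators.

```java
// Hypothetical helper for reading the benchmark table; not part of the patch.
class DeferBenchmarkRatioSketch {
    // Ratio above 1.0 means the deferred (fixedWidth) run was slower than singleString.
    static double ratio(double fixedWidthMs, double singleStringMs) {
        return fixedWidthMs / singleStringMs;
    }

    public static void main(String[] args) {
        // Scores in ms/op, copied from the table above.
        System.out.printf("query 22: %.2fx%n", ratio(1970.522, 260.078));
        System.out.printf("query 24: %.2fx%n", ratio(889.836, 624.300));
        System.out.printf("query 26: %.2fx%n", ratio(244.232, 824.417));
        System.out.printf("query 30: %.2fx%n", ratio(562.844, 223.687));
        System.out.printf("query 31: %.2fx%n", ratio(1271.630, 324.208));
    }
}
```

Only query 26 comes out below 1.0, which matches the findings listed earlier.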

Contributor

There's also SqlGroupByBenchmark, which benchmarks the code with various distributions and cardinalities. Maybe we should also benchmark this change with string columns and different parameters.

Contributor Author

Are you suggesting adding a new @Benchmark method to SqlGroupByBenchmark that uses a SQL query with expressions?
