
[SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce #46597

Open
wants to merge 4 commits into master
Conversation

GideonPotok
Contributor

@GideonPotok GideonPotok commented May 15, 2024

What changes were proposed in this pull request?

SPARK-47353

Experimental approaches compared:

  • Scala TreeMap (RB Tree)
  • GroupMapReduce <- Most performant
  • GroupMapReduce (Cleaned up) (This PR) <- Most performant
  • Comparing Experimental Approaches

Central Change to Mode eval Algorithm:

  • Update to eval Method: The eval method now checks whether the column being aggregated is a string type with non-default collation and, if so, groups the buffer entries by their collation key before picking the most frequent value:

    buff.toSeq.groupMapReduce {
        case (key: String, _) =>
          CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
        case (key: UTF8String, _) =>
          CollationFactory.getCollationKey(key, collationId)
        case (key, _) => key
      }(x => x)((x, y) => (x._1, x._2 + y._2)).values

Minor Change to Mode:

  • Introduction of collationId: A new lazy value collationId is computed from the dataType of the child expression, used to fetch the appropriate collation comparator when collationEnabled is true.

This PR will fail for complex types containing collated strings; a follow-up PR will implement support for those.
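The groupMapReduce idea can be sketched outside Spark in a few lines of plain Scala (2.13+). In this sketch, lowercasing is only a stand-in for CollationFactory.getCollationKey under a case-insensitive collation such as UTF8_BINARY_LCASE, and the map shape mirrors Mode's value-to-count buffer; none of these names are Spark API.

```scala
object ModeSketch {
  // Stand-in for CollationFactory.getCollationKey under a case-insensitive
  // collation; the real implementation derives a collator-specific key.
  def collationKey(s: String): String = s.toLowerCase

  // buff mirrors Mode's aggregation buffer: observed value -> occurrence count.
  def mode(buff: Map[String, Long]): String =
    buff.toSeq
      .groupMapReduce { case (value, _) => collationKey(value) }(identity) {
        // Merge entries sharing a collation key: keep one representative
        // value and sum the counts (which spelling wins is unspecified).
        (x, y) => (x._1, x._2 + y._2)
      }
      .values
      .maxBy(_._2)
      ._1
}
```

For the buffer Map("Def" -> 1L, "def" -> 1L, "DEF" -> 1L, "abc" -> 2L), the "def" group totals 3, so one of its spellings is returned rather than "abc".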

Unit Test Enhancements: Significant additions to CollationStringExpressionsSuite to test new functionality including:

  • Tests for the Mode function when handling strings with different collation settings.

Benchmark Updates:

  • Enhanced the CollationBenchmark classes to include benchmarks for the new mode functionality with and without collation settings, as well as numerical types.

Why are the changes needed?

  1. Ensures consistency in handling string comparisons under various collation settings.
  2. Improves global usability by enabling compatibility with different collation standards.

Does this PR introduce any user-facing change?

Yes, this PR introduces the following user-facing changes:

  1. Adds a new collationEnabled property to the Mode expression.
  2. Users can now specify collation settings for the Mode expression to customize its behavior.

How was this patch tested?

This patch was tested through a combination of new and existing unit and end-to-end SQL tests.

  1. Unit Tests:
    • CollationStringExpressionsSuite:
      • Brought the newly added tests in line with the design pattern of the existing tests
    • Added multiple test cases to verify that the Mode function correctly handles strings with different collation settings.

Out of scope: special Unicode cases in higher planes.

Tests do not need to cover null handling.

  1. Benchmark Tests:

  2. Manual Testing:

 ./build/mvn -DskipTests clean package
 export SPARK_HOME=/Users/gideon/repos/spark
 $SPARK_HOME/bin/spark-shell

    spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
    import org.apache.spark.sql.types.StringType
    import org.apache.spark.sql.functions
    import org.apache.spark.sql.functions.col
    import spark.implicits._
    val data = Seq("Def", "def", "DEF", "abc", "abc")
    val df = data.toDF("word")
    val dfLC = df.withColumn("word",
      col("word").cast(StringType("UTF8_BINARY_LCASE")))
    val dfLCA = dfLC.agg(functions.mode(col("word")).as("count"))
    dfLCA.show()
/*
BEFORE:
+-----+
|count|
+-----+
|  abc|
+-----+

AFTER:
+-----+
|count|
+-----+
|  Def|
+-----+
*/
  1. Continuous Integration (CI):
    • The patch passed all relevant CI checks, including:
      • Unit test suite
      • Benchmark suite
    • TODO: consider moving the new benchmark to the catalyst module

Was this patch authored or co-authored using generative AI tooling?

Nope!

@github-actions github-actions bot added the SQL label May 15, 2024
@GideonPotok GideonPotok force-pushed the spark_47353_3_clean branch 4 times, most recently from 01c6706 to 365e639 Compare May 15, 2024 17:02
@GideonPotok GideonPotok marked this pull request as ready for review May 16, 2024 21:22
@GideonPotok GideonPotok changed the title [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] May 16, 2024
@GideonPotok GideonPotok force-pushed the spark_47353_3_clean branch 3 times, most recently from 9329234 to ec22116 Compare May 16, 2024 23:34
@GideonPotok
Contributor Author

@uros-db This is all cleaned up. Let's get some of the other reviewers to look at it?

Contributor

@uros-db uros-db left a comment

since Mode expression works with any child expression, and you special-cased handling Strings, how do we handle Array(String) and Struct(String), etc.?

@GideonPotok
Contributor Author

since Mode expression works with any child expression, and you special-cased handling Strings, how do we handle Array(String) and Struct(String), etc.?

In my local tests, I found that Mode performs a byte-by-byte comparison for structs, which does not consider collation. So that is still outstanding. Good catch!

@uros-db There are several strategies we might adopt to handle structs with collated fields. I am looking into implementations. It is potentially straightforward, though it has some gotchas.

Do you feel I should solve that in a separate PR or in this one? I assume you would prefer that it be solved in this PR rather than in a follow-up, right?

@GideonPotok GideonPotok requested a review from uros-db May 17, 2024 20:36
@GideonPotok
Contributor Author

GideonPotok commented May 18, 2024

@uros-db

I have added implementation for mode to support structs with fields with the various collations. Performance is not great, so far.

[info] collation unit benchmarks - mode - 30105 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 30105 elements                     31             32           1          9.8         102.3       1.0X
[info] UNICODE - mode - 30105 elements                                1              1           0        240.4           4.2      24.6X
[info] UTF8_BINARY - mode - 30105 elements                            1              1           0        239.1           4.2      24.5X
[info] UNICODE_CI - mode - 30105 elements                            57             59           2          5.3         189.9       0.5X

I will add the benchmark results from GHA once I get your feedback.

I haven't yet added collation support for mode on array types: the "Collation Support in Spark" design doc says support for that is TBD. So I wanted to check whether you think I should add it now or as a follow-up.

@GideonPotok
Contributor Author

What I would really like to try is moving the collation-support logic into the PartialAggregation stage, by pushing it into Mode.merge and Mode.update. I would use a modified open hash map that hashes on the collation key, plus a separate map from each collation key to one of the actual observed values for that key (which experimentation has shown could work).

But as this has already been a couple of weeks of development, I believe we should, for this PR, confine all the collation logic to the one stage that is never serialized and deserialized: the eval stage. I can then try the approach described above in a follow-up PR, after the already-tested approach in this PR is merged.

@uros-db
Contributor

uros-db commented May 19, 2024

I wouldn't say there's a preference on whether to include both support for string type and complex types within the same PR - if you think that the changes might end up being too large, then it's fine to split it into separate PRs.

However I would say that we need to make sure there's no unexpected behaviour - for example, MODE shouldn't have correct support for collated StringType, but incorrect behaviour for ArrayType(StringType), StructType(...StringType...), etc.

With that in mind, it seems that we should adopt one of two approaches:

  • implement the support for collated StringType in this PR, but fail (throw exception) for complex types that have collated strings
  • implement full support at once

@uros-db
Contributor

uros-db commented May 19, 2024

also note that covering StringTypes which are fields of StructType is not by itself enough - suppose there's a field of StructType that is another StructType that has a field of collated StringType, etc.

same goes for arrays, handling ArrayType(StringType) is not enough by itself - we also need ArrayType(ArrayType(StringType))

in short, I would say that we need a recursive approach to properly handle all possible collated string instances
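That recursion can be sketched on a toy value tree rather than Spark's internal row representation. The names here (Value, CollatedString, collationKey) are illustrative, not Spark API, and lowercasing again stands in for the real collation key.

```scala
sealed trait Value
final case class CollatedString(s: String) extends Value
final case class Arr(elems: Seq[Value]) extends Value
final case class Struct(fields: Seq[Value]) extends Value

// Recursively rewrite every collated string to its collation key, so that
// arbitrarily nested values compare equal exactly when they are equal
// under the collation (covers Array(Array(String)), Struct(Struct(...)), etc.).
def collationKey(v: Value): Value = v match {
  case CollatedString(s) => CollatedString(s.toLowerCase) // stand-in key
  case Arr(elems)        => Arr(elems.map(collationKey))
  case Struct(fields)    => Struct(fields.map(collationKey))
}
```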

@uros-db
Contributor

uros-db commented May 19, 2024

As for changing how Mode.update works in order to inject collationKey, I think that should be enough to do the trick? it seems that Mode.merge should then work by default

but then of course there's the problem of preserving one of the actual values - you correctly noticed that we can't just return collationKey, as that value might not be present in the original array

I suppose a separate map might do the trick here (mapping collationKey to original string value), and since we don't have preference towards which value gets returned, simply returning the first one that appeared is considered correct behaviour
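A minimal sketch of that two-map idea, purely illustrative: the class name is hypothetical, lowercasing stands in for the collation key, and Mode's real buffer works on Spark's internal types rather than plain strings.

```scala
import scala.collection.mutable

class CollatedModeBuffer {
  // Occurrence count per collation key.
  private val counts = mutable.Map.empty[String, Long]
  // First actual value observed for each collation key, so eval never
  // returns a synthetic key that was not present in the input.
  private val firstSeen = mutable.Map.empty[String, String]

  def update(value: String): Unit = {
    val key = value.toLowerCase // stand-in collation key
    counts(key) = counts.getOrElse(key, 0L) + 1L
    firstSeen.getOrElseUpdate(key, value)
  }

  // Return an actual observed value from the most frequent group.
  def eval(): String = firstSeen(counts.maxBy(_._2)._1)
}
```

Since there is no preference among the values that share a collation key, returning the first one that appeared is considered correct, as noted above.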

@GideonPotok
Contributor Author

I wouldn't say there's a preference on whether to include both support for string type and complex types within the same PR - if you think that the changes might end up being too large, then it's fine to split it into separate PRs.

However I would say that we need to make sure there's no unexpected behaviour - for example, MODE shouldn't have correct support for collated StringType, but incorrect behaviour for ArrayType(StringType), StructType(...StringType...), etc.

With that in mind, it seems that we should adopt one of two approaches:

  • implement the support for collated StringType in this PR, but fail (throw exception) for complex types that have collated strings

  • implement full support at once

@uros-db if you are fine with me splitting it into two PRs, that's what I will do! I will modify this PR to fail for complex types that have collated strings. And I will get the PR to implement full (recursive) support for said complex types ready to be reviewed right after this one is merged. I appreciate your flexibility!

@GideonPotok
Contributor Author

@uros-db I have made changes for all but your latest suggestion (re whitelists -- will add that soon)

Commits pushed since the latest review:

  • added checkInputDataTypes to not support complex types containing non-binary collations
  • added struct test stuff
  • Tests pass
  • test structs
  • fix
  • scalastyle
  • Collation Support for Mode
@GideonPotok
Contributor Author

GideonPotok commented May 24, 2024

@uros-db Should I also add collation support to org.apache.spark.sql.catalyst.expressions.aggregate.PandasMode?

The only difference will be

  1. Support for null keys (thus StringType won't necessarily mean all values in the buffer are UTF8String; some might just be null, right?)
  2. PandasMode returns a list of all values that are tied for mode. In that case, should all the tied values be present? E.g. for the pandas_mode of ['a', 'a', 'a', 'b', 'b', 'B'] under UTF8_BINARY_LCASE collation, what do you think pandas_mode should return? If we want to support PandasMode, I can do a little research on what other databases favor for this type of question.
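To make the question concrete, here is one possible semantics sketched in plain Scala: every group tied for the top count contributes its first-seen spelling. This is purely illustrative of the open question, not a proposal for Spark's behavior; lowercasing stands in for the collation key and the function name is made up.

```scala
// Hypothetical collation-aware pandas_mode: group by a stand-in collation
// key (lowercasing), find the maximum group size, and return one actual
// observed value per tied group.
def pandasModeLcase(values: Seq[String]): Seq[String] = {
  val grouped = values.groupBy(_.toLowerCase) // stand-in collation key
  val top = grouped.values.map(_.size).max
  // head is the first occurrence in input order within each group
  grouped.values.filter(_.size == top).map(_.head).toSeq
}
```

Under this sketch, ['a', 'a', 'a', 'b', 'b', 'B'] yields one representative from the 'a' group and one from the 'b' group, rather than every tied spelling.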

@GideonPotok GideonPotok requested a review from uros-db May 24, 2024 16:16
…essions/aggregate/Mode.scala

Co-authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
@GideonPotok
Contributor Author

@uros-db ?

@uros-db
Contributor

uros-db commented May 28, 2024

We can leave PandasMode for a separate PR, but we'll definitely need to take care of it at some point.

Now that you've explored various options and finished the groupMapReduce approach, I think we can call in other SQL team reviewers to take a look at this and provide their feedback: @dbatomic @nikolamand-db @stefankandic @stevomitric

@GideonPotok GideonPotok changed the title [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce May 30, 2024
@GideonPotok
Contributor Author

@uros-db When should I add back support for complex types? Should I wait until we have buy-in for the current approach from @dbatomic @nikolamand-db @stefankandic @stevomitric, or should I do it now?

@GideonPotok
Copy link
Contributor Author

GideonPotok commented May 31, 2024

(I no longer think the code for supporting complex types needs to be a separate PR.)
