
[SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce #46597

Open
wants to merge 4 commits into master
Conversation

GideonPotok
Contributor

@GideonPotok GideonPotok commented May 15, 2024

What changes were proposed in this pull request?

SPARK-47353

Experimental approaches compared:

  • Scala TreeMap (RB Tree)
  • GroupMapReduce <- Most performant
  • GroupMapReduce (Cleaned up) (This PR) <- Most performant
  • Comparing Experimental Approaches

Central Change to Mode eval Algorithm:

  • Update to eval Method: The eval method now checks whether the column being aggregated is a string type with non-default collation and, if so, groups the buffer entries by their collation key before picking the most frequent value:

    buff.toSeq.groupMapReduce {
        case (key: String, _) =>
          CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
        case (key: UTF8String, _) =>
          CollationFactory.getCollationKey(key, collationId)
        case (key, _) => key
      }(x => x)((x, y) => (x._1, x._2 + y._2)).values

Minor Change to Mode:

  • Introduction of collationId: A new lazy value collationId is computed from the dataType of the child expression, used to fetch the appropriate collation comparator when collationEnabled is true.

This PR will fail for complex types containing collated strings; a follow-up PR will implement support for those.
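The groupMapReduce idea can be sketched outside Spark in a few lines of plain Scala (2.13+). In this sketch, lowercasing is only a stand-in for CollationFactory.getCollationKey under a case-insensitive collation such as UTF8_BINARY_LCASE, and the map shape mirrors Mode's value-to-count buffer; none of these names are Spark API.

```scala
object ModeSketch {
  // Stand-in for CollationFactory.getCollationKey under a case-insensitive
  // collation; the real implementation derives a collator-specific key.
  def collationKey(s: String): String = s.toLowerCase

  // buff mirrors Mode's aggregation buffer: observed value -> occurrence count.
  def mode(buff: Map[String, Long]): String =
    buff.toSeq
      .groupMapReduce { case (value, _) => collationKey(value) }(identity) {
        // Merge entries sharing a collation key: keep one representative
        // value and sum the counts (which spelling wins is unspecified).
        (x, y) => (x._1, x._2 + y._2)
      }
      .values
      .maxBy(_._2)
      ._1
}
```

For the buffer Map("Def" -> 1L, "def" -> 1L, "DEF" -> 1L, "abc" -> 2L), the "def" group totals 3, so one of its spellings is returned rather than "abc".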

Unit Test Enhancements: Significant additions to CollationStringExpressionsSuite to test new functionality including:

  • Tests for the Mode function when handling strings with different collation settings.

Benchmark Updates:

  • Enhanced the CollationBenchmark classes to include benchmarks for the new mode functionality with and without collation settings, as well as numerical types.

Why are the changes needed?

  1. Ensures consistency in handling string comparisons under various collation settings.
  2. Improves global usability by enabling compatibility with different collation standards.

Does this PR introduce any user-facing change?

Yes, this PR introduces the following user-facing changes:

  1. Adds a new collationEnabled property to the Mode expression.
  2. Users can now specify collation settings for the Mode expression to customize its behavior.

How was this patch tested?

This patch was tested through a combination of new and existing unit and end-to-end SQL tests.

  1. Unit Tests:
    • CollationStringExpressionsSuite:
      • Brought the newly added tests in line with the design pattern of the existing tests
    • Added multiple test cases to verify that the Mode function correctly handles strings with different collation settings.

Out of scope: special Unicode cases in higher planes.

Tests do not need to cover null handling.

  1. Benchmark Tests:

  2. Manual Testing:

 ./build/mvn -DskipTests clean package
 export SPARK_HOME=/Users/gideon/repos/spark
 $SPARK_HOME/bin/spark-shell

    spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
    import org.apache.spark.sql.types.StringType
    import org.apache.spark.sql.functions
    import org.apache.spark.sql.functions.col
    import spark.implicits._
    val data = Seq("Def", "def", "DEF", "abc", "abc")
    val df = data.toDF("word")
    val dfLC = df.withColumn("word",
      col("word").cast(StringType("UTF8_BINARY_LCASE")))
    val dfLCA = dfLC.agg(functions.mode(col("word")).as("count"))
    dfLCA.show()
/*
BEFORE:
+-----+
|count|
+-----+
|  abc|
+-----+

AFTER:
+-----+
|count|
+-----+
|  Def|
+-----+
*/
  1. Continuous Integration (CI):
    • The patch passed all relevant CI checks, including:
      • Unit test suite
      • Benchmark suite
    • TODO: consider moving the new benchmark to the catalyst module

Was this patch authored or co-authored using generative AI tooling?

Nope!

@github-actions github-actions bot added the SQL label May 15, 2024
@GideonPotok GideonPotok force-pushed the spark_47353_3_clean branch 4 times, most recently from 01c6706 to 365e639 Compare May 15, 2024 17:02
@GideonPotok GideonPotok marked this pull request as ready for review May 16, 2024 21:22
@GideonPotok GideonPotok changed the title [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [V2] May 16, 2024
@GideonPotok GideonPotok force-pushed the spark_47353_3_clean branch 3 times, most recently from 9329234 to ec22116 Compare May 16, 2024 23:34
@GideonPotok
Contributor Author

@uros-db This is all cleaned up. Let's get some of the other reviewers to look at it?

Contributor

@uros-db uros-db left a comment

since Mode expression works with any child expression, and you special-cased handling Strings, how do we handle Array(String) and Struct(String), etc.?

@GideonPotok
Contributor Author

since Mode expression works with any child expression, and you special-cased handling Strings, how do we handle Array(String) and Struct(String), etc.?

In my local tests, I found that Mode performs a byte-by-byte comparison for structs, which does not consider collation. So that is still outstanding. Good catch!

@uros-db There are several strategies we might adopt to handle structs with collated fields. I am looking into implementations. It is potentially straightforward, though it has some gotchas.

Do you feel I should solve that in a separate PR or in this one? I assume you would prefer that it be solved in this PR rather than in a follow-up, right?

@GideonPotok GideonPotok requested a review from uros-db May 17, 2024 20:36
@GideonPotok
Contributor Author

GideonPotok commented May 18, 2024

@uros-db

I have added implementation for mode to support structs with fields with the various collations. Performance is not great, so far.

[info] collation unit benchmarks - mode - 30105 elements:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ---------------------------------------------------------------------------------------------------------------------------------
[info] UTF8_BINARY_LCASE - mode - 30105 elements                     31             32           1          9.8         102.3       1.0X
[info] UNICODE - mode - 30105 elements                                1              1           0        240.4           4.2      24.6X
[info] UTF8_BINARY - mode - 30105 elements                            1              1           0        239.1           4.2      24.5X
[info] UNICODE_CI - mode - 30105 elements                            57             59           2          5.3         189.9       0.5X

I will add the benchmark results from GHA once I get your feedback.

I haven't yet added collation support for mode on array types: the "Collation Support in Spark" design doc says support for that is TBD. So I wanted to check whether you think I should add it now or as a follow-up.

@GideonPotok
Contributor Author

What I would really like to try is moving the collation-support logic into the PartialAggregation stage, by pushing it into Mode.merge and Mode.update. I would use a modified open hash map that hashes on the collation key, plus a separate map from each collation key to one of the actual observed values for that key (which experimentation has shown could work).

But as this has already been a couple of weeks of development, I believe we should, for this PR, confine all the collation logic to the one stage that is never serialized and deserialized: the eval stage. I can then try the approach described above in a follow-up PR, after the already-tested approach in this PR is merged.

@uros-db
Contributor

uros-db commented May 19, 2024

I wouldn't say there's a preference on whether to include both support for string type and complex types within the same PR - if you think that the changes might end up being too large, then it's fine to split it into separate PRs.

However I would say that we need to make sure there's no unexpected behaviour - for example, MODE shouldn't have correct support for collated StringType, but incorrect behaviour for ArrayType(StringType), StructType(...StringType...), etc.

With that in mind, it seems that we should adopt one of two approaches:

  • implement the support for collated StringType in this PR, but fail (throw exception) for complex types that have collated strings
  • implement full support at once

@uros-db
Contributor

uros-db commented May 19, 2024

also note that covering StringTypes which are fields of StructType is not by itself enough - suppose there's a field of StructType that is another StructType that has a field of collated StringType, etc.

same goes for arrays, handling ArrayType(StringType) is not enough by itself - we also need ArrayType(ArrayType(StringType))

in short, I would say that we need a recursive approach to properly handle all possible collated string instances
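That recursion can be sketched on a toy value tree rather than Spark's internal row representation. The names here (Value, CollatedString, collationKey) are illustrative, not Spark API, and lowercasing again stands in for the real collation key.

```scala
sealed trait Value
final case class CollatedString(s: String) extends Value
final case class Arr(elems: Seq[Value]) extends Value
final case class Struct(fields: Seq[Value]) extends Value

// Recursively rewrite every collated string to its collation key, so that
// arbitrarily nested values compare equal exactly when they are equal
// under the collation (covers Array(Array(String)), Struct(Struct(...)), etc.).
def collationKey(v: Value): Value = v match {
  case CollatedString(s) => CollatedString(s.toLowerCase) // stand-in key
  case Arr(elems)        => Arr(elems.map(collationKey))
  case Struct(fields)    => Struct(fields.map(collationKey))
}
```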

@uros-db
Contributor

uros-db commented May 19, 2024

As for changing how Mode.update works in order to inject collationKey, I think that should be enough to do the trick? it seems that Mode.merge should then work by default

but then of course there's the problem of preserving one of the actual values - you correctly noticed that we can't just return collationKey, as that value might not be present in the original array

I suppose a separate map might do the trick here (mapping collationKey to original string value), and since we don't have preference towards which value gets returned, simply returning the first one that appeared is considered correct behaviour
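A minimal sketch of that two-map idea, purely illustrative: the class name is hypothetical, lowercasing stands in for the collation key, and Mode's real buffer works on Spark's internal types rather than plain strings.

```scala
import scala.collection.mutable

class CollatedModeBuffer {
  // Occurrence count per collation key.
  private val counts = mutable.Map.empty[String, Long]
  // First actual value observed for each collation key, so eval never
  // returns a synthetic key that was not present in the input.
  private val firstSeen = mutable.Map.empty[String, String]

  def update(value: String): Unit = {
    val key = value.toLowerCase // stand-in collation key
    counts(key) = counts.getOrElse(key, 0L) + 1L
    firstSeen.getOrElseUpdate(key, value)
  }

  // Return an actual observed value from the most frequent group.
  def eval(): String = firstSeen(counts.maxBy(_._2)._1)
}
```

Since there is no preference among the values that share a collation key, returning the first one that appeared is considered correct, as noted above.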

@GideonPotok
Contributor Author

I wouldn't say there's a preference on whether to include both support for string type and complex types within the same PR - if you think that the changes might end up being too large, then it's fine to split it into separate PRs.

However I would say that we need to make sure there's no unexpected behaviour - for example, MODE shouldn't have correct support for collated StringType, but incorrect behaviour for ArrayType(StringType), StructType(...StringType...), etc.

With that in mind, it seems that we should adopt one of two approaches:

  • implement the support for collated StringType in this PR, but fail (throw exception) for complex types that have collated strings

  • implement full support at once

@uros-db if you are fine with me splitting it into two PRs, that's what I will do! I will modify this PR to fail for complex types that have collated strings. And I will get the PR to implement full (recursive) support for said complex types ready to be reviewed right after this one is merged. I appreciate your flexibility!

@GideonPotok
Contributor Author

@uros-db I have made changes for all but your latest suggestion (re whitelists -- will add that soon)

Commits pushed since the latest review:

  • added checkInputDataTypes to not support complex types containing non-binary collations
  • added struct test stuff
  • Tests pass
  • test structs
  • fix
  • scalastyle
  • Collation Support for Mode
@GideonPotok
Contributor Author

GideonPotok commented May 24, 2024

@uros-db Should I also add collation support to org.apache.spark.sql.catalyst.expressions.aggregate.PandasMode?

The only difference will be

  1. Support for null keys (thus StringType won't necessarily mean all values in the buffer are UTF8String; some might just be null, right?)
  2. PandasMode returns a list of all values that are tied for mode. In that case, should all the tied values be present? E.g. for the pandas_mode of ['a', 'a', 'a', 'b', 'b', 'B'] under UTF8_BINARY_LCASE collation, what do you think pandas_mode should return? If we want to support PandasMode, I can do a little research on what other databases favor for this type of question.
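To make the question concrete, here is one possible semantics sketched in plain Scala: every group tied for the top count contributes its first-seen spelling. This is purely illustrative of the open question, not a proposal for Spark's behavior; lowercasing stands in for the collation key and the function name is made up.

```scala
// Hypothetical collation-aware pandas_mode: group by a stand-in collation
// key (lowercasing), find the maximum group size, and return one actual
// observed value per tied group.
def pandasModeLcase(values: Seq[String]): Seq[String] = {
  val grouped = values.groupBy(_.toLowerCase) // stand-in collation key
  val top = grouped.values.map(_.size).max
  // head is the first occurrence in input order within each group
  grouped.values.filter(_.size == top).map(_.head).toSeq
}
```

Under this sketch, ['a', 'a', 'a', 'b', 'b', 'B'] yields one representative from the 'a' group and one from the 'b' group, rather than every tied spelling.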

@GideonPotok GideonPotok requested a review from uros-db May 24, 2024 16:16
…essions/aggregate/Mode.scala

Co-authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
@GideonPotok
Contributor Author

@uros-db ?

@uros-db
Contributor

uros-db commented May 28, 2024

We can leave PandasMode for a separate PR, but we'll definitely need to take care of it at some point.

Now that you've explored various options and finished the groupMapReduce approach, I think we can call in other SQL team reviewers to take a look at this and provide their feedback: @dbatomic @nikolamand-db @stefankandic @stevomitric

@GideonPotok GideonPotok changed the title [WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce [SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce May 30, 2024
@GideonPotok
Contributor Author

@uros-db When should I add back support for complex types? Should I wait until we have buy-in for the current approach from @dbatomic @nikolamand-db @stefankandic @stevomitric, or should I do it now?

@GideonPotok
Copy link
Contributor Author

GideonPotok commented May 31, 2024

(I no longer think the code for supporting complex types needs to be a separate PR.)
