Add spark350emr shim layer [EMR] #10202

Open
wants to merge 2 commits into base: branch-24.02

Conversation

mimaomao
Contributor

Summary

This PR adds a new shim layer, spark350emr, which supports running Spark RAPIDS on AWS EMR Spark 3.5.0.

Testing

Unit Testing

I ran the full suite of unit tests with the following command:

mvn clean install

and got the following results:

Run completed in 30 minutes, 23 seconds.
Total number of tests run: 1214
Suites: completed 123, aborted 0
Tests: succeeded 1205, failed 9, canceled 54, ignored 16, pending 0

After some investigation and analysis, I found the following:

AdaptiveQueryExecSuite
- Join partitioned tables DPP fallback *** FAILED ***
- Exchange reuse *** FAILED ***

NOTE: The exception below is found in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
Part of the plan is not columnar class org.apache.spark.sql.execution.FilterExec
SortExecSuite
- IGNORE ORDER: join longs *** FAILED ***
- IGNORE ORDER: join longs multiple batches *** FAILED ***

NOTE: The exception below is found in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
CostBasedOptimizerSuite
- Force section of plan back onto CPU, AQE off *** FAILED ***
- keep CustomShuffleReaderExec on GPU *** FAILED ***
- Compute estimated row count nested joins no broadcast *** FAILED ***

NOTE: The exception below is found in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.FilterExec
JoinsSuite
- IGNORE ORDER: IGNORE ORDER: Test hash join *** FAILED ***
- INCOMPAT, IGNORE ORDER: Test replace sort merge join with hash join *** FAILED ***

NOTE: The exception below is found in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

We started an email thread earlier to discuss these unit test failures, but we have not been able to fix them yet and could use some help with them.
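
In case it helps whoever looks into these, here is a sketch of re-running a single failing suite locally. The -Dbuildver value and the scalatest-maven-plugin -DwildcardSuites property are assumptions on my part about the build wiring, so adjust as needed:

# Assumed invocation; buildver, property name, and suite package are illustrative
mvn test -Dbuildver=350emr -DwildcardSuites=com.nvidia.spark.rapids.AdaptiveQueryExecSuite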

Integration Testing

I ran the full suite of integration tests and found some test failures we've discussed before:

FAILED src/main/python/arithmetic_ops_test.py::test_multiplication[Decimal(38,10)][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Integer-Decimal(30,10)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(10,-2)-Decimal(30,10)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(15,3)-Decimal(30,10)][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(30,12)-Integer][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(30,12)-Decimal(16,7)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(27,7)-Decimal(16,7)][DATAGEN_SEED=0, INJECT_OOM]

These tests fail because AWS EMR back-ported the fix for this issue into Spark 3.5.0. Thus, I marked them as xfail in this change.
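
For reference, the marking keys off the new is_spark_350emr() helper with the standard pytest xfail decorator; a minimal sketch of the pattern is below (the reason string is illustrative wording, not copied from the PR):

# Sketch only: the reason text is illustrative, not the exact wording in this change
@pytest.mark.xfail(condition=is_spark_350emr(),
                   reason='EMR Spark 3.5.0 back-ported the upstream decimal multiplication fix')
def test_multiplication(data_gen):
    ...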

Reproduction

To reproduce my test results, you can use the following environment:

  1. This PR applied as a patch on top of branch-24.02.
  2. AWS EMR Spark: 3.5.0-amzn-0. You can get it from an AWS EMR cluster with release emr-7.0.0.

Also, for your convenience, I've attached this patch and scala-test-detailed-output.log here: unit_tests.zip

Others

For this change to build successfully, we also need the following change:

<!-- before -->
<dependency>
    <groupId>com.nvidia</groupId>
    <artifactId>rapids-4-spark-private_${scala.binary.version}</artifactId>
    <version>${spark-rapids-private.version}</version>
    <classifier>${spark.version.classifier}</classifier>
</dependency>

<!-- after -->
<dependency>
    <groupId>com.nvidia</groupId>
    <artifactId>rapids-4-spark-private_${scala.binary.version}</artifactId>
    <version>${spark-rapids-private.version}</version>
    <classifier>spark350</classifier>
</dependency>

I think this is because there is no such private artifact for spark350emr. Do we need an additional change for this, or will NVIDIA take care of it?

This PR adds a new shim layer, spark350emr, which supports running Spark RAPIDS on AWS EMR Spark 3.5.0.

---------

Signed-off-by: Maomao Min <mimaomao@amazon.com>
@NvTimLiu
Collaborator

NvTimLiu commented Jan 18, 2024

This PR will add a new 350emr shim into branch-24.02; will this be included in our 24.02.0 release?

I suppose compiling and integration tests will need to run on an EMR cluster.

This would be a risk: burndown is only one week away (2024-01-29) according to our release plan, and we may not have time to set up compile/integration-test CI jobs on an EMR cluster.

@@ -160,6 +160,8 @@ def test_subtraction_ansi_no_overflow(data_gen):
_decimal_gen_38_10,
_decimal_gen_38_neg10
], ids=idfn)
@pytest.mark.xfail(condition=is_spark_350emr(),
Collaborator

We should actually have the fix in our spark351 shim (for the Spark 3.5.1 SNAPSHOT), so you should be able to incorporate the multiplication fix into the EMR shim layer. See #9859 and #9962.

Contributor Author

Thanks! I'll try to incorporate the mentioned fix.

Member

2024 copyrights for this and many other files

Contributor Author

Will add.

Comment on lines +960 to +987
<!-- -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>${slf4j.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j2-impl</artifactId>
    <version>${log4j.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>${log4j.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>${log4j.version}</version>
</dependency>
<dependency>
    <!-- API bridge between log4j 1 and 2 -->
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-1.2-api</artifactId>
    <version>${log4j.version}</version>
</dependency>
<!-- -->
Member

Curious why all of these were added? Especially wondering about the log4j dependencies since we normally want to use slf4j APIs when logging instead of hitting log4j.

Normally we leave these sort of dependencies out because we explicitly want to pull in whatever version Spark is using and compile against that. The risk of explicitly pulling this in is that it might conflict with the version used by the particular Spark version we're compiling against and instead of a compile time problem we get a runtime problem.

Contributor Author

I added these because I encountered log4j and slf4j related errors, such as NoSuchMethod errors, when running unit tests. After adding the dependencies, the errors were gone. I agree that this approach might introduce risks. Are there any better approaches we can try?

Member

That seems to imply there's an issue with the test classpath where the APIs that were available at compile-time aren't present at runtime. I'd have Maven dump the compile and test classpaths and see if you can find jars that are missing in test vs. compile that would explain it. We should be picking these up as transitive dependencies from the Spark jars. Maybe there's an issue with the EMR Spark dependencies where that's somehow not the case (but only at test runtime?!).
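
One possible way to do that comparison, as a sketch: the dependency:build-classpath goal and its properties are stock maven-dependency-plugin, but the -Dbuildver value used to select the new shim is an assumption on my part:

# Dump the classpath once per scope, then diff the two lists
mvn dependency:build-classpath -Dbuildver=350emr -DincludeScope=compile -Dmdep.outputFile=compile-cp.txt
mvn dependency:build-classpath -Dbuildver=350emr -DincludeScope=test -Dmdep.outputFile=test-cp.txt
diff <(tr ':' '\n' < compile-cp.txt | sort) <(tr ':' '\n' < test-cp.txt | sort)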

@@ -0,0 +1,53 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
Member

Notice that emr/pom.xml is a new file in this PR with NVIDIA copyrights, but this file does not have the copyright -- intentional?

Contributor Author

I missed the copyright. Will add it.

Member

This file's parent directory is named spark350 instead of spark350emr.

Contributor Author

Will fix.

@@ -0,0 +1,112 @@
/*
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
Member

This appears to be a brand new file.

Suggested change
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
* Copyright (c) 2024, NVIDIA CORPORATION.

Contributor Author

Will fix.

Co-authored-by: Navin Kumar <97137715+NVnavkumar@users.noreply.github.com>
@gerashegalov
Collaborator

For the failing PR checks, you need to run ./build/make-scala-version-build-files.sh 2.13 and merge the modified poms into the PR branch:

./build/make-scala-version-build-files.sh 2.13
