Add spark350emr shim layer [EMR] #10202

Open
wants to merge 2 commits into base: branch-24.02

Conversation

mimaomao
Contributor

Summary

This PR adds a new shim layer, spark350emr, which supports running Spark RAPIDS on AWS EMR Spark 3.5.0.

Testing

Unit Testing

I ran the full suite of unit tests with the following command:

mvn clean install

and got the following results:

Run completed in 30 minutes, 23 seconds.
Total number of tests run: 1214
Suites: completed 123, aborted 0
Tests: succeeded 1205, failed 9, canceled 54, ignored 16, pending 0

After some investigation and analysis, I found the following:

AdaptiveQueryExecSuite
- Join partitioned tables DPP fallback *** FAILED ***
- Exchange reuse *** FAILED ***

NOTE: The exception below is found in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
Part of the plan is not columnar class org.apache.spark.sql.execution.FilterExec
SortExecSuite
- IGNORE ORDER: join longs *** FAILED ***
- IGNORE ORDER: join longs multiple batches *** FAILED ***

NOTE: The exception below is found in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec
CostBasedOptimizerSuite
- Force section of plan back onto CPU, AQE off *** FAILED ***
- keep CustomShuffleReaderExec on GPU *** FAILED ***
- Compute estimated row count nested joins no broadcast *** FAILED ***

NOTE: The exception below is found in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.FilterExec
JoinsSuite
- IGNORE ORDER: IGNORE ORDER: Test hash join *** FAILED ***
- INCOMPAT, IGNORE ORDER: Test replace sort merge join with hash join *** FAILED ***

NOTE: The exception below is found in the log.
Part of the plan is not columnar class org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

We started an email thread earlier to discuss these unit test failures, but we have not been able to fix them yet and could use some help with them.
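
In case it helps whoever looks into these, here is a sketch of re-running a single failing suite locally. The -Dbuildver value and the scalatest-maven-plugin -DwildcardSuites property are assumptions on my part about the build wiring, so adjust as needed:

# Assumed invocation; buildver, property name, and suite package are illustrative
mvn test -Dbuildver=350emr -DwildcardSuites=com.nvidia.spark.rapids.AdaptiveQueryExecSuite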

Integration Testing

I ran the full suite of integration tests and found some test failures we've discussed before:

FAILED src/main/python/arithmetic_ops_test.py::test_multiplication[Decimal(38,10)][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Integer-Decimal(30,10)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(10,-2)-Decimal(30,10)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(15,3)-Decimal(30,10)][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(30,12)-Integer][DATAGEN_SEED=0, INJECT_OOM]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(30,12)-Decimal(16,7)][DATAGEN_SEED=0]
FAILED src/main/python/arithmetic_ops_test.py::test_multiplication_mixed[Decimal(27,7)-Decimal(16,7)][DATAGEN_SEED=0, INJECT_OOM]

These tests fail because AWS EMR back-ported the fix for this issue into Spark 3.5.0. Thus, I marked them as xfail in this change.
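
For reference, the marking keys off the new is_spark_350emr() helper with the standard pytest xfail decorator; a minimal sketch of the pattern is below (the reason string is illustrative wording, not copied from the PR):

# Sketch only: the reason text is illustrative, not the exact wording in this change
@pytest.mark.xfail(condition=is_spark_350emr(),
                   reason='EMR Spark 3.5.0 back-ported the upstream decimal multiplication fix')
def test_multiplication(data_gen):
    ...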

Reproduction

To reproduce my test results, you can use the following environment:

  1. This PR applied as a patch on top of branch-24.02.
  2. AWS EMR Spark: 3.5.0-amzn-0. You can get it from an AWS EMR cluster with release emr-7.0.0.

Also, for your convenience, I've attached this patch and scala-test-detailed-output.log here: unit_tests.zip

Others

For this change to build successfully, we also need the following change:

<!-- before -->
<dependency>
    <groupId>com.nvidia</groupId>
    <artifactId>rapids-4-spark-private_${scala.binary.version}</artifactId>
    <version>${spark-rapids-private.version}</version>
    <classifier>${spark.version.classifier}</classifier>
</dependency>

<!-- after -->
<dependency>
    <groupId>com.nvidia</groupId>
    <artifactId>rapids-4-spark-private_${scala.binary.version}</artifactId>
    <version>${spark-rapids-private.version}</version>
    <classifier>spark350</classifier>
</dependency>

I think this is because there is no such private artifact for spark350emr. Do we need an additional change for this, or will NVIDIA take care of it?

This PR adds a new shim layer, spark350emr, which supports running Spark RAPIDS on AWS EMR Spark 3.5.0.

---------

Signed-off-by: Maomao Min <mimaomao@amazon.com>
@NvTimLiu
Collaborator

NvTimLiu commented Jan 18, 2024

This PR will add a new 350emr shim into branch-24.02; will this be included in our 24.02.0 release?

I suppose compiling and integration tests will need to run on an EMR cluster.

This would be a risk: burndown is only one week away (2024-01-29) according to our release plan, and we may not have time to set up compile/integration-test CI jobs on an EMR cluster.

@@ -160,6 +160,8 @@ def test_subtraction_ansi_no_overflow(data_gen):
_decimal_gen_38_10,
_decimal_gen_38_neg10
], ids=idfn)
@pytest.mark.xfail(condition=is_spark_350emr(),
Collaborator

We should actually have the fix in our spark351 shim (for the Spark 3.5.1 SNAPSHOT), so you should be able to incorporate the multiplication fix into the EMR shim layer. See #9859 and #9962.

Contributor Author

Thanks! I'll try to incorporate the mentioned fix.

Member

2024 copyrights for this and many other files

Contributor Author

Will add.

Comment on lines +960 to +987
<!-- -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>${slf4j.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j2-impl</artifactId>
    <version>${log4j.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>${log4j.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>${log4j.version}</version>
</dependency>
<dependency>
    <!-- API bridge between log4j 1 and 2 -->
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-1.2-api</artifactId>
    <version>${log4j.version}</version>
</dependency>
<!-- -->
Member

Curious why all of these were added? Especially wondering about the log4j dependencies since we normally want to use slf4j APIs when logging instead of hitting log4j.

Normally we leave these sort of dependencies out because we explicitly want to pull in whatever version Spark is using and compile against that. The risk of explicitly pulling this in is that it might conflict with the version used by the particular Spark version we're compiling against and instead of a compile time problem we get a runtime problem.

Contributor Author

I added these because I encountered log4j and slf4j related errors, such as NoSuchMethod errors, when running unit tests. After adding the dependencies, the errors were gone. I agree that this approach might introduce risks. Are there any better approaches we can try?

Member

That seems to imply there's an issue with the test classpath where the APIs that were available at compile-time aren't present at runtime. I'd have Maven dump the compile and test classpaths and see if you can find jars that are missing in test vs. compile that would explain it. We should be picking these up as transitive dependencies from the Spark jars. Maybe there's an issue with the EMR Spark dependencies where that's somehow not the case (but only at test runtime?!).
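
One possible way to do that comparison, as a sketch: the dependency:build-classpath goal and its properties are stock maven-dependency-plugin, but the -Dbuildver value used to select the new shim is an assumption on my part:

# Dump the classpath once per scope, then diff the two lists
mvn dependency:build-classpath -Dbuildver=350emr -DincludeScope=compile -Dmdep.outputFile=compile-cp.txt
mvn dependency:build-classpath -Dbuildver=350emr -DincludeScope=test -Dmdep.outputFile=test-cp.txt
diff <(tr ':' '\n' < compile-cp.txt | sort) <(tr ':' '\n' < test-cp.txt | sort)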

@@ -0,0 +1,53 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
Member

Notice that emr/pom.xml is a new file in this PR with NVIDIA copyrights, but this file does not have the copyright -- intentional?

Contributor Author

I missed the copyright. Will add it.

Member

This file's parent directory is named spark350 instead of spark350emr.

Contributor Author

Will fix.

@@ -0,0 +1,112 @@
/*
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
Member

This appears to be a brand new file.

Suggested change
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
* Copyright (c) 2024, NVIDIA CORPORATION.

Contributor Author

Will fix.

Co-authored-by: Navin Kumar <97137715+NVnavkumar@users.noreply.github.com>
@gerashegalov
Collaborator

For the failing PR checks, you need to run ./build/make-scala-version-build-files.sh 2.13 and merge the modified poms into the PR branch:

./build/make-scala-version-build-files.sh 2.13
