Update ADAM dependency version to 0.27.0 #33

Open
wants to merge 2 commits into
base: master

Conversation

heuermh
Member

@heuermh heuermh commented May 10, 2019

Fixes #34

Updates the ADAM dependency version to 0.27.0. Note the workaround in the maven-shade-plugin configuration to prevent runtime conflicts with the Parquet and Avro versions.

@heuermh
Member Author

heuermh commented May 10, 2019

@mlinderm Would adding Scala 2.12 support be useful? Spark 2.4.3 supports Scala 2.12, but getting things running on the binary distribution is a nightmare.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/deca-prb/32/

@heuermh heuermh changed the title from "Update ADAM dependency version to 0.27.0-SNAPSHOT" to "Update ADAM dependency version to 0.27.0" May 23, 2019
@heuermh heuermh marked this pull request as ready for review May 23, 2019 15:45
@heuermh heuermh requested review from mlinderm and fnothaft May 23, 2019 15:45
@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/deca-prb/33/

@mlinderm
Collaborator

@heuermh I am observing a substantial performance degradation with the new version. On a 16-core workstation, the time for calling CNVs in 2535 samples went from 9m17s with the old version to 12m22s with the new version (ADAM 0.27 and Spark 2.4.3). The difference seems to be in the PCA/SVD step. Do you have any guesses as to what might have changed between the old Spark/ADAM version and the current version?

@heuermh
Member Author

heuermh commented May 27, 2019

> The difference seems to be in PCA/SVD step. Do you have any guesses as to what might have changed between the old spark/ADAM version the current version?

There are a lot of code changes between 0.24.0 and 0.27.0 in ADAM, but not much that would affect the performance of numerical methods:

bigdatagenomics/adam@maint_spark2_2.11-0.24.0...maint_spark2_2.11-0.27.0

I imagine differences between Spark 2.1.x and 2.4.3 might be more significant though. Is there a smaller benchmark we could use to reproduce what you are seeing, say with fewer samples?

@mlinderm
Collaborator

There are datasets ranging from 500 samples to 2535 samples on the AMP BDG cluster at /user/mlinderman/deca/DATA.<samples>.RD.txt (if those are not easy to obtain, I can post the input files for you). The 500-sample dataset should run for only a few minutes or less. On the workstation, the old version with Spark 2.1.0 took 1m34s for 500 samples, while the new version took 2m17s. I can start working through different Spark versions to see if I observe a change.
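
One way to run such a comparison across Spark distributions is a small timing wrapper. The sketch below is only an illustration and not part of DECA: `time_command` is a hypothetical helper, and it assumes the launch script locates Spark through the `SPARK_HOME` environment variable, which may not hold for every setup.

```python
import os
import subprocess
import time


def time_command(command, spark_home=None, repeats=3):
    """Run `command` `repeats` times and return the fastest wall-clock time in seconds.

    `spark_home`, if given, is exported as SPARK_HOME for the child process.
    That the launch script honors SPARK_HOME is an assumption of this sketch.
    """
    env = dict(os.environ)
    if spark_home is not None:
        env["SPARK_HOME"] = spark_home

    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a failed run is never recorded as a timing.
        subprocess.run(command, env=env, check=True)
        durations.append(time.perf_counter() - start)

    # The minimum is less sensitive to background noise than the mean.
    return min(durations)
```

Calling `time_command` once per Spark installation directory, with the same command line and input file each time, would give directly comparable numbers for the 500-sample dataset.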

@heuermh
Member Author

heuermh commented May 27, 2019

> There are datasets ranging from 500 samples to 2535 samples on the AMP BDG cluster

Great, thanks. I'll also take a look tomorrow.

@mlinderm
Collaborator

I tried several Spark distributions with the original code base (older ADAM). The performance degradation seems to occur between 2.2.3 and 2.3.3; that is, for 500 samples 2.2.3 took 1m38s while 2.3.3 took 2m13s. One guess was that it was caused by an upgrade to Breeze, but changing the Breeze dependency to 0.13.2 with Spark 2.4.3 did not improve performance.
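
For reference, the timings reported in this thread can be put on a common scale as relative slowdowns. This is a minimal sketch; the helper names `parse_duration` and `slowdown_percent` are illustrative and do not come from DECA or ADAM.

```python
import re


def parse_duration(text):
    """Convert a duration such as '9m17s' into a number of seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2))
    return 60 * minutes + seconds


def slowdown_percent(before, after):
    """Relative increase in run time from `before` to `after`, in percent."""
    base = parse_duration(before)
    return 100.0 * (parse_duration(after) - base) / base


# Timings as reported in the comments above.
timings = [
    ("2535 samples, old version vs. ADAM 0.27 + Spark 2.4.3", "9m17s", "12m22s"),
    ("500 samples, old version (Spark 2.1.0) vs. new version", "1m34s", "2m17s"),
    ("500 samples, older ADAM, Spark 2.2.3 vs. 2.3.3", "1m38s", "2m13s"),
]

for label, before, after in timings:
    change = slowdown_percent(before, after)
    print(f"{label}: {before} -> {after} (+{change:.1f}%)")
```

On these numbers the slowdown is roughly 33%, 46%, and 36% respectively, so the Spark 2.2.3 to 2.3.3 step alone is of the same order as the full old-to-new difference.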

Development

Successfully merging this pull request may close these issues.

Update ADAM dependency version to 0.27.0
3 participants