
EPIC: Legacy Workload Ports #79

Open · ecurtin opened this issue Sep 20, 2017 · 5 comments
ecurtin (Contributor) commented Sep 20, 2017

Port all workloads available in the legacy version to the new version.

akasaki commented Jan 2, 2018

Hello @ecurtin, I am wondering whether the legacy version is compatible with Spark 2.2. I need more workloads for my thesis experiments.

BTW, thank you so much for taking the time to answer all my questions!

ecurtin (Contributor, Author) commented Jan 3, 2018

@akasaki It depends on what you mean by compatible. Both versions have data generators that output data to disk and workloads that pick up that data and do stuff with it, but they are entirely different code bases. You're totally welcome to try the legacy version if you think it might suit your needs better, just keep in mind that it is unsupported.
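
To illustrate the pattern (just a rough sketch in plain Spark, not the actual code from either version; the path and column names here are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GenerateThenConsume {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("GenerateThenConsume").getOrCreate()

    val dataPath = "/tmp/generated-data.parquet" // hypothetical output location

    // "Data generator" stage: produce synthetic rows and persist them to disk.
    spark.range(0, 1000000L)
      .toDF("id")
      .withColumn("value", sin(col("id")))
      .write.mode("overwrite").parquet(dataPath)

    // "Workload" stage: pick the generated data back up and process it.
    val result = spark.read.parquet(dataPath)
      .groupBy((col("id") % 10).as("bucket"))
      .agg(avg("value").as("avg_value"))

    result.show()
    spark.stop()
  }
}
```

The generator stage and the workload stage are decoupled like that in both versions; they just don't share any code.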

Are there any workloads in particular that are high priority for you?

akasaki commented Jan 3, 2018

@ecurtin I am focusing on a tuning algorithm based on three types of workloads. The journal article (Li M, Tan J, Wang Y, Zhang L, Salapura V. SparkBench: a Spark benchmarking suite characterizing large-scale in-memory data analytics. Cluster Computing. 2017:1-5.) classifies all workloads into three types: memory-intensive, shuffle-intensive, and all-intensive. In the current version, the SQL workload is shuffle-intensive and linear regression is memory-intensive, although linear regression doesn't work in my environment (Issue #134).

I suppose K-means is also memory-intensive, isn't it?

I need one or more all-intensive workloads such as MF and SVD++. I am trying to set up the legacy version.
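
To make concrete what I mean by shuffle-intensive versus memory-intensive, here is a rough sketch of my own (not code from the paper or from spark-bench):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

object IntensityProfiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("IntensityProfiles").getOrCreate()

    val df = spark.range(0, 10000000L).toDF("id")
      .withColumn("key", col("id") % 1000)
      .withColumn("value", rand())

    // Shuffle-intensive: a wide aggregation forces data movement between executors.
    df.groupBy("key").agg(sum("value")).count()

    // Memory-intensive: cache the dataset and scan it repeatedly in memory,
    // the way iterative ML workloads like K-means do.
    val cached = df.persist(StorageLevel.MEMORY_ONLY)
    (1 to 5).foreach(_ => cached.agg(avg("value")).collect())

    spark.stop()
  }
}
```

An all-intensive workload stresses both profiles at once.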

ecurtin (Contributor, Author) commented Jan 3, 2018

SparkPi is included in the current version of Spark-Bench. It's extremely compute-intensive (when used with large parameters) while hardly making use of I/O at all. Basically it computes an approximate value of Pi in a deliberately inefficient manner: https://sparktc.github.io/spark-bench/workloads/sparkpi/
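
Roughly, the idea behind it is something like this (a simplified sketch of the usual Monte Carlo approach, not the exact spark-bench implementation):

```scala
import scala.util.Random
import org.apache.spark.sql.SparkSession

object PiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("PiSketch").getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 100
    val samples = 100000L * slices // larger values make the job more compute-heavy

    // Count random points in the unit square that land inside the unit circle.
    val inside = spark.sparkContext
      .parallelize(1L to samples, slices)
      .map { _ =>
        val x = Random.nextDouble() * 2 - 1
        val y = Random.nextDouble() * 2 - 1
        if (x * x + y * y <= 1) 1L else 0L
      }
      .reduce(_ + _)

    // The ratio of circle area to square area is Pi/4.
    println(s"Pi is roughly ${4.0 * inside / samples}")
    spark.stop()
  }
}
```

All of the work happens in the map tasks, so the job scales with the sample count without reading from disk or shuffling data.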

akasaki commented Jan 4, 2018

@ecurtin I see. I have tried it as a first example, but it doesn't involve any shuffle operations. I am looking for all-intensive (both shuffle-intensive and memory-intensive) workloads that stress both I/O and memory.
