
add option to do a spark-submit with a SparkListener to gather events from Spark #113

Open
pjfanning opened this issue Oct 26, 2017 · 5 comments


@pjfanning
Contributor

pjfanning commented Oct 26, 2017

I was at Emily Curtin's Spark Summit Europe presentation today (which was very interesting). An attendee asked if Spark Bench gathered Spark executor metrics.
A SparkListener can be used to gather benchmark data such as how long tasks took to run and how much data was shuffled (basically any data that can be seen in the Spark UI could be picked up and summarised).
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/scheduler/SparkListener.html
spark-submit --conf spark.extraListeners=com.mycompany.MetricsListener
https://github.com/LucaCanali/sparkMeasure has a spark listener that gathers metrics.
https://github.com/groupon/sparklint also has one.
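
A listener along those lines might look roughly like this (just a sketch, not taken from either project; the class name only matches the hypothetical com.mycompany.MetricsListener above):

package com.mycompany

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Rough sketch: accumulate task-level metrics as the application runs.
// A zero-arg constructor is needed so spark.extraListeners can instantiate it.
class MetricsListener extends SparkListener {
  private var taskCount = 0L
  private var totalExecutorRunTimeMs = 0L
  private var shuffleReadBytes = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskCount += 1
    Option(taskEnd.taskMetrics).foreach { m =>
      totalExecutorRunTimeMs += m.executorRunTime
      shuffleReadBytes += m.shuffleReadMetrics.totalBytesRead
    }
  }
}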

One possible design would be to

  • run spark-submit with a SparkListener that outputs the event data (e.g. as CSV)
  • run another Spark job to summarise the event data and include the summary metrics with the other benchmark data

Another approach would be to run Spark with spark.eventLog.enabled=true (and spark.eventLog.dir set) and parse the JSON-lines output. https://github.com/groupon/sparklint also has code that summarises event logs to create metrics.
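
As a rough sketch of that second approach (the event log directory and application id below are made up), the JSON-lines log could be summarised with Spark itself:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, sum}

val spark = SparkSession.builder.appName("event-log-summary").getOrCreate()

// Each line of the event log is a JSON object with an "Event" field;
// task-level metrics live under "Task Metrics".
val events = spark.read.json("/tmp/spark-events/local-1234567890123")

events
  .filter(col("Event") === "SparkListenerTaskEnd")
  .selectExpr("`Task Metrics`.`Executor Run Time` as executorRunTimeMs")
  .agg(
    count("*").as("taskCount"),
    sum("executorRunTimeMs").as("totalExecutorRunTimeMs"))
  .show()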

@ecurtin
Contributor

ecurtin commented Oct 28, 2017

Hi @pjfanning! I'm so glad you thought the talk was interesting :) For anybody else reading who wants to see it, they've told us it will be posted on Nov 3.

What you've outlined here is a great suggestion! While I have not tried it myself yet, adding listeners through the spark-submit conf should already work through existing means, like this:

spark-bench = {
  spark-submit-config = {
    spark-home = // ...
    spark-args = {
      // master, etc
    }
    conf = {
      "spark.extraListeners" = "com.mycompany.MetricsListener"
    }
  }
}

If that works out of the box, then getting that output bundled with the spark-bench output would be the logical next step.

@pjfanning Is this something you'd be interested in investigating?

Thanks again for your helpful suggestion! I am in shaky wifi territory for the next two days but will be back in regular communication after that :)

@pjfanning
Contributor Author

@ecurtin I may not have much time over the coming weeks but if I do find some time, I'll try prototyping something.

@ecurtin
Contributor

ecurtin commented Oct 30, 2017

👍

@pjfanning
Contributor Author

I have a very early prototype at https://github.com/pjfanning/spark-bench/pull/2/files

Running bin/spark-bench.sh examples/minimal-example.conf on a distro built with my change outputs:

+-------+-------------+-------------+------------------+-----+------+------+---+-----------------+-----------------+--------------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+------------------+-------------------------+-------------------+--------------------+
|   name|    timestamp|total_runtime|    pi_approximate|input|output|slices|run|spark.driver.host|spark.driver.port|spark.extraListeners|hive.metastore.warehouse.dir|          spark.jars|      spark.app.name|spark.executor.id|spark.submit.deployMode|spark.master|spark.authenticate|spark.authenticate.secret|       spark.app.id|         description|
+-------+-------------+-------------+------------------+-----+------+------+---+-----------------+-----------------+--------------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+------------------+-------------------------+-------------------+--------------------+
|sparkpi|1509741483169|   1468030834|3.1425311425311424|     |      |    10|  0|    192.168.1.100|            64309|com.ibm.sparktc.s...|        file:/Users/pj.fa...|file:/Users/pj.fa...|com.ibm.sparktc.s...|           driver|                 client|    local[*]|              true|            not.so.secret|local-1509741482934|One run of SparkP...|
+-------+-------------+-------------+------------------+-----+------+------+---+-----------------+-----------------+--------------------+----------------------------+--------------------+--------------------+-----------------+-----------------------+------------+------------------+-------------------------+-------------------+--------------------+

**** MetricsSparkListener ****
stageCount=2
taskCount=11
jobCount=2
executorAddCount=1
executorRemoveCount=0

The aim is to gather more metrics with the listener and to include them alongside the other benchmark results.
This would involve writing the metric data to a file, having spark-bench read that data, and extending the benchmark output with these additional metrics.
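
For example (only a sketch; the output path is a placeholder that spark-bench would need to know about), the listener could dump its summary when the application ends:

import java.io.PrintWriter
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerTaskEnd}

class MetricsSparkListener extends SparkListener {
  private var taskCount = 0L
  private var totalExecutorRunTimeMs = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskCount += 1
    Option(taskEnd.taskMetrics).foreach(m => totalExecutorRunTimeMs += m.executorRunTime)
  }

  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    // Placeholder path: spark-bench would read this file after the spark-submit finishes.
    val out = new PrintWriter("/tmp/spark-bench-metrics.csv")
    try {
      out.println("taskCount,totalExecutorRunTimeMs")
      out.println(s"$taskCount,$totalExecutorRunTimeMs")
    } finally out.close()
  }
}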

@xiandong79

A CSV file recording the durations of all tasks would be better.
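
For example (rough sketch only, with a placeholder output path), the listener could append one row per completed task:

import java.io.{FileWriter, PrintWriter}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: one CSV row per task with its stage, task id and duration.
class TaskDurationCsvListener extends SparkListener {
  private val out = new PrintWriter(new FileWriter("/tmp/task-durations.csv", true))
  out.println("stageId,taskId,durationMs")

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    out.println(s"${taskEnd.stageId},${info.taskId},${info.duration}")
    out.flush()
  }
}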
