How to understand the spark-lens statistics. Is there any wiki? #50

normalscene opened this issue Feb 5, 2020 · 16 comments

@normalscene

Hi @mayurdb / @iamrohit / @rishitesh

We have generated our application's performance report, but we are having trouble understanding the output (HTML) report. We generated the JSON file and uploaded it to your reporting service, which produced a nice HTML report (http://sparklens.qubole.com/report_view/616b961beb11dc0298b1).

Could you please direct us to any documentation that explains how to interpret the output HTML report? Thank you in advance.

Gaurav

@normalscene
Author

Update: I was able to find some explanation of the reports at https://www.qubole.com/blog/introducing-sparklens-report/. That part is fine.

If you take a look at the generated report, some of the values are negative (e.g. under "Efficiency Statistics" > "Driver vs Executor Wallclock Time").

We are not able to understand why the values are negative, and the blog post does not mention this.

@normalscene
Author

Hello - I am following up on my question. Thanks.

@normalscene
Author

Hi, I am seeking help on the generated report. Please advise.

@mayurdb
Collaborator

mayurdb commented Feb 10, 2020

Hi, @normalscene

The negative values for the driver/executor wall-clock time might be an indication of event loss while running the Spark application. In newer releases of Sparklens, we handle event loss gracefully. Were you using an older version when generating the sparklens.json? If so, I would suggest trying with the latest version.

If you are already using the latest version, could you please generate the report once via the command line instead of using sparklens.qubole.com? That will help us narrow the issue down to either the Sparklens reporting service or Sparklens core. You can find the steps to generate the Sparklens command-line report in the README.
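Roughly, the offline report is generated by running the Sparklens jar with its reporter entry point against the sparklens.json file; the exact class name and arguments are described in the README, so please double-check there. A minimal sketch from memory (paths and version are placeholders):

spark-submit \
	--class com.qubole.sparklens.app.ReporterApp \
	sparklens_2.11-0.3.1.jar \
	/path/to/sparklens.json source=sparklens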

@normalscene
Author

Hello @mayurdb

The current version being used is as follows:

gaurav.arya@UB0013:~/mycode/myScalaStuff/sparklens$ cat version.sbt 
version in ThisBuild := "0.3.1"
gaurav.arya@UB0013:~/mycode/myScalaStuff/sparklens$ 

I believe this is the latest version, as I pulled it recently (less than two weeks ago). Please correct me if I am wrong here.

Unfortunately I am having difficulty spotting the command-line reporting steps. Could you kindly point me to the relevant section number?

Just for the record, here is the command I am using to submit the job:

gcloud dataproc jobs submit spark \
	--project our-project \
	--cluster deltest \
	--region us-central1  \
	--properties \
			spark.submit.deployMode=cluster,\
			spark.dynamicAllocation.enabled=false,\
			spark.yarn.maxAppAttempts=1,\
			spark.driver.memory=4G,\
			spark.driver.memoryOverhead=1024m,\
			spark.executor.instances=3,\
			spark.executor.memoryOverhead=1024m,\
			spark.executor.memory=4G,\
			spark.executor.cores=2,\
			spark.driver.cores=1,\
			spark.driver.maxResultSize=2g,\
			spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
	--class com.figmd.someProduct.deletion.dataCleanerMain \
	--jars=gs://our_repository/jars/newDataCleaner.jar,\
			gs://spark-lib/bigquery/spark-bigquery-latest.jar,\
			gs://our_repository/jars/jdbc-postgresql.jar,\
			gs://our_repository/jars/postgresql-42.2.5.jar,\
			gs://our_repository/jars/sparklens_2.11-0.3.1.jar \
	-- cdr 289 AAAA,BBBB our-project CDRDELTEST 20200121 0001

@mayurdb
Collaborator

mayurdb commented Feb 10, 2020

With the above command itself, you should be able to see the report in the driver logs after the application completes.

@normalscene
Author

Oh! Understood. I will have to re-run the job (it takes around 75 minutes) and I will get back after that.

Thanks for your prompt reply.

Regards,
Gaurav

@mayurdb
Collaborator

mayurdb commented Feb 10, 2020

To get more accurate results, before you re-run, check whether there was event loss in the previous runs: grep for the keyword "Dropped" in the driver logs. If there was event loss, try increasing the event queue capacity to a higher value and also increase the driver's memory. The relevant conf is spark.scheduler.listenerbus.eventqueue.capacity.
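For example (the file name and the value here are just illustrative):

# check the driver log for dropped events
grep -i "dropped" driver.log

# if events were dropped, raise the queue size in the --properties list, e.g.
spark.scheduler.listenerbus.eventqueue.capacity=20000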

@normalscene
Author

normalscene commented Feb 10, 2020

@mayurdb

I have re-run the job (this time with a smaller data sample to make the processing quicker than the earlier run, which took around 75 minutes).

Both JSON reporting and command-line statistics were enabled.

Here are my observations:

  1. Looked for "Dropped" (case-insensitive search) in the logs, but we didn't find it, so I guess nothing was dropped.

  2. With or without spark.scheduler.listenerbus.eventqueue.capacity set, the output reports show negative values.

Please advise which logs I can share with you to help investigate this.

Edit: Does this seem like a bug? Please confirm.

Links to generated reports:

http://sparklens.qubole.com/report_view/0cfede16246e4423c402 (without the event queue capacity setting)
http://sparklens.qubole.com/report_view/447874de4f73db369339 (with the event queue capacity setting)

@iamrohit
Contributor

@normalscene
The negative numbers can show up for various reasons. One of the things we try to do is to work out how much time the executors were busy versus how much time the driver was busy. We don't really have any metrics in the event log about what the driver is doing, so driver time is estimated by subtracting the time spent in executors from the total execution time. The negative numbers show up when our estimate of time spent in executors ends up exceeding the total application time.

The main difficulty is figuring out which stages ran in parallel. When we make a mistake in this part, we end up adding the time spent in those stages instead of taking the max among them. As @mayurdb mentioned, another reason is loss of events, where the time doesn't add up because some events are missing. Please share the event log file and I can try to figure this out. Thanks.
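As a rough illustration of how the estimate can go negative (numbers made up): suppose the application ran for 100 seconds and two 60-second stages actually ran in parallel. The correct executor wall-clock contribution is max(60, 60) = 60 seconds, leaving 100 - 60 = 40 seconds attributed to the driver. If we mistakenly add the two stages instead, we count 120 seconds of executor time, and the driver estimate becomes 100 - 120 = -20 seconds.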

@normalscene
Author

@iamrohit

Could you please point me towards the "event log file"? I am not aware of its whereabouts.
Kindly assist and I shall make it available ASAP. Thanks in advance.

Cheers,
Gaurav

@iamrohit
Contributor

@normalscene

You will need to add the following config options to your spark-submit:
spark.eventLog.enabled (true)
spark.eventLog.dir (s3://some/where/)

After the application is complete, the event log file will be available in spark.eventLog.dir.
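Once the job finishes, you can confirm the file actually got written by listing that directory. For example, since your paths are on GCS, something like this (the path is just a placeholder for whatever you set as spark.eventLog.dir):

gsutil ls gs://your-bucket/path/to/eventlogs/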

@normalscene
Author

@iamrohit

Thank you. I will try this out and get back with results shortly.

@normalscene
Author

@iamrohit

I gave it a shot and unfortunately it didn't generate any logs :(.

Allow me to share the command I used, just to double-check whether I fired the right command. Have I missed something?

gcloud dataproc jobs submit spark \
	--project bda-sandbox \
	--cluster cdrperf \
	--region us-central1 \
	--properties \
		spark.eventLog.enabled=true,\
		spark.eventLog.dir=gs://cdrmigration/spark-lens-reports/cdrperftest/eventlogs,\
		spark.sparklens.reporting.disabled=false,\
		spark.sparklens.data.dir=gs://cdrmigration/spark-lens-reports/cdrperftest,\
		spark.extraListeners=com.qubole.sparklens.QuboleJobListener,\
		spark.submit.deployMode=cluster,\
		spark.dynamicAllocation.enabled=false,\
		spark.yarn.maxAppAttempts=1,\
		spark.driver.memory=36G,\
		spark.driver.memoryOverhead=4096m,\
		spark.executor.instances=3,\
		spark.executor.memoryOverhead=4096m,\
		spark.executor.memory=36G,\
		spark.executor.cores=8,\
		spark.driver.cores=4,\
		spark.driver.maxResultSize=10g \
	--jars=gs://cdrmigration/jars/cdrprocprod.jar,gs://cdrmigration/jars/jdbc-postgresql.jar,gs://cdrmigration/jars/postgresql-42.2.5.jar,gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar \
	--class com.figmd.janus.filetocdr.CDRProcessing \
	-- "registry_name" "cdrmigration" jjjjjjjj-cf66-428b-bf02-caa1f608bf36 51dce49e-854a-4f54-8127-34472b12a5ac kkkkkkkk-854a-4f54-8127-34472b12a5ac

@normalscene
Author

@iamrohit

I have captured the Sparklens command-line output. It doesn't give any indication of whether the event log capture was invoked or not.

@iamrohit
Contributor

@normalscene I am not really aware of the internal workings of Dataproc. If you have access to the history server, newer versions allow downloading this file directly from the history server UI.
