Executor heartbeat timed out error message #38

Open
rajitz opened this issue Apr 21, 2020 · 0 comments

Hi, I'm running the cnv module with the following parameters:

deca-submit --master local[16] --conf spark.local.dir=/data/cnv/temp --conf spark.driver.maxResultSize=0 --conf spark.kryo.registrationRequired=true --executor-memory 32G --driver-memory 16G -- cnv -I $bam_list_dir/test_allrefs_females.list -l -o "CNVs_"$BATCH"_Females_withAllRefSamples.gff3" -L /data/cnv/reference_files/target_padded_exons_with_transcripts.bed

The machine has 40 cores and 64 GB of RAM available. The command fails when it is run concurrently with other tools that take up 32 cores and are also memory-intensive; if no other tools are running, the DECA command above runs to completion. These are the error messages from the failed attempt:

20/04/20 20:16:39 ERROR TaskSetManager: Task 47 in stage 47.0 failed 1 times; aborting job
20/04/20 20:16:41 INFO DAGScheduler: ShuffleMapStage 47 (mapPartitions at Coverage.scala:173) failed in 6957.414 s due to Job aborted due to stage failure: Task 47 in stage 47.0 failed 1 times, most recent failure: Lost task 47.0 in stage 47.0 (TID 3017, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 299196 ms
org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 47.0 failed 1 times, most recent failure: Lost task 47.0 in stage 47.0 (TID 3017, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 299196 ms
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 47 in stage 47.0 failed 1 times, most recent failure: Lost task 47.0 in stage 47.0 (TID 3017, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 299196 ms

As you can see in the command, we are using 16 threads, --executor-memory 32G, and --driver-memory 16G. To ensure that the command runs successfully even when other tools are running, which of these parameters would you recommend decreasing from their current settings? Could you also briefly describe the difference between executor and driver memory? It looks like the executor memory is per process; does that mean per thread?
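
To make the question concrete, here is a minimal sketch of what a reduced-footprint invocation might look like. It rests on several assumptions: in Spark's local mode the driver and executor share one JVM (consistent with `executor driver` in the log above), so the sketch sizes the heap with `--driver-memory` only and leaves `--executor-memory` out; the thread count of 8 and the timeout values are illustrative guesses, not tested recommendations; and `spark.network.timeout` and `spark.executor.heartbeatInterval` are standard Spark settings, but whether raising them avoids this particular failure is unverified. The script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Illustrative values only (assumptions, not tested recommendations).
THREADS=8                # fewer threads than the original local[16]
DRIVER_MEM=16G           # assumed to size the single local-mode JVM heap
NETWORK_TIMEOUT=600s     # spark.network.timeout (Spark default: 120s)
HEARTBEAT_INTERVAL=60s   # spark.executor.heartbeatInterval (Spark default: 10s);
                         # should stay well below the network timeout

# Spark options go before the `--` separator, as in the original command.
SPARK_ARGS="--master local[${THREADS}]"
SPARK_ARGS="${SPARK_ARGS} --driver-memory ${DRIVER_MEM}"
SPARK_ARGS="${SPARK_ARGS} --conf spark.local.dir=/data/cnv/temp"
SPARK_ARGS="${SPARK_ARGS} --conf spark.driver.maxResultSize=0"
SPARK_ARGS="${SPARK_ARGS} --conf spark.kryo.registrationRequired=true"
SPARK_ARGS="${SPARK_ARGS} --conf spark.network.timeout=${NETWORK_TIMEOUT}"
SPARK_ARGS="${SPARK_ARGS} --conf spark.executor.heartbeatInterval=${HEARTBEAT_INTERVAL}"

# Print the command prefix for review instead of executing it.
CMD="deca-submit ${SPARK_ARGS} -- cnv"
echo "$CMD"
```

The `cnv` arguments (`-I`, `-l`, `-o`, `-L`) would follow unchanged from the original command. Lowering the thread count is meant to reduce contention with the other tools; the longer timeout only tolerates slow heartbeats and does not address whatever is starving the JVM.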

Thanks very much.
