Skip to content

simple implementation for spark watchdogs for etl processing

License

Notifications You must be signed in to change notification settings

IgorBerman/sparkwatchdogs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sparkwatchdogs

simple implementation for spark watchdogs for etl processing.

Motivation

We have production etl jobs that run all day long. The system should be as automatic as possible, without human intervention. We still trying to get this level, so we constantly building additional self-checks or as I call them "watchdogs". Many times it means that just killing some stage/job/application is better than just finding out that due to one stage the whole pipeline did nothing in last few hours and the whole pipeline is in huge delay.

Imprortant notice

See pom.xml, all 3'd party libs are "provided", so to use it you need to add them to your fat jar explicitly or to your deployment lib. This will permit you to control dependencies, without adding to your dependency hell yet another problem.

Example

see org.apache.spark.watchdogs.Watchdogs as usage example

What we have currently

Currently there are two watchodgs:

  • CoresWaitingWatchdog - Probably due to some bug in spark core, sometime there is a situation when the cores are free and available, but application can't get them. In this scenario it waiting for cores without terminating. When such application is killed, next retry to submit exactly same application gets all necessary cores and proceedes without a problem. You can deploy some cron based solution to retry the submission of same application. Pay attention that there are all kind of different problems why this situation is "normal", e.g. you may have firewall problem and driver can't connect to master and get resources(there are plenty of stackoverflow questions regarding this problem). Make sure to use this watchdog if you 100% sure that your cluster setup is ok.
  • SparkStageHangingWatchdog - Sometimes some stage hangs. Due to other reasons(see spark-gotchas) we can't use speculation mode, so we prefer just to cancel some stage(so probably the whole job will be canelled)

About

simple implementation for spark watchdogs for etl processing

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages