pengzhang edited this page Sep 4, 2017 · 1 revision

title: Straggler Mitigation

Guagua

Straggler Mitigation

Straggler Mitigation By Enabling Fault Tolerance

Our first solution is to detect slow workers (a worker whose run time is >= 60s per iteration over 3 iterations) and then kill them, so that fault tolerance restarts each slow worker on another node.

This solution works, but not well: in a busy shared Hadoop cluster, more workers get restarted, and each restart carries extra cost, such as starting a process and loading data into memory.
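The detection rule described above can be sketched in a few lines of Python. This is only an illustration, not Guagua's actual code: the function name `find_slow_workers`, the `run_times` structure, and the choice to look at the most recent 3 iterations are assumptions made for the example. Only the 60s and 3-iteration thresholds come from the text.

```python
SLOW_SECONDS = 60.0   # from the text: run time per iteration >= 60s
SLOW_ITERATIONS = 3   # from the text: observed over 3 iterations


def find_slow_workers(run_times, slow_seconds=SLOW_SECONDS,
                      slow_iterations=SLOW_ITERATIONS):
    """Return the sorted ids of workers that look like stragglers.

    run_times maps a worker id to its list of per-iteration run times
    in seconds (hypothetical structure). A worker counts as slow when
    each of its last `slow_iterations` run times is >= `slow_seconds`
    (using the *last* iterations is an assumption of this sketch).
    """
    slow = []
    for worker_id, times in run_times.items():
        recent = times[-slow_iterations:]
        if len(recent) == slow_iterations and all(t >= slow_seconds for t in recent):
            slow.append(worker_id)
    return sorted(slow)


if __name__ == "__main__":
    observed = {
        "w1": [10.0, 12.0, 11.0],   # healthy
        "w2": [61.0, 75.0, 90.0],   # slow in all 3 iterations
        "w3": [70.0, 20.0, 65.0],   # slow only occasionally
    }
    # Workers returned here are the ones that would be killed and restarted.
    print(find_slow_workers(observed))
```

A worker that is slow only occasionally (like `w3` here) is not flagged, which limits unnecessary restarts, but on a busy cluster many workers can still cross the threshold.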

Slide: Straggler Mitigation By Enabling Fault Tolerance

An Effective Straggler Mitigation Method: Partial Complete

An effective straggler mitigation method is simply to skip the slow workers. This is controlled by the parameter guagua.min.workers.ratio, which defaults to 0.95, meaning that in each iteration the master waits for only 95% of the workers to finish. It is very important that only stragglers are skipped. In the slide below, the slow worker is skipped in iterations 2 and 3, but later it recovers and is included again in iteration 4.

Slide: An Effective Straggler Mitigation Method: Partial Complete
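To make the partial-complete idea concrete, here is a small Python simulation of one iteration. It is a sketch under stated assumptions, not Guagua's implementation: the function `partial_complete`, the `finish_times` structure, and rounding the required worker count up are all hypothetical choices for illustration. Only the parameter name guagua.min.workers.ratio and its 0.95 default come from the text.

```python
import math


def partial_complete(finish_times, min_workers_ratio=0.95):
    """Simulate one iteration in which the master skips stragglers.

    finish_times maps a worker id to the seconds it needs to finish
    this iteration (hypothetical structure). min_workers_ratio plays
    the role of guagua.min.workers.ratio.

    Returns (accepted, skipped, master_wait_seconds).
    """
    total = len(finish_times)
    # Assumption: the required count is rounded up; round() first guards
    # against floating-point noise in the multiplication.
    needed = math.ceil(round(total * min_workers_ratio, 9))
    by_speed = sorted(finish_times, key=finish_times.get)
    accepted, skipped = by_speed[:needed], by_speed[needed:]
    # The master only waits until the last accepted worker has finished.
    master_wait = finish_times[accepted[-1]] if accepted else 0.0
    return accepted, skipped, master_wait


if __name__ == "__main__":
    # 19 normal workers finishing in 10..28s and one straggler at 300s.
    times = {f"w{i}": 10.0 + i for i in range(19)}
    times["w19"] = 300.0
    accepted, skipped, wait = partial_complete(times)
    print(len(accepted), skipped, wait)
```

With 20 workers and a ratio of 0.95, the master proceeds once 19 workers have reported, so a single straggler no longer determines the iteration time. Because the selection is made again in every iteration, a worker skipped in one iteration can be included in a later one once it is fast enough.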