straggler
Our first solution was to detect slow workers (over 3 iterations, with a run time of >= 60s for each) and then kill them, so that fault tolerance restarts the slow workers on another node.
This solution works, but not well: in a busy shared Hadoop cluster, more workers get restarted, and every restart carries the extra cost of starting a process and loading data into memory.
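The detection rule in this first solution can be sketched roughly as below. This is only an illustration, not Guagua's actual code: the class SlowWorkerDetector and its method names are hypothetical, and it assumes that "3 iterations" means 3 consecutive slow iterations.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Guagua: flags a worker once it has been
// slow for a given number of iterations in a row (assumed interpretation).
class SlowWorkerDetector {
    private final long slowThresholdMillis;  // e.g. 60_000 for the 60s limit
    private final int slowIterationsToKill;  // e.g. 3 iterations
    private final Map<String, Integer> slowStreaks = new HashMap<>();

    SlowWorkerDetector(long slowThresholdMillis, int slowIterationsToKill) {
        this.slowThresholdMillis = slowThresholdMillis;
        this.slowIterationsToKill = slowIterationsToKill;
    }

    /** Records one iteration's run time; returns true if the worker should be killed. */
    boolean record(String workerId, long runTimeMillis) {
        if (runTimeMillis >= slowThresholdMillis) {
            // Extend the streak of slow iterations for this worker.
            int streak = slowStreaks.merge(workerId, 1, Integer::sum);
            return streak >= slowIterationsToKill;
        }
        // A fast iteration resets the streak.
        slowStreaks.remove(workerId);
        return false;
    }
}
```

With a detector built as new SlowWorkerDetector(60_000L, 3), a worker is only reported on its third slow iteration in a row, which is the point where the restart cost described above is paid.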
A more effective straggler mitigation method is simply to skip the slow workers. This is controlled by the parameter guagua.min.workers.ratio, which is 0.95 by default, meaning that in each iteration the master only waits for 95% of the workers to finish. It is very important that only stragglers are skipped. Check this feature in the slide below: the slow worker is skipped in iterations 2 and 3, but later it is good again in iteration 4.
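The arithmetic behind the ratio can be sketched as follows. This is a minimal illustration under stated assumptions, not Guagua's implementation: the class MinWorkersGate and its methods are hypothetical, and rounding the required worker count up is an assumption (the framework may round differently).

```java
// Hypothetical sketch of how a master could apply a minimum-workers ratio.
class MinWorkersGate {
    /** Number of workers that must finish before the master moves on (rounded up, by assumption). */
    static int minWorkers(int totalWorkers, double minWorkersRatio) {
        if (totalWorkers <= 0 || minWorkersRatio <= 0.0 || minWorkersRatio > 1.0) {
            throw new IllegalArgumentException("invalid worker count or ratio");
        }
        return (int) Math.ceil(totalWorkers * minWorkersRatio);
    }

    /** True once enough workers have reported for the current iteration. */
    static boolean canProceed(int finishedWorkers, int totalWorkers, double minWorkersRatio) {
        return finishedWorkers >= minWorkers(totalWorkers, minWorkersRatio);
    }
}
```

For example, with 100 workers and the default ratio of 0.95, the master in this sketch proceeds as soon as 95 workers have finished, so up to 5 stragglers are skipped in that iteration without being killed or restarted.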