Feature request: System recovery on node failure #303

a8t3r · 2015-07-14T10:32:51Z

Thanks a lot for your product!

I'm currently looking into troubleshooting and system recovery techniques for a glu infrastructure.
Take a look at Marathon:

Imagine that one of the datacenter workers trips over a power cord and a server gets unplugged.
No problem for Marathon, it moves the affected search service and Rails tasks to a node that has spare capacity. The engineer may be temporarily embarrassed, but Marathon saves him from having to explain a difficult situation!

So, imagine that we have only three working nodes (N1, N2, N3) with three services (S1, S2, S3) running on them, one per node (N1 -> S1, N2 -> S2, N3 -> S3). Resource utilization on the nodes is U1, U2, U3 respectively.
At some point node N2 goes down and service S2 shows the state 'UNDEPLOYED' in the glu console. OK, we've got a problem! Based on the spare resource capacity on the other nodes (for example, U1 << U3), the glu orchestration engine would automatically decide to redeploy service S2 from node N2 to node N1.
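
To make the decision rule concrete, here is a minimal sketch of the selection step only: among the surviving nodes, pick the one with the lowest utilization. The function name `pick_target_node` and the utilization numbers are hypothetical, chosen for illustration; nothing here is part of glu.

```python
def pick_target_node(utilization, failed_node):
    """Return the surviving node with the lowest utilization.

    `utilization` maps node name -> utilization in [0, 1] (hypothetical values).
    """
    # Exclude the failed node from the candidates.
    candidates = {n: u for n, u in utilization.items() if n != failed_node}
    if not candidates:
        raise ValueError("no surviving node available")
    # The node with the smallest utilization has the most spare capacity.
    return min(candidates, key=candidates.get)


# Illustrative numbers only: U1 << U3, and N2 has failed.
utilization = {"N1": 0.10, "N2": 0.50, "N3": 0.90}
target = pick_target_node(utilization, failed_node="N2")
print(target)  # N1
```

This ignores whether S2 can actually run on the chosen node (state, local storage, dependencies); it only covers the capacity comparison.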

I can certainly write a custom ZooKeeper listener for that purpose, but I think this feature would be more useful in the glu core.

ypujante · 2015-07-14T18:12:16Z

I think I understand what you are asking. glu has been designed from the very beginning to not do anything automatically. It was a conscious decision. glu has been designed as a platform on top of which you can build this kind of behavior.

In your case, two things would need to happen: the static model needs to be modified, and glu needs to "deploy" it. It is not obvious how to modify the model in an automated fashion, since it is essentially a black box to glu and would be very different from customer to customer.
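
As a rough illustration of the "modify the model" step, the sketch below moves every entry assigned to a failed agent onto another agent. It assumes the model is a JSON-like dictionary with an `entries` list whose items carry an `agent` field; the helper `reassign_entries`, the agent names, and the mount point are hypothetical. Pushing the updated model back to glu and triggering the deployment are not shown.

```python
import copy


def reassign_entries(model, failed_agent, target_agent):
    """Return a copy of `model` with the failed agent's entries moved.

    Assumes `model["entries"]` is a list of dicts with an "agent" key.
    """
    updated = copy.deepcopy(model)  # leave the caller's model untouched
    for entry in updated.get("entries", []):
        if entry.get("agent") == failed_agent:
            entry["agent"] = target_agent
    return updated


# Hypothetical model fragment, for illustration only.
model = {
    "entries": [
        {"agent": "N1", "mountPoint": "/s1"},
        {"agent": "N2", "mountPoint": "/s2"},
        {"agent": "N3", "mountPoint": "/s3"},
    ]
}
new_model = reassign_entries(model, failed_agent="N2", target_agent="N1")
print([e["agent"] for e in new_model["entries"]])  # ['N1', 'N1', 'N3']
```

A real model usually carries more per-entry detail than the agent name, which is why a generic rewrite like this one would not fit every setup.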

Where I think glu could help is in providing a hook to plug in some behavior that is triggered when errors are detected. As you mention, you can listen to ZooKeeper yourself outside of glu, but since glu is already notified of errors, it might be easier to provide a hook directly in glu.

I will think about it.

Yan

adamtulinius · 2015-08-28T09:57:16Z

You can't just start software up on another node blindly. Yes, Marathon will do that, but it obviously requires the software to be stateless (or at least use non-local storage for whatever it does).

With regard to Mesos, it might be interesting to implement a Mesos framework (or just an executor) for glu. :-)

ypujante added the feature label Jul 14, 2015
