Feature request: System recovery on node failure #303

Open
a8t3r opened this issue Jul 14, 2015 · 2 comments

@a8t3r

a8t3r commented Jul 14, 2015

Thanks a lot for your product!

Currently I'm looking for troubleshooting and system recovery techniques for a glu infrastructure.
Take a look at Marathon:

Imagine that one of the datacenter workers trips over a power cord and a server gets unplugged.
No problem for Marathon, it moves the affected search service and Rails tasks to a node that has spare capacity. The engineer may be temporarily embarrassed, but Marathon saves him from having to explain a difficult situation!

So, imagine that we have only three working nodes (N1, N2, N3) with three services (S1, S2, S3) running on them, one per node (N1 -> S1, N2 -> S2, N3 -> S3). Resource utilization on the nodes is U1, U2, U3.
At some moment node N2 goes down and service S2 shows the state 'UNDEPLOYED' in the glu console. OK, we've got a problem! Based on the spare resource capacity on the other nodes (for example, U1 << U3), the glu orchestration engine would automatically decide to redeploy service S2 from node N2 to node N1.
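
To make the scenario concrete, here is a minimal sketch of the placement decision described above: when a node fails, its services move to the surviving node with the lowest utilization. This is only an illustration. The function name `replan`, the dictionaries, and the utilization numbers are all hypothetical and are not part of glu.

```python
from typing import Dict


def replan(placement: Dict[str, str],
           utilization: Dict[str, float],
           failed_node: str) -> Dict[str, str]:
    """Return a new service -> node mapping with the failed node's
    services moved to the least-utilized surviving node.

    All names here are hypothetical; glu exposes no such function.
    """
    # Only nodes that are still alive can receive services.
    survivors = {n: u for n, u in utilization.items() if n != failed_node}
    if not survivors:
        raise RuntimeError("no surviving node to redeploy to")

    # Lowest utilization means the most spare capacity.
    target = min(survivors, key=survivors.get)

    # Services on healthy nodes stay where they are.
    return {svc: (target if node == failed_node else node)
            for svc, node in placement.items()}


# The example from this issue: U1 << U3, and N2 goes down.
placement = {"S1": "N1", "S2": "N2", "S3": "N3"}
utilization = {"N1": 0.20, "N2": 0.50, "N3": 0.90}

print(replan(placement, utilization, "N2"))
# {'S1': 'N1', 'S2': 'N1', 'S3': 'N3'}
```

A real decision would also have to account for how much capacity S2 itself needs, which this sketch ignores.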

I can definitely write a custom ZooKeeper listener for that purpose, but I think this feature would be more useful in glu core.

@ypujante
Member

I think I understand what you are asking. glu has been designed from the very beginning to not do anything automatically. It was a conscious decision. glu has been designed as a platform on top of which you can build this kind of behavior.

In your case, what would need to happen: the static model needs to be modified and glu needs to "deploy" it. It is not very obvious how to handle the modification of the model in an automated fashion since it is essentially a black box to glu and would obviously be very different from customer to customer.

Where I think glu could help would be in providing a hook to plug in some behavior that is triggered when errors are detected. As you mention you can listen to ZooKeeper yourself outside of glu but it seems that since glu is notified of errors, it might be easier to provide a hook directly in glu.
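
As a rough sketch of what such a hook could look like: a registry of user-supplied callbacks is invoked when an agent error is detected, and each callback returns a modified copy of the static model, which would then still have to be deployed. Everything here is an assumption for illustration. `ErrorHookRegistry`, `move_entries_to`, and the simplified model shape (a list of entries with `agent` and `mountPoint` keys) are hypothetical and do not describe an existing glu API.

```python
import copy
from typing import Any, Callable, Dict, List

# Simplified, assumed stand-in for a static model: a dict with a list of entries.
Model = Dict[str, Any]
# A hook receives the current model and the failed agent, and returns a new model.
ErrorHook = Callable[[Model, str], Model]


class ErrorHookRegistry:
    """Hypothetical hook point: runs user callbacks when an agent fails."""

    def __init__(self) -> None:
        self._hooks: List[ErrorHook] = []

    def register(self, hook: ErrorHook) -> None:
        self._hooks.append(hook)

    def on_agent_error(self, model: Model, failed_agent: str) -> Model:
        # Each hook sees the model produced by the previous one.
        for hook in self._hooks:
            model = hook(model, failed_agent)
        return model


def move_entries_to(target_agent: str) -> ErrorHook:
    """Build a hook that reassigns the failed agent's entries to target_agent."""
    def hook(model: Model, failed_agent: str) -> Model:
        updated = copy.deepcopy(model)  # never mutate the caller's model
        for entry in updated["entries"]:
            if entry["agent"] == failed_agent:
                entry["agent"] = target_agent
        return updated
    return hook


registry = ErrorHookRegistry()
registry.register(move_entries_to("N1"))

model = {"entries": [
    {"agent": "N1", "mountPoint": "/S1"},
    {"agent": "N2", "mountPoint": "/S2"},
    {"agent": "N3", "mountPoint": "/S3"},
]}

new_model = registry.on_agent_error(model, "N2")
print([e["agent"] for e in new_model["entries"]])
# ['N1', 'N1', 'N3']
```

The policy for rewriting the model lives entirely in the callback, which keeps the customer-specific part outside of glu. Deploying the returned model is deliberately left out of the sketch.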

I will think about it.

Yan

@adamtulinius

You can't just start software up on another node blindly. Yes, Marathon will do that, but it obviously requires the software to be stateless (or at least use non-local storage for whatever it does).

With regard to Mesos, it might be interesting to implement a Mesos framework (or just an executor) for glu. :-)
