Handle subnet lease getting expired #29

Closed
eyakubovich opened this issue Aug 29, 2014 · 6 comments

Comments

@eyakubovich
Contributor

Although flannel starts renewing the lease an hour before it expires, the lease can still be lost, e.g. if the VM is suspended. Flannel should try to get the same subnet assignment if it is still available, and otherwise fall back to a new lease and signal that the subnet has changed.
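
For illustration, the behaviour requested above might be sketched as follows. This is not flannel's code: `LeaseResult`, `reacquire_lease`, and the in-memory set of free subnets are hypothetical names standing in for whatever the real lease store provides.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class LeaseResult:
    subnet: str
    changed: bool  # True when the host could not keep its previous subnet


def reacquire_lease(previous: Optional[str], free_subnets: Set[str]) -> LeaseResult:
    """Prefer the previous subnet; otherwise take a new one and flag the change.

    `free_subnets` is a hypothetical stand-in for the lease store and is
    mutated in place to record the assignment.
    """
    if previous is not None and previous in free_subnets:
        free_subnets.remove(previous)
        return LeaseResult(subnet=previous, changed=False)
    if not free_subnets:
        raise RuntimeError("no subnets available")
    # Pick deterministically so the example is reproducible.
    new_subnet = min(free_subnets)
    free_subnets.remove(new_subnet)
    # A first-time host (previous is None) has nothing to signal.
    return LeaseResult(subnet=new_subnet, changed=previous is not None)
```

The `changed` flag is the "signal": a caller could use it to log loudly or to trigger reconfiguration of containers that still use the old subnet.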

eyakubovich added this to the 1.0 milestone Aug 21, 2015
@macb

macb commented Mar 7, 2016

Is there any work under way for this? It'd be incredibly useful, as right now if a machine loses a lease and gets a new one, any containers on the machine are left with no network connectivity.

@tomdee
Contributor

tomdee commented Apr 27, 2017

One implementation idea for this is in #610

@tomdee
Contributor

tomdee commented Apr 27, 2017

Also see #520 for some good questions about how flannel handles this at the moment.

@tomdee
Contributor

tomdee commented Apr 27, 2017

When fixing this, we should make sure this failure scenario is discussed clearly in the docs.

@rosenhouse

FWIW, the system design that we've converged on for Cloud Foundry is that hosts are preferentially assigned their prior lease, even if it "expired." And if a new host appears, it is assigned a lease in the following priority order:

  • prefer subnets that have never been given out before, or subnets which were explicitly relinquished by a cleanly-terminating host.
  • if none of those exist, only then does the new host take over an expired lease, and in that case it chooses the oldest such lease.
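
A minimal sketch of that priority order, assuming an in-memory lease table. The `State`, `Lease`, and `assign` names are hypothetical, and this is not the Cloud Foundry implementation; it also assumes that relinquishing a lease clears its owner.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    NEVER_USED = "never_used"
    RELINQUISHED = "relinquished"  # released by a cleanly terminating host
    ACTIVE = "active"
    EXPIRED = "expired"


@dataclass
class Lease:
    state: State
    owner: Optional[str] = None
    expired_at: float = 0.0  # only meaningful when state is EXPIRED


def assign(host: str, leases: Dict[str, Lease]) -> str:
    """Return the subnet assigned to `host`, updating `leases` in place."""
    # 1. A returning host gets its prior lease back, even if it expired.
    for subnet, lease in leases.items():
        if lease.owner == host and lease.state in (State.ACTIVE, State.EXPIRED):
            lease.state = State.ACTIVE
            return subnet
    # 2. A new host prefers subnets never given out or cleanly relinquished.
    for subnet in sorted(leases):
        if leases[subnet].state in (State.NEVER_USED, State.RELINQUISHED):
            leases[subnet] = Lease(State.ACTIVE, owner=host)
            return subnet
    # 3. Only then take over an expired lease, choosing the oldest one.
    expired = [(l.expired_at, s) for s, l in leases.items() if l.state is State.EXPIRED]
    if not expired:
        raise RuntimeError("no subnet available")
    _, subnet = min(expired)
    leases[subnet] = Lease(State.ACTIVE, owner=host)
    return subnet
```

Taking the oldest expired lease last is what keeps the chance of "stealing" from a live but partitioned host low: the longer a lease has been expired, the less likely its owner is still running.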

This is meant to minimize the probability that a lease is "stolen" from a live, but partitioned, container host. But if that does occur, once the partition heals and the "victim" host re-connects, it will discover that its lease is no longer valid. In this case, the victim host falls into a special, noisy failure mode which will (1) prevent any new workloads from being scheduled and (2) trigger the orchestration system to evacuate any existing workloads. Once the evacuation is complete, the host will clean up any leftover networking state (e.g. remove the VXLAN device), acquire a new lease for itself and begin accepting new workloads.
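
The recovery sequence for a "victim" host described above could be modelled as a small state machine. The state and event names below are hypothetical labels for the stages in the description, not identifiers from any real system.

```python
from enum import Enum


class HostState(Enum):
    HEALTHY = "healthy"          # lease valid, accepting workloads
    DRAINING = "draining"        # lease lost: no new workloads, evacuation triggered
    CLEANING_UP = "cleaning_up"  # workloads gone: remove network state, get new lease


class Event(Enum):
    LEASE_INVALID = "lease_invalid"              # discovered after the partition heals
    EVACUATION_COMPLETE = "evacuation_complete"
    LEASE_ACQUIRED = "lease_acquired"            # cleanup done and new lease obtained


TRANSITIONS = {
    (HostState.HEALTHY, Event.LEASE_INVALID): HostState.DRAINING,
    (HostState.DRAINING, Event.EVACUATION_COMPLETE): HostState.CLEANING_UP,
    (HostState.CLEANING_UP, Event.LEASE_ACQUIRED): HostState.HEALTHY,
}


def step(state: HostState, event: Event) -> HostState:
    """Advance the host; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def accepts_new_workloads(state: HostState) -> bool:
    """Only a host with a valid lease may be scheduled onto."""
    return state is HostState.HEALTHY
```

Keeping cleanup strictly after evacuation means leftover state such as the VXLAN device is only removed once no workload depends on it.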

We think this is the right plan. Feedback welcome.

mgleung added a commit to mgleung/flannel that referenced this issue Jun 19, 2017
Added feature to allow flannel to restart in case of etcd failures and
still keep the same subnet address for the hosts.

Fixes flannel-io#610 flannel-io#29
mgleung added a commit to mgleung/flannel that referenced this issue Jun 22, 2017
Added feature to allow flannel to restart in case of etcd failures and
still keep the same subnet address for the hosts.

Fixes flannel-io#610 flannel-io#29
@tomdee
Contributor

tomdee commented Jul 12, 2017

This is now fixed in v0.8.0

tomdee closed this as completed Jul 12, 2017