diff --git a/docs/proposals/petset.md b/docs/proposals/petset.md
new file mode 100644
index 000000000000..5e5fcf4ce5c8
--- /dev/null
+++ b/docs/proposals/petset.md
@@ -0,0 +1,396 @@


# PetSets: Running pods which need strong identity and storage

## Open Issues

* Add examples
* Discuss failure modes for various types of clusters
* Provide an active-active example
* Templating proposals need to be argued through to reduce options

## Motivation

Many examples of clustered software systems require stronger guarantees per instance than are provided
by a ReplicationController or ReplicaSet. Instances of these systems typically require:

1. Data per instance which should not be lost even if the pod is deleted, typically on a persistent volume
   * Some cluster instances may have tens of TB of stored data - forcing new instances to replicate data
     from other members over the network is onerous
2. A stable and unique identity associated with that instance of the storage - such as a unique member id
3. A consistent network identity that allows other members to locate the instance even if the pod is deleted
4. A predictable number of instances to ensure that systems can form a quorum
   * This may be necessary during initialization
5. The ability to migrate from node to node with a stable network identity (DNS name)
6. The ability to scale up in a controlled fashion; such systems are very rarely scaled down without human
   intervention

Kubernetes should expose a pod controller (a PetSet) that satisfies these requirements in a flexible
manner. It should be easy for users to manage and reason about the behavior of this set. An administrator
with familiarity in a particular cluster system should be able to leverage this controller and its
supporting documentation to run that clustered system on Kubernetes. It is expected that some adaptation
is required to support each new cluster.


## Use Cases

The software listed below forms the primary use cases for a PetSet on the cluster - problems encountered
while adapting these for Kubernetes should be addressed in a final design.

* Quorum with Leader Election
  * MongoDB - in replica set mode forms a quorum with an elected leader, but instances must be preconfigured
    and have stable network identities.
  * ZooKeeper - forms a quorum with an elected leader, but is sensitive to cluster membership changes and
    replacement instances *must* present consistent identities
  * etcd - forms a quorum with an elected leader, can alter cluster membership in a consistent way, and
    requires stable network identities
* Decentralized Quorum
  * Cassandra - allows flexible consistency and distributes data via innate hash ring sharding, is also
    flexible to scaling, and is more likely to support members that come and go. Scale down may trigger
    massive rebalances.
* Active-active
  * Galera - has multiple active masters which must remain in sync
* ???


## Background

ReplicaSets are designed with a weak guarantee - that there should be N replicas of a particular
pod template. Each pod instance varies only by name, and the replication controller errs on the side of
ensuring that N replicas exist as quickly as possible (by creating new pods as soon as old ones begin graceful
deletion, for instance, or by being able to pick arbitrary pods to scale down).
In addition, pods by design
have no stable network identity other than their assigned pod IP, which can change over the lifetime of a pod
resource. ReplicaSets are best leveraged for stateless, shared-nothing, zero-coordination,
embarrassingly-parallel, or fungible software.

While it is possible to emulate the guarantees described above by leveraging multiple replication controllers
(for distinct pod templates and pod identities) and multiple services (for stable network identity), the
resulting objects are hard to maintain and must be copied manually in order to scale a cluster.

By contrast, a DaemonSet *can* offer some of the guarantees above, by leveraging Nodes as stable, long-lived
entities. An administrator might choose a set of nodes, label them a particular way, and create a
DaemonSet that maps pods to each node. The storage of the node itself (which could be network attached
storage, or a local SAN) is the persistent storage. The network identity of the node is the stable
identity. However, while there are examples of clustered software that benefit from close association to
a node, this creates an undue burden on administrators to design their cluster to satisfy these
constraints, when a goal of Kubernetes is to decouple system administration from application management.


## Design Assumptions

* **Specialized Controller** - Rather than increase the complexity of the ReplicaSet to satisfy two distinct
  use cases, create a new resource that assists users in solving this particular problem.
* **No built-in update** - Updating clustered software can be complex, since existing software may dictate
  that certain orchestration occur as each instance is created. For now, assume that updates to the PetSet
  are driven by external or innate orchestration.
* **Safety first** - Running a clustered system on Kubernetes should be no harder
  than running a clustered system off Kube. Authors should be given tools to guard against common cluster
  failure modes (split brain, phantom member) to prevent introducing more failure modes. Sophisticated
  distributed systems designers can implement more advanced solutions than PetSet if necessary -
  new users should not become vulnerable to additional failure modes through an overly flexible design.
* **Limited scaling** - While flexible scaling is important for some clusters, other examples of clusters
  do not change scale without significant external intervention. Human intervention may be required after
  scaling. Changing scale during cluster operation can lead to split brain in quorum systems. It should be
  possible to scale easily, but the details of making that safe belong to the pods and image authors.
* **No generic cluster lifecycle** - Rather than design a general purpose lifecycle for clustered software,
  focus on ensuring the information necessary for the software to function is available. For example,
  rather than providing a "post-creation" hook invoked when the cluster is complete, provide the necessary
  information to the "first" (or last) pod to determine the identity of the remaining cluster members and
  allow it to manage its own initialization.
* **External access direct to cluster members is out of scope** - Exposing pods to consumers that cannot
  access the pod network is out of scope, because we do not currently support headless services being
  exposed via NodePort. A workaround is to allow external clients to access the pod network, or to
  create one NodePort service per member.
  A future design should cover headless external service access.


## Proposed Design

Add a new resource to Kubernetes to represent a set of pods that are individually distinct but where each
individual can safely be replaced - the name **PetSet** (working name) is chosen to convey that the
individual members of the set are themselves "pets" and thus each one is preserved. A relevant analogy
is that a PetSet is composed of pets, but the pets are like goldfish. If you have a blue, red, and
yellow goldfish, and the red goldfish dies, you replace it with another red goldfish and no one would
notice. If you suddenly have three red goldfish, someone will notice.

The PetSet is responsible for creating and maintaining a set of **identities** and ensuring that there is
one pod and zero or more **supporting resources** for each identity. There should never be more than one pod
or unique supporting resource per identity at any one time. A new pod can be created for an identity only
if a previous pod has been fully terminated (reached its graceful termination limit or cleanly exited).

A PetSet has 0..N **members**, each with a unique **identity** - a name that is unique within the set.

```
type PetSet struct {
    ObjectMeta

    Spec PetSetSpec
    ...
}

type PetSetSpec struct {
    // Replicas is the desired number of replicas of the given template.
    // Each replica is assigned a unique name of the form `name-$replica`
    // where replica is in the range `0 - (replicas-1)`.
    Replicas int

    // A label selector that "owns" objects created under this set.
    Selector *LabelSelector

    // Template is the object describing the pod that will be created - each
    // pod created by this set will match the template, but have a unique identity.
    Template *PodTemplateSpec

    // VolumeClaimTemplates is a list of claims that pets are allowed to reference.
    // The PetSet controller is responsible for mapping network identities to
    // claims in a way that maintains the identity of a pet. Every claim in
    // this list must have at least one matching (by name) volumeMount in one
    // container in the template. A claim in this list takes precedence over
    // any volumes in the template with the same name.
    VolumeClaimTemplates []PersistentVolumeClaim

    // ServiceName is the name of the service that governs this PetSet.
    // This service must exist before the PetSet, and is responsible for
    // the network identity of the set. Pets get DNS/hostnames that follow the
    // pattern: pet-specific-string.serviceName.default.svc.cluster.local
    // where "pet-specific-string" is managed by the PetSet controller.
    ServiceName string
}
```

Like a replication controller, a PetSet may be targeted by an autoscaler. The PetSet makes no assumptions
about upgrading or altering the pods in the set for now - instead, the user can trigger graceful deletion
and the PetSet will replace the terminated member with the newer template once it exits. Future proposals
may offer update capabilities. A PetSet requires pods whose RestartPolicy is Always. The addition of
forgiveness may be necessary in the future to increase the safety of the controller recreating pods.
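
To make the spec concrete, the following is a minimal sketch of what a PetSet manifest might look like.
It is an illustration only: the field names mirror the Go spec above, serialized in the lower-case style of
the mysql example later in this document, while the names, labels, image, and storage size are hypothetical.

```
kind: PetSet
metadata:
  name: mongodb
spec:
  replicas: 3
  serviceName: mongodb          # governing service; must exist before the PetSet
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
      - name: mongodb
        image: mongo            # illustrative image
        volumeMounts:
        - name: data            # must match a volume claim template by name
          mountPath: /data/db
  volumeClaimTemplates:
  - metadata:
      name: data                # referenced by the volumeMount above
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 100Gi
```

Under the `name-$replica` convention described above, this set would maintain pods `mongodb-0`, `mongodb-1`,
and `mongodb-2`, each bound to its own claim derived from the `data` claim template.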


### How identities are managed

A key question is whether scaling down a PetSet and then scaling it back up should reuse identities. If not,
scaling down becomes a destructive action (an admin cannot recover by scaling back up). Given the safety
first assumption, identity reuse seems the correct default. This implies that identity assignment should
be deterministic and not subject to controller races (a controller that has crashed during scale up should
assign the same identities on restart, and two concurrent controllers should decide on the same outcome
identities).

The simplest way to manage identities, and the easiest for users to understand, is a numeric identity system
starting at 0 that ranges up to the current replica count and is contiguous.

Future work:

* Cover identity reclamation - cleaning up resources for identities that are no longer in use.
* Allow more sophisticated identity assignment - instead of `{name}-{0 - replicas-1}`, allow subsets and
  complex indexing.

### Controller behavior

When a PetSet is scaled up, the controller must create both pods and supporting resources for
each new identity. The controller must create supporting resources for the pod before creating the
pod. If a supporting resource with the appropriate name already exists, the controller should treat that as
creation succeeding. If a supporting resource cannot be created, the controller should flag an error to
status, back off (like a scheduler or replication controller), and try again later. Each resource created
by a PetSet controller must have a set of labels that match the selector, support orphaning, and have a
controller back reference annotation identifying the owning PetSet by name and UID.

When a PetSet is scaled down, the pod for the removed identity should be deleted. It is less clear what the
controller should do to supporting resources. If every pod requires a PV, and a user accidentally scales
up to N=200 and then back down to N=3, leaving 197 PVs lying around may be undesirable (potential for
abuse). On the other hand, a cluster of 5 that is accidentally scaled down to 3 might be irreparably
destroyed if the PVs for identities 3 and 4 are deleted (they may not be recoverable). For the initial
proposal, leaving the supporting resources is the safest path (safety first), with a potential future policy
applied to the PetSet for how to manage supporting resources (DeleteImmediately, GarbageCollect, Preserve).

The controller should reflect summary counts of resources on the PetSet status to enable clients to easily
understand the current state of the set.

### Parameterizing pod templates and supporting resources

Since each pod needs a unique and distinct identity, and the pod needs to know its own identity, the
PetSet must allow a pod template to be parameterized by the identity assigned to the pod. The pods that
are created should be easily identified by their cluster membership.

Because that pod needs access to stable storage, the PetSet may specify a template for one or more
**persistent volume claims** that can be used for each distinct pod. The name of the volume claim must
match a volume mount within the pod template.

Future work:

* In the future other resources may be added that must also be templated - for instance, secrets (unique
  secret per member), config data (unique config per member), and, further in the future, arbitrary
  extension resources.
* Consider allowing the identity value itself to be passed as an environment variable via the downward API
  (a sketch of what this might look like follows this list).
* Consider allowing per identity values to be specified that are passed to the pod template or volume claim.
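
As a sketch of the downward API idea above: because the pod name already encodes the identity
(`name-$replica`), a pod template can surface that identity to its containers today via the downward API.
This is only one possible approach, and the environment variable name `PET_IDENTITY` is arbitrary.

```
spec:
  containers:
  - name: mongodb
    image: mongo
    env:
    - name: PET_IDENTITY        # arbitrary name; receives e.g. "mongodb-0"
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
```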


### Accessing pods by stable network identity

In order to provide stable network identity, given that pods may not assume their pod IP is constant over
the lifetime of a pod, it must be possible to have a resolvable DNS name for the pod that is tied to the
pod identity. There are two broad classes of clustered services - those that require clients to know
all members of the cluster (load balancer intolerant) and those that are amenable to load balancing.
For the former, clients must also be able to easily enumerate the list of DNS names that represent the
member identities and access them inside the cluster. Within a pod, it must be possible for containers
to find and access the DNS name that identifies the pod to the rest of the cluster.

Since a pod is expected to be controlled by a single controller at a time, it is reasonable for a pod to
have a single identity at a time. Therefore, a service can expose a pod by its identity in a unique
fashion via DNS by leveraging information written to the endpoints by the endpoints controller.

The end result might be DNS resolution as follows:

```
# service mongodb pointing to pods created by PetSet mdb, with identities mdb-0, mdb-1, mdb-2

dig mongodb.namespace.svc.cluster.local +short A
172.130.16.50

dig mdb-0.mongodb.namespace.svc.cluster.local +short A
# IP of pod created for mdb-0

dig mdb-1.mongodb.namespace.svc.cluster.local +short A
# IP of pod created for mdb-1

dig mdb-2.mongodb.namespace.svc.cluster.local +short A
# IP of pod created for mdb-2
```

This is currently implemented via an annotation on pods, which is surfaced to endpoints, and finally
surfaced as DNS on the service that exposes those pods.

```
# The pods created by this PetSet will have the DNS names "mysql-0.db.NAMESPACE.svc.cluster.local"
# and "mysql-1.db.NAMESPACE.svc.cluster.local"
kind: PetSet
metadata:
  name: mysql
spec:
  replicas: 2
  serviceName: db
  template:
    spec:
      containers:
      - image: mysql:latest

# Example pod created by the PetSet
kind: Pod
metadata:
  name: mysql-0
  annotations:
    pod.beta.kubernetes.io/hostname: "mysql-0"
    pod.beta.kubernetes.io/subdomain: db
spec:
  ...
```
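
The governing service itself is not shown above. Per-pod DNS records under a service's domain are provided
by headless services (those with `clusterIP: None`), so the service referenced by `serviceName: db` above
would likely look something like the following sketch. The selector labels and port are illustrative, since
the example template above does not declare labels.

```
kind: Service
metadata:
  name: db                 # must match the PetSet's serviceName
spec:
  clusterIP: None          # headless: DNS records resolve to the member pods directly
  selector:
    app: mysql             # illustrative; would match labels on the PetSet's pod template
  ports:
  - port: 3306
```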


### Preventing duplicate identities

The PetSet controller is expected to execute like other controllers, as a single writer. However, when
designing for safety first, the possibility of the controller running concurrently cannot be overlooked,
and so it is important to ensure that duplicate pod identities cannot be created.

There are two mechanisms to achieve this at the current time. One is to leverage unique names for pods
that carry the identity of the pod - this prevents duplication because etcd 2 can guarantee single
key transactionality. The other is to use the status field of the PetSet to coordinate membership
information. It is possible to leverage both at this time and to encourage users not to assume the pod
name is significant, but users are likely to take what they can get. A downside of using unique names
is that it complicates pre-warming of pods and pod migration - on the other hand, those are also
advanced use cases that might be better solved by another, more specialized controller (a
MigratablePetSet).


### Managing lifecycle of members

The most difficult aspect of managing a PetSet is ensuring that all members see a consistent configuration
state of the set. Without a strongly consistent view of cluster state, most clustered software is
vulnerable to split brain. For example, a new set is created with 3 members. If the node containing the
first member is partitioned from the cluster, that member may not observe the other two members, and thus
forms its own cluster of size 1. The other two members do see the first member in their configuration, so
they form a cluster of size 3. Both clusters appear to have quorum, which can lead to data loss if not
detected.

PetSets should provide basic mechanisms that make a consistent view of cluster state possible,
and in the future provide more tools to reduce the amount of work necessary to monitor and update that
state.

The first mechanism is that the PetSet controller blocks creation of new pods until all previous pods
are reporting a healthy status. The PetSet controller uses the strong serializability of the underlying
etcd storage to ensure that it acts on a consistent view of the cluster membership (the pods and their
status), and serializes the creation of pods based on the health state of other pods. This simplifies
reasoning about how to initialize a PetSet, but is not sufficient to guarantee that split brain does not
occur.

The second mechanism is having each "pet" use the state of the cluster and transform that into cluster
configuration or decisions about membership. This is currently implemented using a sidecar container
that watches the master (via DNS today, although in the future this may be the endpoints directly) to
receive an ordered history of events, and then applies those events safely to the configuration. Note that
for this to be safe, the history received must be strongly consistent (must be the same order of
events from all observers) and the config change must be bounded (an old config version may not
be allowed to exist forever). For now, this is known as a 'babysitter' (working name) and is intended
to help identify abstractions that can be provided by the PetSet controller in the future.
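
For illustration only, the 'babysitter' pattern described above might be wired into a pod template roughly
as follows. Everything here is hypothetical - the image names, the shared config volume, and the choice of
ZooKeeper - and is meant only to show the shape of the sidecar arrangement, not a supported API.

```
kind: PetSet
metadata:
  name: zk
spec:
  replicas: 3
  serviceName: zk
  template:
    spec:
      containers:
      - name: zookeeper
        image: zookeeper            # illustrative image
        volumeMounts:
        - name: config
          mountPath: /conf
      - name: babysitter            # hypothetical sidecar: watches the governing
        image: example/babysitter   # service via DNS and rewrites member config
        volumeMounts:
        - name: config              # shares the config it maintains with the
          mountPath: /conf          # main container
      volumes:
      - name: config
        emptyDir: {}
```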


## Future Evolution

Criteria for advancing to beta:

* Identify common abstraction points from the 'babysitter'
* One non-database example
* Make it easy to deploy PetSets without cloud provider storage being required (local storage)
* Basic rolling-update style upgrades

Criteria for advancing to GA:

* PetSets solve 80% of clustered software configuration with minimal input from users and are safe from
  common split brain problems
  * Several representative examples of PetSets from the community have been proven/tested to be "correct"
    for a variety of partition problems (possibly via Jepsen or similar)
* PetSets are considered easy to use for deploying clustered software for common cases

Requested features:

* IPs per pet, for clustered software like Cassandra that caches resolved DNS addresses, that can be used
  outside the cluster (scope growth)
* Send more / simpler events to each pod from a central spot via the "signal API"
* Persistent local volumes that can leverage local storage
* Allow pods within the PetSet to identify a "leader" in a way that can direct requests from a service to a
  particular member
* Provide upgrades of a PetSet in a controllable way (like Deployments)


## Overlap with other proposals

* Jobs can be used to perform a run-once initialization of the cluster
* Init containers can be used to prime PVs and config with the identity of the pod
* Templates and how fields are overridden in the resulting object should have broad alignment
* DaemonSet defines the core model for how new controllers sit alongside the replication controller and
  how upgrades can be implemented outside of Deployment objects