This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

etcd cluster startup flakiness with TLS enabled #2163

Open
jpbetz opened this issue Feb 25, 2020 · 8 comments

Comments

@jpbetz
Contributor

jpbetz commented Feb 25, 2020

Reported by @matte21 on kubernetes/kubernetes#81508 (comment):

I used to deploy a 3-member etcd v3.3.11 cluster on a Kubernetes 1.16 IKS cluster. The etcd cluster was used to store CRs defined via API aggregation, and was managed via CoreOS's etcd-operator v0.9.4.

I had successfully deployed the etcd cluster and accessed the API objects stored in it countless times on the very same IKS cluster, but one day, after deploying it, the weird behavior began. More precisely, the first member would always start up successfully, but the second member encountered an error and got stuck in a failed state without being replaced. Its logs terminated with:

2020-02-10 22:23:40.466759 W | etcdserver: could not get cluster response from https://the-etcd-cluster-qdwdv2ctn7.the-etcd-cluster.example-com.svc:2380: Get https://the-etcd-cluster-qdwdv2ctn7.the-etcd-cluster.example-com.svc:2380/members: EOF
2020-02-10 22:23:40.467349 C | etcdmain: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls

where the-etcd-cluster-qdwdv2ctn7 is the first member of the etcd cluster. The third member's pod was never created. This behavior happened around 80% of the time. In the remaining 20% the cluster came up and operated correctly: the first and second members were up and running immediately, while the third member failed to start, got replaced (too quickly for me to intercept its logs) and then started correctly. Both the second and third members' logs contained messages like:

2020-02-11 14:11:03.413970 I | embed: rejected connection from "172.30.171.153:47446" (error "tls: \"172.30.171.153\" does not match any of DNSNames [\"*.the-etcd-cluster.example-com.svc\" \"*.the-etcd-cluster.example-com.svc.cluster.local\"]", ServerName "the-etcd-cluster-nqkqzxqgzt.the-etcd-cluster.example-com.svc", IPAddresses [], DNSNames ["*.the-etcd-cluster.example-com.svc" "*.the-etcd-cluster.example-com.svc.cluster.local"])
I | embed: rejected connection from "172.30.171.153:47458" (error "tls: \"172.30.171.153\" does not match any of DNSNames [\"*.the-etcd-cluster.example-com.svc\" \"*.the-etcd-cluster.example-com.svc.cluster.local\"]", ServerName "the-etcd-cluster-nqkqzxqgzt.the-etcd-cluster.example-com.svc", IPAddresses [], DNSNames ["*.the-etcd-cluster.example-com.svc" "*.the-etcd-cluster.example-com.svc.cluster.local"])
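For illustration, the rejection above is the kind of SAN mismatch you get when the remote end is identified only by its IP address while the presented certificate carries wildcard DNS SANs and no IP SANs. A minimal Go sketch of that matching behavior, using Go's standard crypto/x509 hostname verification with the SAN list and addresses taken from the log (not a reproduction of etcd's exact code path):

package main

import (
	"crypto/x509"
	"fmt"
)

func main() {
	// A certificate carrying only the wildcard DNS SANs seen in the log above,
	// and no IP SANs (IPAddresses left empty).
	cert := &x509.Certificate{
		DNSNames: []string{
			"*.the-etcd-cluster.example-com.svc",
			"*.the-etcd-cluster.example-com.svc.cluster.local",
		},
	}

	// Verifying against the bare peer IP fails: there is no IP SAN to match.
	fmt.Println(cert.VerifyHostname("172.30.171.153"))

	// Verifying against a per-member DNS name under the wildcard succeeds (nil error).
	fmt.Println(cert.VerifyHostname("the-etcd-cluster-nqkqzxqgzt.the-etcd-cluster.example-com.svc"))
}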

I observed the issue only when TLS was enabled; I did many trials without TLS and the cluster started and operated perfectly fine. I also tried many deployments with TLS on but different etcd versions: I observed the issue with v3.2.0 and v3.4.3, never with v3.1.15, v3.1.16, or v3.1.20.

After some experiments that did not solve the issue, I deleted my IKS cluster and got a new one with Kubernetes 1.15. I haven't seen the issue since.

For completeness:
manifest of the etcd cluster: https://github.com/MikeSpreitzer/kube-examples/blob/add-kos/staging/kos/deploy/etcd-cluster/48-etcd.yaml
manifest of the etcd operator: https://github.com/MikeSpreitzer/kube-examples/blob/add-kos/staging/kos/deploy/etcd-operator/47-eo-theop.yaml

cc @MikeSpreitzer @hexfusion
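For hand debugging, the request that the failing member's log shows ending in EOF can be reproduced with a small Go client speaking the same mutual TLS. A minimal sketch, assuming the peer credentials are available locally as peer-ca.crt, peer.crt, and peer.key (hypothetical file names):

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Hypothetical file names for the peer credentials the joining member would use.
	caPEM, err := os.ReadFile("peer-ca.crt")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	clientCert, err := tls.LoadX509KeyPair("peer.crt", "peer.key")
	if err != nil {
		panic(err)
	}

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:      pool,
				Certificates: []tls.Certificate{clientCert},
			},
		},
	}

	// The same URL the failing member's log shows when it tries to fetch cluster info.
	url := "https://the-etcd-cluster-qdwdv2ctn7.the-etcd-cluster.example-com.svc:2380/members"
	resp, err := client.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}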

@jpbetz
Contributor Author

jpbetz commented Feb 25, 2020

@matte21 Can you provide any details about what certs etcd-operator generated in this case? Do you know what DNS names were in the Subject Alternative Names?

@matte21

matte21 commented Feb 25, 2020

The etcd cluster members used static TLS:

apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: the-etcd-cluster
  namespace: example-com
spec:
  size: 3
  version: "3.3.11"
  TLS:
    static:
      member:
        peerSecret: etcd-peer
        serverSecret: etcd-server
      operatorSecret: etcd-client
...

The certificates and the secrets carrying them were created by a deployment script. The OpenSSL .cnf files for the three certs follow.

Peer:

[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = *.the-etcd-cluster.example-com.svc
DNS.2 = *.the-etcd-cluster.example-com.svc.cluster.local

Server:

[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = *.the-etcd-cluster.example-com.svc
DNS.2 = the-etcd-cluster-client.example-com.svc
DNS.3 = the-etcd-cluster-client
DNS.4 = localhost

Client:

[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
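As a quick sanity check on what actually ends up in a generated certificate, here is a small Go sketch that prints the SANs and tries a per-member hostname against them. The server.crt path is an assumption (whatever the deployment script wrote out), and the member name is just the one from the logs earlier in this issue:

package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

func main() {
	// Assumed output file of the deployment script that used the .cnf above.
	pemBytes, err := os.ReadFile("server.crt")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		panic("no PEM block found in server.crt")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}

	fmt.Println("DNS SANs:", cert.DNSNames)
	fmt.Println("IP SANs: ", cert.IPAddresses)

	// Example per-member name from the logs in this issue; a nil error means it matches a SAN.
	name := "the-etcd-cluster-qdwdv2ctn7.the-etcd-cluster.example-com.svc"
	fmt.Printf("VerifyHostname(%q): %v\n", name, cert.VerifyHostname(name))
}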

@matte21

matte21 commented Feb 25, 2020

One addition: after the second member of the etcd cluster got stuck in a failed state, the etcd-operator logs contained:

time="2020-02-11T13:57:33Z" level=info msg="Start reconciling" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster
time="2020-02-11T13:57:33Z" level=info msg="running members: the-etcd-cluster-bxbxwtwvpf" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster
time="2020-02-11T13:57:33Z" level=info msg="cluster membership: the-etcd-cluster-rnq8c9wtzz,the-etcd-cluster-bxbxwtwvpf" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster
time="2020-02-11T13:57:33Z" level=info msg="Finish reconciling" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster
time="2020-02-11T13:57:33Z" level=error msg="failed to reconcile: lost quorum" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster

@matte21

matte21 commented Feb 25, 2020

At @MikeSpreitzer's prompting, here's a list of issues in which the observed behavior was similar to the one described here:
#1330
#1300
#1962
etcd-io/etcd#8803
etcd-io/etcd#8268
kubernetes/kops#6024

@myazid

myazid commented Mar 24, 2020

Is there a solution for this issue?

@matte21

matte21 commented Mar 25, 2020

@myazid not that I know of. Are you experiencing it? If so, could you post more details?

@myazid

myazid commented Mar 26, 2020

I saw this error in the logs:

2020-03-22 23:32:33.713411 I | embed: rejected connection from "10.244.1.13:53494" (error "tls: \"10.244.1.13\" does not match any of DNSNames [\"*.orchestrator-etcd-cluster.default.svc\" \"*.orchestrator-etcd-cluster.default.svc.cluster.local\"]", ServerName "orchestrator-etcd-cluster-2q52srppbq.orchestrator-etcd-cluster.default.svc", IPAddresses [], DNSNames ["*.orchestrator-etcd-cluster.default.svc" "*.orchestrator-etcd-cluster.default.svc.cluster.local"])

etcd did a reverse lookup by IP, so I tested the lookup by IP myself:

nslookup 10.244.1.3
3.1.244.10.in-addr.arpa    name = 10-244-1-3.orchestrator-etcd-cluster-client.default.svc.cluster.local.

The resulting domain is not in the DNSNames.

The server TLS cert is defined with:
CN: *.orchestrator-etcd-cluster.default.svc
SANs: *.orchestrator-etcd-cluster.default.svc.cluster.local, *.orchestrator-etcd-cluster-client.default.svc, *.orchestrator-etcd-cluster-client.default.svc.cluster.local, orchestrator-etcd-cluster-client.default.svc.cluster.local, localhost

etcd-operator: 0.9.4
etcd: v3.4.5

Running on K8s v1.17.0
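That rejection follows the same pattern as above: the bare pod IP has no IP SAN to match, and the reverse-resolved name lives under the -client service, which neither wildcard in the presented cert covers. A minimal Go sketch of that kind of check (meant to run inside the cluster, where the pod IP reverse-resolves; the SAN list is copied from the rejection log, and this is an illustration rather than etcd's exact code path):

package main

import (
	"crypto/x509"
	"fmt"
	"net"
	"strings"
)

func main() {
	// DNSNames reported in the rejection log; the presented cert has no IP SANs.
	cert := &x509.Certificate{
		DNSNames: []string{
			"*.orchestrator-etcd-cluster.default.svc",
			"*.orchestrator-etcd-cluster.default.svc.cluster.local",
		},
	}

	ip := "10.244.1.13"

	// The bare IP cannot match because there is no IP SAN.
	fmt.Println(cert.VerifyHostname(ip))

	// Reverse-resolve the IP (as nslookup did above) and test each returned name.
	names, err := net.LookupAddr(ip)
	if err != nil {
		fmt.Println("reverse lookup failed:", err)
		return
	}
	for _, name := range names {
		host := strings.TrimSuffix(name, ".") // LookupAddr returns fully qualified names
		fmt.Printf("%s -> %v\n", host, cert.VerifyHostname(host))
	}
}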

@ArneMancofi

Observation: I was experiencing the same problem. However, once I configured PVCs, the etcd pod took a significant time (~50 s) to start after it had been spawned, and in that case the DNS name mismatch did not occur.
Hypothesis: Could it be that when the newly spawned pod connects to the existing pod too quickly, the DNS request is answered by a cache that has not yet picked up the recent IP address/DNS name addition?
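If that hypothesis is right, one way to test it without adding PVCs would be to delay the member's start until cluster DNS has picked up the new pod, for example by waiting for a PTR record for the pod's own IP. A minimal sketch with a hypothetical pod IP and timeout (in a real init container the IP would come from the downward API):

package main

import (
	"fmt"
	"net"
	"time"
)

// waitForReverseDNS polls until podIP reverse-resolves to at least one name,
// i.e. until cluster DNS has picked up the new endpoint, or the timeout expires.
func waitForReverseDNS(podIP string, timeout time.Duration) ([]string, error) {
	deadline := time.Now().Add(timeout)
	for {
		if names, err := net.LookupAddr(podIP); err == nil && len(names) > 0 {
			return names, nil
		}
		if time.Now().After(deadline) {
			return nil, fmt.Errorf("no PTR record for %s within %s", podIP, timeout)
		}
		time.Sleep(2 * time.Second)
	}
}

func main() {
	// Hypothetical pod IP; in practice it would be injected via the downward API (status.podIP).
	names, err := waitForReverseDNS("10.244.1.13", 60*time.Second)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("reverse DNS ready:", names, "- safe to start the etcd member")
}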
