
Pod is ready, but readiness probe is unhealthy for both server and client #687

Closed
yevgeniyo opened this issue Nov 16, 2020 · 3 comments


yevgeniyo commented Nov 16, 2020

Chart 0.26.0
Kubernetes: EKS 1.18

Pod is ready, but readiness probe is unhealthy for both server and client

kubectl get pod
                                                                                                                                                                                   
NAME                               READY   STATUS    RESTARTS   AGE
hashicorp-consul-consul-hbdbr      1/1     Running   0          41m
hashicorp-consul-consul-nxfrv      1/1     Running   0          44m
hashicorp-consul-consul-q558m      1/1     Running   0          41m
hashicorp-consul-consul-rrlmq      1/1     Running   0          44m
hashicorp-consul-consul-server-0   1/1     Running   0          44m
hashicorp-consul-consul-server-1   1/1     Running   0          44m
hashicorp-consul-consul-server-2   1/1     Running   0          44m
kubectl describe pod hashicorp-consul-consul-server-0
                                                                                                                                                    
Name:         hashicorp-consul-consul-server-0
Namespace:    consul
Priority:     0
Node:         ip-10-211-1-106.ec2.internal/10.211.1.106
Start Time:   Mon, 16 Nov 2020 13:29:05 +0200
Labels:       app=consul
              chart=consul-helm
              component=server
              controller-revision-hash=hashicorp-consul-consul-server-5548d7f9d6
              hasDNS=true
              release=hashicorp-consul
              statefulset.kubernetes.io/pod-name=hashicorp-consul-consul-server-0
Annotations:  consul.hashicorp.com/config-checksum: ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356
              consul.hashicorp.com/connect-inject: false
              kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.211.1.11
IPs:
  IP:           10.211.1.11
Controlled By:  StatefulSet/hashicorp-consul-consul-server
Containers:
  consul:
    Container ID:  docker://cff54250f5f530915c5a2b69116de5f685c9526849ef8110570f6c74b63dc9f1
    Image:         consul:1.8.5
    Image ID:      docker-pullable://consul@sha256:b85322aa8c65355341dd81b5e95d5c0e8468e6419724d4e8a125198d40426a30
    Ports:         8500/TCP, 8501/TCP, 8301/TCP, 8302/TCP, 8300/TCP, 8600/TCP, 8600/UDP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/UDP
    Command:
      /bin/sh
      -ec
      CONSUL_FULLNAME="hashicorp-consul-consul"

      exec /bin/consul agent \
        -advertise="${POD_IP}" \
        -bind=0.0.0.0 \
        -bootstrap-expect=3 \
        -hcl='ca_file = "/consul/tls/ca/tls.crt"' \
        -hcl='cert_file = "/consul/tls/server/tls.crt"' \
        -hcl='key_file = "/consul/tls/server/tls.key"' \
        -hcl='ports { https = 8501 }' \
        -client=0.0.0.0 \
        -config-dir=/consul/config \
        -datacenter=dc1 \
        -data-dir=/consul/data \
        -domain=consul \
        -encrypt="${GOSSIP_KEY}" \
        -hcl="connect { enabled = true }" \
        -ui \
        -retry-join=${CONSUL_FULLNAME}-server-0.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
        -retry-join=${CONSUL_FULLNAME}-server-1.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
        -retry-join=${CONSUL_FULLNAME}-server-2.${CONSUL_FULLNAME}-server.${NAMESPACE}.svc \
        -server

    State:          Running
      Started:      Mon, 16 Nov 2020 13:29:12 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  100Mi
    Requests:
      cpu:      100m
      memory:   100Mi
    Readiness:  exec [/bin/sh -ec curl \
  --cacert /consul/tls/ca/tls.crt \
  https://127.0.0.1:8501/v1/status/leader \
2>/dev/null | grep -E '".+"'
] delay=5s timeout=5s period=3s #success=1 #failure=2
    Environment:
      POD_IP:             (v1:status.podIP)
      NAMESPACE:         consul (v1:metadata.namespace)
      GOSSIP_KEY:        <set to the key 'key' in secret 'consul-gossip-encryption-key'>  Optional: false
      CONSUL_HTTP_ADDR:  https://localhost:8501
      CONSUL_CACERT:     /consul/tls/ca/tls.crt
    Mounts:
      /consul/config from config (rw)
      /consul/data from data-consul (rw)
      /consul/tls/ca/ from consul-ca-cert (ro)
      /consul/tls/server from consul-server-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from hashicorp-consul-consul-server-token-z9jzx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  data-consul:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-consul-hashicorp-consul-consul-server-0
    ReadOnly:   false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      hashicorp-consul-consul-server-config
    Optional:  false
  consul-ca-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hashicorp-consul-consul-ca-cert
    Optional:    false
  consul-server-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hashicorp-consul-consul-server-cert
    Optional:    false
  hashicorp-consul-consul-server-token-z9jzx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hashicorp-consul-consul-server-token-z9jzx
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  product=consul
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                 From                                   Message
  ----     ------                  ----                ----                                   -------
  Normal   Scheduled               39m                 default-scheduler                      Successfully assigned consul/hashicorp-consul-consul-server-0 to ip-10-211-1-106.ec2.internal
  Normal   SuccessfulAttachVolume  39m                 attachdetach-controller                AttachVolume.Attach succeeded for volume "pvc-5fba74c6-3388-4a35-84d1-e065d725761c"
  Normal   Pulled                  39m                 kubelet, ip-10-211-1-106.ec2.internal  Container image "consul:1.8.5" already present on machine
  Normal   Created                 39m                 kubelet, ip-10-211-1-106.ec2.internal  Created container consul
  Normal   Started                 38m                 kubelet, ip-10-211-1-106.ec2.internal  Started container consul
  Warning  Unhealthy               37m (x22 over 38m)  kubelet, ip-10-211-1-106.ec2.internal  Readiness probe failed:
  Warning  Unhealthy               28s (x5 over 25m)   kubelet, ip-10-211-1-106.ec2.internal  Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The health check run from inside the server itself works:

curl --cacert /consul/tls/ca/tls.crt https://127.0.0.1:8501/v1/status/leader 2>/dev/null | grep -E '".+"'
"10.211.1.11:8300"
kschoche (Contributor) commented Nov 16, 2020

Hi @yevgeniyo, thank you for submitting this issue.

The readiness probe for the server checks that the servers have elected a leader, which implies that quorum has been attained. It is possible that the servers are unable to reach each other and therefore have not been able to elect a leader. You could verify this by running, e.g., consul members on each server and checking that they all show the same output. This could be caused by a misconfiguration or by networking issues.
It would also be helpful if you could attach the custom values file that you passed to helm when deploying, so we can try to rule out any misconfiguration.

Thanks!
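
For reference, one way to run that check across all three servers without an interactive shell; this loop is just a sketch, with the pod and namespace names taken from the kubectl get pod output above:

# Each run should list all three servers (and the client agents) as "alive",
# with identical output on every server.
for i in 0 1 2; do
  kubectl exec -n consul "hashicorp-consul-consul-server-$i" -c consul -- consul members
done

The consul CLI inside the container already has CONSUL_HTTP_ADDR and CONSUL_CACERT set (see the environment in the describe output above), so no extra TLS flags should be needed.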

@kschoche kschoche added the waiting-on-response Waiting on the issue creator for a response before taking further action label Nov 16, 2020

yevgeniyo commented Nov 17, 2020

Hi @kschoche, thank you for your reply.

Here is my file with values:
consul_values.yaml.txt

Steps:

  1. I ran "consul members" and every member showed as alive.
  2. Next I wanted to remove one pod, so I ran "consul leave" manually (it doesn't matter whether I do it or the preStop hook does), and the container was stopped automatically:
     OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown
     command terminated with exit code 126

     but the pod is still in the "Running" state.
  3. Now I want to delete this pod (without any force):
     kubectl delete pod hashicorp-consul-consul-server-0
     It stays stuck for at least 40-50 minutes, while the readiness probe keeps trying to run the health check:
     Warning  Unhealthy          2m2s (x61 over 4m57s)  kubelet, ip-10-211-1-100.ec2.internal  Readiness probe failed: OCI runtime exec failed: exec failed: cannot exec a container that has stopped: unknown

I think the pod can't be removed because of the preStop hook and the readiness probe together; the readiness probe shouldn't need to keep checking the service once the preStop hook has run. I also found this Kubernetes bug:
kubernetes/kubernetes#51835 (comment)
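
For context, the interaction described here looks roughly like this in the server pod spec. This is a paraphrased sketch based on the kubectl describe output above and the preStop behaviour mentioned in this thread, not the literal chart template:

# Sketch of the relevant parts of the server pod spec (paraphrased, not verbatim).
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-ec", "consul leave"]   # gracefully leaves the cluster and stops the agent
readinessProbe:
  exec:
    command:
      - /bin/sh
      - -ec
      - |
        curl --cacert /consul/tls/ca/tls.crt https://127.0.0.1:8501/v1/status/leader 2>/dev/null | grep -E '".+"'
  initialDelaySeconds: 5
  timeoutSeconds: 5
  periodSeconds: 3
  failureThreshold: 2
# Once "consul leave" has stopped the agent (the container's main process),
# every subsequent exec-based readiness check fails with
# "cannot exec a container that has stopped", which is the error shown above.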

BTW, once I manually removed the readiness health check, everything works smoothly, but I understand that running without a readiness probe is the wrong approach.

Please advise

yevgeniyo (Author) commented:

@kschoche I found my issue: it was caused by the gossip key length, I had used a 16-byte key instead of a 32-byte one.
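
For anyone hitting the same problem, the simplest way to get a key of the expected length is to let Consul generate it. A minimal sketch, assuming the secret name and key referenced in the pod's environment above (consul-gossip-encryption-key / key) and that the secret can simply be recreated before redeploying:

# Generate a gossip encryption key of the correct length
# (recent versions of "consul keygen" emit a base64-encoded 32-byte key).
KEY="$(consul keygen)"

# Recreate the secret the chart reads the key from
# (secret name and key name taken from the pod spec above).
kubectl -n consul delete secret consul-gossip-encryption-key
kubectl -n consul create secret generic consul-gossip-encryption-key --from-literal=key="$KEY"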
