Readiness probe failed: node id file does not exist - pod is not yet bootstrapped #177

Open · pinkeshr opened this issue Jun 22, 2021 · 19 comments

@pinkeshr
Hi, I am trying to set up a Redis Enterprise cluster on GKE using the operator, but it fails with the error: Readiness probe failed: node id file does not exist - pod is not yet bootstrapped.

I have a GKE cluster up and running with a node pool of 6 nodes of machine type n1-standard-8.

Steps to reproduce:

  1. Created the Redis operator successfully: kubectl apply -f bundle.yaml
  2. Created a RedisEnterpriseCluster with kubectl apply -f rec.yaml; it fails.
    These are the event logs from kubectl describe pod redis-enterprise-0:
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Warning  FailedScheduling        70s (x2 over 70s)  default-scheduler        0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled               68s                default-scheduler        Successfully assigned default/redis-enterprise-0 to gke-redis-cluster-larger-pool-81081033-smfc
  Normal   SuccessfulAttachVolume  63s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-fb06fa95-f378-4e49-a0f0-3a41e67404be"
  Normal   Pulling                 57s                kubelet                  Pulling image "redislabs/redis:6.0.20-69"
  Normal   Pulled                  39s                kubelet                  Successfully pulled image "redislabs/redis:6.0.20-69" in 18.648849889s
  Normal   Created                 25s                kubelet                  Created container redis-enterprise-node
  Normal   Started                 25s                kubelet                  Started container redis-enterprise-node
  Normal   Pulling                 25s                kubelet                  Pulling image "redislabs/operator:6.0.20-4"
  Normal   Pulled                  21s                kubelet                  Successfully pulled image "redislabs/operator:6.0.20-4" in 4.226643732s
  Normal   Created                 18s                kubelet                  Created container bootstrapper
  Normal   Started                 18s                kubelet                  Started container bootstrapper
  Warning  Unhealthy               7s                 kubelet                  Readiness probe failed: node id file does not exist - pod is not yet bootstrapped
pod is not yet bootstrapped

My rec.yaml file looks like this:

apiVersion: app.redislabs.com/v1alpha1
kind: RedisEnterpriseCluster
metadata:
  name: "redis-enterprise"
spec:
  # Add fields here
  nodes: 3
  uiServiceType: LoadBalancer
  redisEnterpriseNodeResources:
    limits:
      cpu: 250m
      memory: 500Mi
    requests:
      cpu: 250m
      memory: 500Mi

I have tried different CPU limits but am still facing the same error.
Please let me know if I am doing something wrong here.
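
A first diagnostic step (used later in this thread) is to read the logs of the two containers named in the events above, since the bootstrapper container is what the readiness probe is waiting on; a sketch using this report's pod name:

kubectl logs redis-enterprise-0 -c bootstrapper
kubectl logs redis-enterprise-0 -c redis-enterprise-node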

@laurentdroin (Contributor)

Hi,

What does kubectl get sc -o yaml return?

@pinkeshr (Author)

It returns this:

apiVersion: v1
items:
- allowVolumeExpansion: true
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    annotations:
      components.gke.io/component-name: pdcsi
      components.gke.io/component-version: 0.9.6
      components.gke.io/layer: addon
    creationTimestamp: "2021-06-22T09:36:20Z"
    labels:
      addonmanager.kubernetes.io/mode: EnsureExists
      k8s-app: gcp-compute-persistent-disk-csi-driver
    managedFields:
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:allowVolumeExpansion: {}
        f:metadata:
          f:annotations:
            .: {}
            f:components.gke.io/component-name: {}
            f:components.gke.io/component-version: {}
            f:components.gke.io/layer: {}
          f:labels:
            .: {}
            f:addonmanager.kubernetes.io/mode: {}
            f:k8s-app: {}
        f:parameters:
          .: {}
          f:type: {}
        f:provisioner: {}
        f:reclaimPolicy: {}
        f:volumeBindingMode: {}
      manager: kubectl
      operation: Update
      time: "2021-06-22T09:36:20Z"
    name: premium-rwo
    resourceVersion: "307"
    selfLink: /apis/storage.k8s.io/v1/storageclasses/premium-rwo
    uid: a32b3136-b3ce-4014-a283-4aa2ce550375
  parameters:
    type: pd-ssd
  provisioner: pd.csi.storage.gke.io
  reclaimPolicy: Delete
  volumeBindingMode: WaitForFirstConsumer
- allowVolumeExpansion: true
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    annotations:
      storageclass.kubernetes.io/is-default-class: "true"
    creationTimestamp: "2021-06-22T09:36:20Z"
    labels:
      addonmanager.kubernetes.io/mode: EnsureExists
    managedFields:
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:allowVolumeExpansion: {}
        f:metadata:
          f:annotations:
            .: {}
            f:storageclass.kubernetes.io/is-default-class: {}
          f:labels:
            .: {}
            f:addonmanager.kubernetes.io/mode: {}
        f:parameters:
          .: {}
          f:type: {}
        f:provisioner: {}
        f:reclaimPolicy: {}
        f:volumeBindingMode: {}
      manager: kubectl
      operation: Update
      time: "2021-06-22T09:36:20Z"
    name: standard
    resourceVersion: "314"
    selfLink: /apis/storage.k8s.io/v1/storageclasses/standard
    uid: 460ef44f-a9a1-4e9e-a0f6-890a5342ed97
  parameters:
    type: pd-standard
  provisioner: kubernetes.io/gce-pd
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
- allowVolumeExpansion: true
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    annotations:
      components.gke.io/layer: addon
      storageclass.kubernetes.io/is-default-class: "false"
    creationTimestamp: "2021-06-22T09:36:20Z"
    labels:
      addonmanager.kubernetes.io/mode: EnsureExists
      k8s-app: gcp-compute-persistent-disk-csi-driver
    managedFields:
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:allowVolumeExpansion: {}
        f:metadata:
          f:annotations:
            .: {}
            f:components.gke.io/layer: {}
            f:storageclass.kubernetes.io/is-default-class: {}
          f:labels:
            .: {}
            f:addonmanager.kubernetes.io/mode: {}
            f:k8s-app: {}
        f:parameters:
          .: {}
          f:type: {}
        f:provisioner: {}
        f:reclaimPolicy: {}
        f:volumeBindingMode: {}
      manager: kubectl
      operation: Update
      time: "2021-06-22T09:36:20Z"
    name: standard-rwo
    resourceVersion: "308"
    selfLink: /apis/storage.k8s.io/v1/storageclasses/standard-rwo
    uid: 2e15ed05-7b2e-4119-8c98-c75dd10c180c
  parameters:
    type: pd-balanced
  provisioner: pd.csi.storage.gke.io
  reclaimPolicy: Delete
  volumeBindingMode: WaitForFirstConsumer
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

@laurentdroin (Contributor)

Thank you. What is the output of the following two commands:
kubectl describe pvc
kubectl describe pv

@pinkeshr (Author)

Output of kubectl describe pvc:

Name:          redis-enterprise-storage-rec-0
Namespace:     default
StorageClass:  standard
Status:        Bound
Volume:        pvc-c74db61a-5fea-4215-8074-501acae47c77
Labels:        app=redis-enterprise
               redis.io/cluster=rec
               redis.io/role=node
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      20Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       <none>
Events:        <none>


Name:          redis-enterprise-storage-recl-0
Namespace:     default
StorageClass:  standard
Status:        Bound
Volume:        pvc-a9a71c6a-8d43-4094-86c1-b17465b8f359
Labels:        app=redis-enterprise
               redis.io/cluster=recl
               redis.io/role=node
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      20Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       <none>
Events:        <none>


Name:          redis-enterprise-storage-redcl-0
Namespace:     default
StorageClass:  standard
Status:        Bound
Volume:        pvc-c652edbd-a5f0-449f-a202-00ba3a5e7b7e
Labels:        app=redis-enterprise
               redis.io/cluster=redcl
               redis.io/role=node
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      3Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       <none>
Events:        <none>


Name:          redis-enterprise-storage-redis-enterprise-0
Namespace:     default
StorageClass:  standard
Status:        Bound
Volume:        pvc-fb06fa95-f378-4e49-a0f0-3a41e67404be
Labels:        app=redis-enterprise
               redis.io/cluster=redis-enterprise
               redis.io/role=node
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/gce-pd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      3Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       redis-enterprise-0
Events:        <none>

Output of kubectl describe pv:

Name:              pvc-a9a71c6a-8d43-4094-86c1-b17465b8f359
Labels:            failure-domain.beta.kubernetes.io/region=us-east1
                   failure-domain.beta.kubernetes.io/zone=us-east1-b
Annotations:       kubernetes.io/createdby: gce-pd-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/gce-pd
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      standard
Status:            Bound
Claim:             default/redis-enterprise-storage-recl-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          20Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-east1-b]
                   failure-domain.beta.kubernetes.io/region in [us-east1]
Message:           
Source:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     gke-redis-cluster-c493-pvc-a9a71c6a-8d43-4094-86c1-b17465b8f359
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>


Name:              pvc-c652edbd-a5f0-449f-a202-00ba3a5e7b7e
Labels:            failure-domain.beta.kubernetes.io/region=us-east1
                   failure-domain.beta.kubernetes.io/zone=us-east1-b
Annotations:       kubernetes.io/createdby: gce-pd-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/gce-pd
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      standard
Status:            Bound
Claim:             default/redis-enterprise-storage-redcl-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          3Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-east1-b]
                   failure-domain.beta.kubernetes.io/region in [us-east1]
Message:           
Source:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     gke-redis-cluster-c493-pvc-c652edbd-a5f0-449f-a202-00ba3a5e7b7e
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>


Name:              pvc-c74db61a-5fea-4215-8074-501acae47c77
Labels:            failure-domain.beta.kubernetes.io/region=us-east1
                   failure-domain.beta.kubernetes.io/zone=us-east1-b
Annotations:       kubernetes.io/createdby: gce-pd-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/gce-pd
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      standard
Status:            Bound
Claim:             default/redis-enterprise-storage-rec-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          20Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-east1-b]
                   failure-domain.beta.kubernetes.io/region in [us-east1]
Message:           
Source:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     gke-redis-cluster-c493-pvc-c74db61a-5fea-4215-8074-501acae47c77
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>


Name:              pvc-fb06fa95-f378-4e49-a0f0-3a41e67404be
Labels:            failure-domain.beta.kubernetes.io/region=us-east1
                   failure-domain.beta.kubernetes.io/zone=us-east1-b
Annotations:       kubernetes.io/createdby: gce-pd-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/gce-pd
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      standard
Status:            Bound
Claim:             default/redis-enterprise-storage-redis-enterprise-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          3Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-east1-b]
                   failure-domain.beta.kubernetes.io/region in [us-east1]
Message:           
Source:
    Type:       GCEPersistentDisk (a Persistent Disk resource in Google Compute Engine)
    PDName:     gke-redis-cluster-c493-pvc-fb06fa95-f378-4e49-a0f0-3a41e67404be
    FSType:     ext4
    Partition:  0
    ReadOnly:   false
Events:         <none>

@laurentdroin (Contributor)

Hi,

I would suggest running the log_collector.py script from this project to generate a diagnostic package, then opening a Support Ticket with Redis Labs and uploading the package so that we can analyze it.

Laurent.
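
For anyone following along, a sketch of running the collector (the directory layout and the namespace flag shown here are assumptions, so check python log_collector.py --help for the actual options):

cd redis-enterprise-k8s-docs/log_collector
python log_collector.py -n <rec-namespace>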

@pinkeshr (Author)

Thanks for the help. I have opened a ticket with Redis Labs for the same.

@pescarcena

Hello @pinkeshr,
I have the same issue. Did you solve it?

@laurentdroin (Contributor)

Hi Paul,

This is a very generic message that is always displayed when the pod (node) is not bootstrapped. There can be dozens of reasons for it, so, as above, I'd suggest running the log_collector.py script from this project to generate a diagnostic package and opening a Support Ticket with Redis to upload the package, so that we can understand what is causing this on your cluster and help you with it.

Cheers,

Laurent.

@soroshsabz

ITNOA

I have the same issue:

ssoroosh@master:~/ScalableProductionReadyServiceSample/Deployment/Harbor$ kubectl describe pods harbor-cluster-0
Name:         harbor-cluster-0
Namespace:    default
Priority:     0
Node:         host2/172.21.73.126
Start Time:   Thu, 03 Feb 2022 21:02:07 +0000
Labels:       app=redis-enterprise
              controller-revision-hash=harbor-cluster-7f55579578
              redis.io/cluster=harbor-cluster
              redis.io/role=node
              statefulset.kubernetes.io/pod-name=harbor-cluster-0
Annotations:  <none>
Status:       Running
IP:           10.0.2.228
IPs:
  IP:           10.0.2.228
Controlled By:  StatefulSet/harbor-cluster
Containers:
  redis-enterprise-node:
    Container ID:   docker://9e31eb53ebcebdd61123536e1c5ea6b54c73ac7ff8823bcfd7619a813ca54314
    Image:          redislabs/redis:6.2.8-64
    Image ID:       docker-pullable://redislabs/redis@sha256:9c1015546ee6b99a48d86bd8c762db457c69e3c16f2e950f468ca92629681103
    Ports:          8001/TCP, 8443/TCP, 9443/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 03 Feb 2022 21:02:20 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      1
      memory:   1Gi
    Readiness:  exec [bash -c /opt/redislabs/bin/python /opt/redislabs/mount/health_check.py] delay=0s timeout=30s period=10s #success=1 #failure=3
    Environment:
      CREDENTIAL_TYPE:  kubernetes
    Mounts:
      /opt/redislabs/credentials from credentials (rw)
      /opt/redislabs/mount from health-check-volume (ro)
      /opt/redislabs/shared from shared-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9s98m (ro)
  bootstrapper:
    Container ID:  docker://d2ddfdb3ddd54d6bea8a38698c9cfb8eabbb14455b75aebd55f537950456f1bf
    Image:         redislabs/operator:6.2.8-15
    Image ID:      docker-pullable://redislabs/operator@sha256:0f144922ea1e2d4ea72affb36238258c9f21c39d6ba9ad73da79278dde1eed37
    Port:          8787/TCP
    Host Port:     0/TCP
    Command:
      /usr/local/bin/bootstrapper
    State:          Running
      Started:      Thu, 03 Feb 2022 21:25:32 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 03 Feb 2022 21:19:49 +0000
      Finished:     Thu, 03 Feb 2022 21:25:23 +0000
    Ready:          True
    Restart Count:  4
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:     100m
      memory:  128Mi
    Liveness:  http-get http://:8787/livez delay=300s timeout=15s period=15s #success=1 #failure=3
    Environment:
      NAMESPACE:        default (v1:metadata.namespace)
      POD_NAME:         harbor-cluster-0 (v1:metadata.name)
      REC_NAME:         harbor-cluster
      CREDENTIAL_TYPE:  kubernetes
    Mounts:
      /etc/opt/redislabs/mount/bulletin-board from bulletin-board-volume (rw)
      /opt/redislabs/credentials from credentials (rw)
      /opt/redislabs/shared from shared-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9s98m (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  bulletin-board-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      harbor-cluster-bulletin-board
    Optional:  false
  health-check-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      harbor-cluster-health-check
    Optional:  false
  shared-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  harbor-cluster
    Optional:    false
  kube-api-access-9s98m:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  27m                    default-scheduler  Successfully assigned default/harbor-cluster-0 to host2
  Normal   Pulled     27m                    kubelet            Container image "redislabs/redis:6.2.8-64" already present on machine
  Normal   Created    27m                    kubelet            Created container redis-enterprise-node
  Normal   Started    27m                    kubelet            Started container redis-enterprise-node
  Normal   Pulled     27m                    kubelet            Container image "redislabs/operator:6.2.8-15" already present on machine
  Normal   Created    27m                    kubelet            Created container bootstrapper
  Normal   Started    27m                    kubelet            Started container bootstrapper
  Warning  Unhealthy  2m20s (x100 over 27m)  kubelet            Readiness probe failed: node id file does not exist - pod is not yet bootstrapped
pod is not yet bootstrapped

My cluster.yaml file looks like this:

apiVersion: "app.redislabs.com/v1"
kind: "RedisEnterpriseCluster"
metadata:
  name: "harbor-cluster"
spec:
  nodes: 3
  persistentSpec:
    enabled: false
    storageClassName: "openebs-hostpath"
    # https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/quantity/
    # volumeSize: 100M
  redisEnterpriseNodeResources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 1000m
      memory: 1Gi

I ran kubectl get sc and see the results below:

ssoroosh@master:~/ScalableProductionReadyServiceSample/Deployment/Harbor$ kubectl get sc --all-namespaces
NAME               PROVISIONER        RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
openebs-device     openebs.io/local   Delete          WaitForFirstConsumer   false                  13d
openebs-hostpath   openebs.io/local   Delete          WaitForFirstConsumer   false                  13d

What is my problem?
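
One way to surface the probe's actual failure output is to run the same health check the kubelet runs, taken verbatim from the Readiness line in the describe output above (a diagnostic sketch, using this thread's pod name):

kubectl exec -it harbor-cluster-0 -c redis-enterprise-node -- \
  bash -c "/opt/redislabs/bin/python /opt/redislabs/mount/health_check.py"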

@soroshsabz

@laurentdroin I think my problem comes from the mounts, but I do not know how to resolve it.

@laurentdroin (Contributor)

Hi Soorosh,

I think the problem in your case is the resources. 1 GB of memory is definitely not enough, and the first pod will never be able to create the cluster.
The absolute minimum amount of memory for a dev environment is 4 GB; I have been lucky with 3 GB.

Can you increase the memory to at least 3 GB and let me know if this helped?

Laurent.
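
For reference, a sketch of the resource block with the suggested minimum (the values are illustrative, not a sizing recommendation; consult the Redis Enterprise sizing docs for production):

redisEnterpriseNodeResources:
  limits:
    cpu: 1000m
    memory: 4Gi
  requests:
    cpu: 1000m
    memory: 4Gi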

@soroshsabz

I added some memory to our cluster nodes and then increased the memory in the spec, and my problem is resolved.

@laurentdroin thanks for helping.

@soroshsabz

@laurentdroin Hi again,

After I increased the memory and resolved my previous problem, everything worked properly. Then I powered off all the nodes; after some days, I turned my system on again and saw the following:

 kubectl get pods
NAME                                              READY   STATUS    RESTARTS        AGE
harbor-cluster-0                                  1/2     Running   26 (22s ago)    150m
harbor-cluster-services-rigger-6dcc59d7d8-p6hvn   1/1     Running   4 (137m ago)    24h
redis-enterprise-operator-7f8d8548c5-bj447        2/2     Running   26 (144m ago)   6d20h

As you can see, harbor-cluster-0 is not completely ready, and when I look at the details of this pod, I see the terrible message again:

ssoroosh@master:~$ kubectl describe pod harbor-cluster-0
Name:         harbor-cluster-0
Namespace:    default
Priority:     0
Node:         host2/172.19.34.29
Start Time:   Thu, 10 Feb 2022 17:54:18 +0000
Labels:       app=redis-enterprise
              controller-revision-hash=harbor-cluster-6f5bc897db
              redis.io/cluster=harbor-cluster
              redis.io/role=node
              statefulset.kubernetes.io/pod-name=harbor-cluster-0
Annotations:  <none>
Status:       Running
IP:           10.0.2.236
IPs:
  IP:           10.0.2.236
Controlled By:  StatefulSet/harbor-cluster
Containers:
  redis-enterprise-node:
    Container ID:   docker://5cd18ba3cce456f8af2a834c348a22a5dc7cd9cb2103898963529399464fda8f
    Image:          redislabs/redis:6.2.8-64
    Image ID:       docker-pullable://redislabs/redis@sha256:9c1015546ee6b99a48d86bd8c762db457c69e3c16f2e950f468ca92629681103
    Ports:          8001/TCP, 8443/TCP, 9443/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 10 Feb 2022 18:06:46 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 10 Feb 2022 17:54:25 +0000
      Finished:     Thu, 10 Feb 2022 18:00:03 +0000
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  4Gi
    Requests:
      cpu:      1
      memory:   4Gi
    Readiness:  exec [bash -c /opt/redislabs/bin/python /opt/redislabs/mount/health_check.py] delay=0s timeout=30s period=10s #success=1 #failure=3
    Environment:
      CREDENTIAL_TYPE:  kubernetes
    Mounts:
      /opt/redislabs/credentials from credentials (rw)
      /opt/redislabs/mount from health-check-volume (ro)
      /opt/redislabs/shared from shared-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-twt5s (ro)
  bootstrapper:
    Container ID:  docker://360f7afe3f8a04c2616ae3fa976a9e5f970d98f1f77c0eed2c838ae7d95acce3
    Image:         redislabs/operator:6.2.8-15
    Image ID:      docker-pullable://redislabs/operator@sha256:0f144922ea1e2d4ea72affb36238258c9f21c39d6ba9ad73da79278dde1eed37
    Port:          8787/TCP
    Host Port:     0/TCP
    Command:
      /usr/local/bin/bootstrapper
    State:          Running
      Started:      Thu, 10 Feb 2022 20:30:38 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 10 Feb 2022 20:24:53 +0000
      Finished:     Thu, 10 Feb 2022 20:30:35 +0000
    Ready:          True
    Restart Count:  26
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:     100m
      memory:  128Mi
    Liveness:  http-get http://:8787/livez delay=300s timeout=15s period=15s #success=1 #failure=3
    Environment:
      NAMESPACE:        default (v1:metadata.namespace)
      POD_NAME:         harbor-cluster-0 (v1:metadata.name)
      REC_NAME:         harbor-cluster
      CREDENTIAL_TYPE:  kubernetes
    Mounts:
      /etc/opt/redislabs/mount/bulletin-board from bulletin-board-volume (rw)
      /opt/redislabs/credentials from credentials (rw)
      /opt/redislabs/shared from shared-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-twt5s (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  bulletin-board-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      harbor-cluster-bulletin-board
    Optional:  false
  health-check-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      harbor-cluster-health-check
    Optional:  false
  shared-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  harbor-cluster
    Optional:    false
  kube-api-access-twt5s:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  46s (x1047 over 145m)  kubelet  Readiness probe failed: node id file does not exist - pod is not yet bootstrapped
pod is not yet bootstrapped

And the bootstrapper log looks like this:

ssoroosh@master:~$ kubectl logs harbor-cluster-0 -c bootstrapper
time="2022-02-10T20:30:40Z" level=info msg="REC name: harbor-cluster"
time="2022-02-10T20:30:40Z" level=info msg="Cluster Name: harbor-cluster.default.svc.cluster.local"
time="2022-02-10T20:30:40Z" level=info msg="No rack ID specified"
time="2022-02-10T20:30:45Z" level=info msg="getting bootstrap information from Redis Enterprise API"
time="2022-02-10T20:30:45Z" level=info msg="Redis Enterprise API is accessible, and ready for bootstrap"
time="2022-02-10T20:30:45Z" level=info msg="All pods perform join_cluster"

As you can see, my pods have sufficient memory. How do I resolve my problem?
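
A way to check whether the node ever joined a cluster, and whether quorum exists, is to ask the cluster directly from inside a node container (rladmin is the Redis Enterprise admin CLI; a diagnostic sketch, and it may itself fail on a node that never bootstrapped):

kubectl exec -it harbor-cluster-0 -c redis-enterprise-node -- rladmin status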

@soroshsabz

Related to #214.

@laurentdroin (Contributor)

Hi @soroshsabz,

Yes, as I explained in the other issue you opened (I didn't see that you had opened this new one), this is expected.
After a Redis Enterprise cluster is created, quorum must be maintained at all times. We define quorum as a majority of the nodes.
In a 3-node cluster, you must always have at least 2 nodes up and ready at any given time; your cluster will not survive having 2 or all 3 nodes down at the same time.

If your cluster has lost quorum and is therefore no longer working, you would need to recover it using this procedure: https://docs.redis.com/latest/kubernetes/re-clusters/cluster-recovery/

I hope this helps.

Laurent.
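
For reference, the linked recovery procedure (at the time of this thread) boils down to setting clusterRecovery: true in the REC spec; a sketch, to be verified against the current docs before use:

kubectl patch rec harbor-cluster --type merge --patch '{"spec": {"clusterRecovery": true}}'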

@soroshsabz

After removing the cluster, I was able to create a new, healthy cluster.

Thanks to @laurentdroin

@cschockaert commented Jul 4, 2022

Hello, I have the same problem. @laurentdroin, I understand that we need to maintain the cluster at all times for this to work. But since we are using an Enterprise operator (and we pay for it), can you make the magic happen and automatically resolve this problem in case of an emergency reboot of our K8s nodes?
For example, I got into this state this morning with a 5-node Redis cluster just after a GKE 1.20 to 1.21 migration.
This is not acceptable in production. A cluster should not break this easily...

If we are moving from standalone Redis, Redis/Sentinel, or even Google Memorystore to Redis Enterprise, it's to get more:
- high availability (but here... we got less)
- better scalability (it's OK on this point)

As it stands, we cannot move forward with the product in this state. We are not asking for the moon... just something that works.

@yuvallevy2 (Contributor)

@cschockaert thank you for the feedback!
Question: have you opened a support case for this issue? Engineers from Redis will be happy to get you unblocked.
Thanks

@cschockaert commented Jul 6, 2022

Hello, I'm in touch with the Redis team (@fcerbelle), not directly with support. We are actually using the preemptible VM feature of GKE (https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms): nodes can be killed at any time after 24 hours of life, but cost us 10x less.
I don't know whether quorum was lost because of the GKE upgrade or because of preemption.
For now, the only mitigation for this cluster failure would be to stop using preemptible nodes and/or to add more nodes to the cluster so that quorum is harder to lose.
