
[Bug]: "pod not available", "Cannot extract Pod status", "WAL file not found in the recovery object store" #4551

Open
ngoeddel-openi opened this issue May 16, 2024 · 2 comments
Labels: triage (Pending triage)

@ngoeddel-openi

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I have read the troubleshooting guide and I think this is a new bug.

Contact Details

No response

Version

older in 1.21.x

What version of Kubernetes are you using?

1.29

What is your Kubernetes environment?

Self-managed: RKE

How did you install the operator?

Helm

What happened?

We have a Kubernetes cluster with 6 worker nodes, distributed across 3 zones with 2 nodes each.
The CNPG cluster consists of 3 instances and a pooler, created with the CNPG operator.
I then forcefully shut down the worker node hosting the primary instance of the Postgres cluster to see whether it heals itself properly.
Within seconds another instance was promoted to primary, and it seems to work fine.
But I assumed that after a while the operator would spawn a new standby instance on the other worker in the same zone, which is still running. That never happened, even after waiting for 20 minutes. So the cluster never became healthy again on its own.

I then started the node that I had shut down before. After a while the missing Postgres instance pod was created again, but now it is stuck: kubectl cnpg status shows pod not available, while the pod logs show that it has restored itself from the archived WAL files in the S3 bucket, except that it tries to restore a file that is newer than what the backup status reports. But see for yourself:

Cluster status:

$[1] kubectl cnpg -n middleware-svc-keycloak status keycloak-db-pgcluster
Cluster Summary
Name:                keycloak-db-pgcluster
Namespace:           middleware-svc-keycloak
System ID:           7343941203355570200
PostgreSQL Image:    ghcr.io/cloudnative-pg/postgresql:15.3
Primary instance:    keycloak-db-pgcluster-2
Primary start time:  2024-05-15 15:35:18 +0000 UTC (uptime 19h26m11s)
Status:              Failing over Failing over from keycloak-db-pgcluster-5 to keycloak-db-pgcluster-2
Instances:           3
Ready instances:     2
Current Write LSN:   0/3A000000 (Timeline: 13 - WAL File: 0000000D0000000000000039)

Certificates Status
Certificate Name                   Expiration Date                Days Left Until Expiration
----------------                   ---------------                --------------------------
keycloak-db-pgcluster-ca           2024-06-06 10:37:14 +0000 UTC  20.98
keycloak-db-pgcluster-replication  2024-06-06 10:37:14 +0000 UTC  20.98
keycloak-db-pgcluster-server       2024-06-06 10:37:14 +0000 UTC  20.98

Continuous Backup status
First Point of Recoverability:  2024-05-15T12:55:22Z
Working WAL archiving:          OK
WALs waiting to be archived:    0
Last Archived WAL:              0000000D0000000000000039   @   2024-05-15T15:45:20.25018Z
Last Failed WAL:                00000008.history           @   2024-05-15T12:22:29.298705Z

Physical backups
No running physical backups found

Streaming Replication status
Name                     Sent LSN    Write LSN   Flush LSN   Replay LSN  Write Lag  Flush Lag  Replay Lag  State      Sync State  Sync Priority
----                     --------    ---------   ---------   ----------  ---------  ---------  ----------  -----      ----------  -------------
keycloak-db-pgcluster-5  0/3A000000  0/3A000000  0/3A000000  0/3A000000  00:00:00   00:00:00   00:00:00    streaming  async       0
keycloak-db-pgcluster-6  0/3A000000  0/3A000000  0/3A000000  0/3A000000  00:00:00   00:00:00   00:00:00    streaming  quorum      1

Unmanaged Replication Slot Status
No unmanaged replication slots found

Managed roles status
No roles managed

Tablespaces status
No managed tablespaces

Pod Disruption Budgets status
Name                           Role     Expected Pods  Current Healthy  Minimum Desired Healthy  Disruptions Allowed
----                           ----     -------------  ---------------  -----------------------  -------------------
keycloak-db-pgcluster          replica  2              1                1                        0
keycloak-db-pgcluster-primary  primary  1              1                1                        0

Instances status
Name                     Database Size  Current LSN  Replication role  Status             QoS        Manager Version  Node
----                     -------------  -----------  ----------------  ------             ---        ---------------  ----
keycloak-db-pgcluster-2  34 MB          0/3A000000   Primary           OK                 Burstable  1.20.2           project-d-kubw13p
keycloak-db-pgcluster-6  34 MB          0/3A000000   Standby (sync)    OK                 Burstable  1.20.2           project-d-kubw14p
keycloak-db-pgcluster-5  -              -            -                 pod not available  Burstable  -                project-d-kubw12p

As you can see, it still thinks it is in the failing-over state, although the primary has been ready again for 20 hours.
You can also see that the latest WAL archive is 0000000D0000000000000039.

Now let's have a look at the pod keycloak-db-pgcluster-5 (I removed a bit of clutter from the logs to make them easier to read):

[
  {
    "ts": "2024-05-16T10:41:33Z",
    "msg": "WAL file not found in the recovery object store",
    "walName": "0000000E.history",
    "options": [
      "--endpoint-url", "https://s3.example.com", "--cloud-provider", "aws-s3", "s3://project-dev/postgres-keycloak-backup", "keycloak-db-pgcluster"
    ]
  },
  {
    "ts": "2024-05-16T10:41:34Z",
    "msg": "Restored WAL file",
    "walName": "0000000D.history"
  },
  {
    "ts": "2024-05-16T10:41:34Z",
    "msg": "WAL restore command completed (parallel)",
    "walName": "0000000D.history",
    "maxParallel": 1,
    "successfulWalRestore": 1,
    "failedWalRestore": 0,
    "endOfWALStream": false
  },
  {
    "ts": "2024-05-16T10:41:36Z",
    "msg": "Restored WAL file",
    "walName": "0000000D0000000000000039"
  },
  {
    "ts": "2024-05-16T10:41:36Z",
    "msg": "WAL restore command completed (parallel)",
    "walName": "0000000D0000000000000039",
    "maxParallel": 1,
    "successfulWalRestore": 1,
    "failedWalRestore": 0,
    "endOfWALStream": false
  },
  {
    "ts": "2024-05-16T10:41:38Z",
    "msg": "Restored WAL file",
    "walName": "0000000D0000000000000038"
  },
  {
    "ts": "2024-05-16T10:41:38Z",
    "msg": "WAL restore command completed (parallel)",
    "walName": "0000000D0000000000000038",
    "maxParallel": 1,
    "successfulWalRestore": 1,
    "failedWalRestore": 0,
    "endOfWALStream": false
  },
  {
    "ts": "2024-05-16T10:41:40Z",
    "msg": "Restored WAL file",
    "walName": "0000000D0000000000000039"
  },
  {
    "ts": "2024-05-16T10:41:40Z",
    "msg": "WAL restore command completed (parallel)",
    "walName": "0000000D0000000000000039",
    "maxParallel": 1,
    "successfulWalRestore": 1,
    "failedWalRestore": 0,
    "endOfWALStream": false
  },
  {
    "ts": "2024-05-16T10:41:42Z",
    "msg": "WAL file not found in the recovery object store",
    "walName": "0000000D000000000000003A",
    "options": [
      "--endpoint-url", "https://s3.example.com", "--cloud-provider", "aws-s3", "s3://project-dev/postgres-keycloak-backup", "keycloak-db-pgcluster"
    ]
  }
]
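
For reference, the entries above come from the instance pod's own log; something along these lines reproduces them (the jq filter is only an illustration of dropping fields, not the exact one I used):

$ kubectl logs -n middleware-svc-keycloak keycloak-db-pgcluster-5 | jq -c 'del(.logger, .logging_pod)'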

Why does it try to read 0000000D000000000000003A? And why is endOfWALStream set to false on the last WAL file when it should be true? Or am I misinterpreting something here?

And here is the log of the postgres operator, which repeats every 2 seconds:

{
  "ts": "2024-05-16T12:03:38Z",
  "msg": "Cannot extract Pod status",
  "controller": "cluster",
  "controllerGroup": "postgresql.cnpg.io",
  "controllerKind": "Cluster",
  "Cluster": {
    "name": "keycloak-db-pgcluster",
    "namespace": "middleware-svc-keycloak"
  },
  "namespace": "middleware-svc-keycloak",
  "name": "keycloak-db-pgcluster",
  "name": "keycloak-db-pgcluster-5",
  "error": "error status code: 500, body: failed to connect to `host=/controller/run user=postgres database=postgres`: server error (FATAL: the database system is not yet accepting connections (SQLSTATE 57P03))\n"
}
{
  "level": "info",
  "ts": "2024-05-16T12:03:39Z",
  "msg": "pooler not automatically configured, manual configuration required",
  "cluster": "keycloak-db-pgcluster",
  "pooler": "keycloak-db-pgcluster-pooler-rw"
}
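
These operator entries can be tailed with something like the following; treat the namespace and deployment name as placeholders, since they depend on how the Helm chart was installed:

$ kubectl logs -n cnpg-system deployment/cnpg-cloudnative-pg -f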

We experience this problem quite often when we do maintenance on the Kubernetes nodes and need to restart them. It has already destroyed multiple clusters. The only way to fix it is to destroy the instance with kubectl cnpg destroy keycloak-db-pgcluster 5 and wait for a fresh instance to come up, roughly as sketched below. Afterwards we can delete the PVC and PV of the old instance, since it makes no sense to keep them.
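
The whole workaround boils down to something like this (the PV name is a placeholder; since the storage class retains released volumes, the PV has to be removed by hand as well):

$ kubectl cnpg destroy -n middleware-svc-keycloak keycloak-db-pgcluster 5
$ kubectl delete pvc -n middleware-svc-keycloak keycloak-db-pgcluster-5
$ kubectl delete pv <released-pv-name>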

We want to upgrade to the latest version soon, but after finding this recent bug I think this can still happen in newer versions: #4412

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  annotations:
    cnpg.io/reconciliationLoop: enabled
    meta.helm.sh/release-name: keycloak-db
    meta.helm.sh/release-namespace: middleware-svc-keycloak
  generation: 3
  labels:
    app.kubernetes.io/instance: keycloak-db
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: pgcluster
    app.kubernetes.io/version: '15.1'
    helm.sh/chart: pgcluster-0.1.14
    helm.toolkit.fluxcd.io/name: keycloak-db
    helm.toolkit.fluxcd.io/namespace: middleware-svc-keycloak
    k8slens-edit-resource-version: v1
  name: keycloak-db-pgcluster
  namespace: middleware-svc-keycloak
spec:
  affinity:
    podAntiAffinityType: required
    topologyKey: topology.kubernetes.io/zone
  backup:
    barmanObjectStore:
      destinationPath: s3://project-dev/postgres-keycloak-backup
      endpointURL: https://s3.example.com
      s3Credentials:
        accessKeyId:
          key: ACCESS_KEY_ID
          name: keycloak-backup-s3
        inheritFromIAMRole: false
        secretAccessKey:
          key: ACCESS_SECRET_KEY
          name: keycloak-backup-s3
      wal:
        compression: gzip
    retentionPolicy: 30d
    target: prefer-standby
  bootstrap:
    initdb:
      database: keycloak
      encoding: UTF8
      localeCType: C
      localeCollate: C
      owner: keycloak
      secret:
        name: postgresql-keycloak-password
  description: Postgres general configuration
  enableSuperuserAccess: true
  failoverDelay: 0
  imageName: ghcr.io/cloudnative-pg/postgresql:15.3
  imagePullPolicy: IfNotPresent
  inheritedMetadata:
    labels:
      app.kubernetes.io/instance: keycloak-db
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: pgcluster
      app.kubernetes.io/version: '15.1'
      helm.sh/chart: pgcluster-0.1.14
  instances: 3
  logLevel: warning
  maxSyncReplicas: 2
  minSyncReplicas: 1
  monitoring:
    customQueriesConfigMap:
      - key: queries
        name: cnpg-default-monitoring
    disableDefaultQueries: false
    enablePodMonitor: true
  nodeMaintenanceWindow:
    inProgress: false
    reusePVC: false
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: 'on'
      archive_timeout: 5min
      auto_explain.log_min_duration: 10s
      dynamic_shared_memory_type: posix
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_rotation_age: '0'
      log_rotation_size: '0'
      log_truncate_on_rotation: 'false'
      logging_collector: 'on'
      max_parallel_workers: '32'
      max_replication_slots: '32'
      max_worker_processes: '32'
      pg_stat_statements.max: '10000'
      pg_stat_statements.track: all
      shared_buffers: 256MB
      shared_memory_type: mmap
      shared_preload_libraries: ''
      wal_keep_size: 512MB
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: restart
  primaryUpdateStrategy: unsupervised
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
  startDelay: 300
  stopDelay: 300
  storage:
    resizeInUseVolumes: true
    size: 1Gi
    storageClass: longhorn-2-repl-retain
  switchoverDelay: 2

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

@ngoeddel-openi (Author)

I fixed the unhealthy state of the cluster by destroying the instance for now. The operator created a new pod keycloak-db-pgcluster-7 and a temporary join job, and after a minute the cluster became healthy again.

However, that should not be the solution to this kind of error. The operator should be able to heal the cluster if one worker node goes down for whatever reason. I still hope it is just a misconfiguration, but I am quite sure I did nothing wrong here.

@ngoeddel-openi (Author)

Before I destroyed keycloak-db-pgcluster-5, I had a look at its filesystem, especially the folder /var/lib/postgresql/data/pgdata/pg_wal:

postgres@keycloak-db-pgcluster-5:~/data/pgdata/pg_wal$ ls -la
total 606256
drwxrws---  3 postgres tape     4096 May 16 10:41 .
drwx------ 19 postgres tape     4096 May 16 10:41 ..
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000005000000000000001A
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000005000000000000001B
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000005000000000000001C
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000005000000000000001D
-rw-rw----  1 postgres tape      169 May 15 16:00 00000005.history
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000006000000000000001E
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000006000000000000001F
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000060000000000000020
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000060000000000000021
-rw-rw----  1 postgres tape      212 May 15 16:00 00000006.history
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000070000000000000022
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000070000000000000023
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000070000000000000024
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000070000000000000025
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000070000000000000026
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000070000000000000027
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000070000000000000028
-rw-rw----  1 postgres tape 16777216 May 15 16:00 000000070000000000000029
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000007000000000000002A
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000007000000000000002B.partial
-rw-rw----  1 postgres tape      255 May 15 16:00 00000007.history
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000008000000000000002B
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000008000000000000002C
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000008000000000000002D
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000008000000000000002E
-rw-rw----  1 postgres tape      298 May 15 16:00 00000008.history
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000009000000000000002E
-rw-rw----  1 postgres tape 16777216 May 15 16:00 00000009000000000000002F
-rw-rw----  1 postgres tape      341 May 15 16:00 00000009.history
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000A0000000000000030.partial
-rw-rw----  1 postgres tape      384 May 15 16:00 0000000A.history
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000B0000000000000030
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000B0000000000000031
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000B0000000000000032
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000B0000000000000033
-rw-rw----  1 postgres tape      428 May 15 16:00 0000000B.history
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000C0000000000000033
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000C0000000000000034
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000C0000000000000035
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000C0000000000000036
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000C0000000000000037
-rw-rw----  1 postgres tape      472 May 15 16:00 0000000C.history
-rw-------  1 postgres tape 16777216 May 16 10:41 0000000D0000000000000038
-rw-------  1 postgres tape 16777216 May 16 10:41 0000000D0000000000000039
-rw-rw----  1 postgres tape 16777216 May 15 16:00 0000000D000000000000003A
-rw-------  1 postgres tape      516 May 16 10:41 0000000D.history
drwxrws---  2 postgres tape     4096 May 15 16:00 archive_status

I found it kind of weird that the file 0000000D000000000000003A exists here but not on the other instances. Can this be a remnant from when this pod was still the primary instance, before the sudden shutdown of the worker node?
I deleted that file manually and restarted the pod, but it did not solve the issue. It was still a bit odd to see a file there that does not exist in the S3 backup.
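
For completeness, the manual attempt boiled down to something like this (which, as said, did not help):

$ kubectl exec -n middleware-svc-keycloak keycloak-db-pgcluster-5 -- rm /var/lib/postgresql/data/pgdata/pg_wal/0000000D000000000000003A
$ kubectl delete pod -n middleware-svc-keycloak keycloak-db-pgcluster-5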
