
[Bug]: Volume snapshots WAL file Issues when spinning up replicas #4488

Open

ajrpayne opened this issue May 8, 2024 · 2 comments


ajrpayne commented May 8, 2024

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I am running a supported version of CloudNativePG.

Contact Details

No response

Version

1.23.0

What version of Kubernetes are you using?

1.27

What is your Kubernetes environment?

Cloud: Azure AKS

How did you install the operator?

YAML manifest

What happened?

Using CNPG 1.23.1 with the Postgres image 13.14-18. To create a replica (increasing the instance count by 1), I first create an online volume snapshot. Taking an online volume snapshot appears to cause an issue with the WAL files: the replica spins up from the volume snapshot, but then gets stuck processing the WAL file created after the backup finished.
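For context, the reproduction boils down to two steps against the manifests below. The commands are only a sketch of the workflow rather than the exact invocations used; the resource names come from the manifests, and the file name is hypothetical:

# 1. Take an online volume snapshot backup of the single-instance cluster
kubectl apply -f backup-b-test2-99.yaml   # hypothetical file holding the Backup manifest below

# 2. Once the backup completes, scale the cluster from 1 to 2 instances so the
#    new replica is seeded from the volume snapshot
kubectl patch cluster cnpg-test2 --type merge -p '{"spec":{"instances":2}}'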

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cnpg-test2
spec:
  imageName: cnpg:13.14-18
  imagePullPolicy: IfNotPresent
  instances: 1 # <- increased this to 2.

  postgresql:
    parameters:
      max_connections: "1000"
      shared_buffers: "512MB"
      effective_cache_size: "1536MB"
      maintenance_work_mem: "128MB"
      checkpoint_completion_target: "0.9"
      wal_buffers: "16MB"
      default_statistics_target: "100"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      work_mem: "262kB"
      huge_pages: "off"
      min_wal_size: "2GB"
      max_wal_size: "8GB"
      max_slot_wal_keep_size: "40GB"

  storage:
    storageClass: managed-csi
    size: 64Gi
  walStorage:
    storageClass: managed-csi
    size: 64Gi

  startDelay: 3600
  smartShutdownTimeout: 1800
  stopDelay: 3600
  switchoverDelay: 7200

  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi
      hugepages-2Mi: 768Mi

  primaryUpdateStrategy: unsupervised
  primaryUpdateMethod: restart

  backup:
    volumeSnapshot:
      className: csi-azuredisk-vsc
      walClassName: csi-azuredisk-vsc
    barmanObjectStore:
      s3Credentials:
        accessKeyId:
          name: cnpg-test-s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-test-s3-creds
          key: SECRET_ACCESS_KEY
      endpointURL: "https://s3.bucket.com"
      destinationPath: "s3://cnpg-test/"
      serverName: cnpg-test2
      wal:
        compression: snappy
        encryption: AES256
        maxParallel: 4
      data:
        compression: snappy
        encryption: AES256
        jobs: 4
    retentionPolicy: "30d"
    target: "prefer-standby"

  logLevel: info

  managed:
    roles:
      - name: app
        ensure: present
        passwordSecret:
          name: cnpg-test-app
        connectionLimit: -1
        inherit: true
        createdb: true
        login: true
      - name: user2
        ensure: present
        passwordSecret:
          name: cnpg-test-user2
        connectionLimit: -1
        inherit: true
        superuser: true
        createdb: true
        createrole: true
        login: true
      - name: user3
        ensure: present
        passwordSecret:
          name: cnpg-test-user3
        connectionLimit: -1
        inRoles:
          - pg_monitor
          - pg_read_all_stats
        inherit: true
        login: true
      - name: user4
        ensure: present
        passwordSecret:
          name: cnpg-test-user4
        connectionLimit: -1
        inherit: true
        login: true

  bootstrap:
    recovery:
      database: app
      owner: app
      secret:
        name: cnpg-test-app
      source: cnpg-test

  externalClusters:
    - name: cnpg-test
      barmanObjectStore:
        s3Credentials:
          accessKeyId:
            name: cnpg-test-s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-test-s3-creds
            key: SECRET_ACCESS_KEY
        endpointURL: "https://s3.bucket.com"
        destinationPath: "s3://cnpg-test/"
        serverName: cnpg-test
        wal:
          maxParallel: 4
---
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: b-test2-99
spec:
  cluster:
    name: cnpg-test2
  target: primary
  online: true
  onlineConfiguration:
    immediateCheckpoint: false
    waitForArchive: true
  method: volumeSnapshot

Relevant log output

Restored WAL file","logging_pod":"cnpg-test2-2","walName":"000000060000027D00000033"
Set end-of-wal-stream flag as one of the WAL files to be prefetched was not found
"WAL restore command completed (parallel)","logging_pod":"cnpg-test2-2","walName":"000000060000027D00000033"
restored log file \"000000060000027D00000033\
invalid resource manager ID 88 at 27D/CC0000A0
invalid resource manager ID 88 at 27D/CC0000A0
terminating walreceiver process due to administrator command
end-of-wal-stream flag found. Exiting with error once to let Postgres try switching to streaming replication
*repeats forever*
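For anyone triaging, the stuck replica can be inspected with something along these lines (a sketch: the cnpg kubectl plugin is assumed to be installed, and the pod name is taken from the log lines above):

# Overall cluster and replication status (requires the cnpg kubectl plugin)
kubectl cnpg status cnpg-test2

# Recovery progress inside the stuck replica pod named in the logs
kubectl exec -ti cnpg-test2-2 -- psql -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"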

Code of Conduct

  • I agree to follow this project's Code of Conduct
@ajrpayne ajrpayne added the triage Pending triage label May 8, 2024

ajrpayne commented May 8, 2024

If I use barman for backups, the replica spins up using pg_basebackup and has no issues.

gbartolini (Contributor) commented

Remember that this is a standby, so the same PostgreSQL limitations apply here (see https://cloudnative-pg.io/documentation/current/backup/#backup-from-a-standby, in particular the blue box).

Is this server under workload? Have you tried running the backup a few minutes after the server is up? My fear is that you don't yet have the WAL file containing the checkpoint coming from the primary when you run that backup and create the replica.

Can you please share the backup and the volume snapshot resources too? Thanks.
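For reference, the requested resources can be gathered with something like the following (a sketch; the label selector on the snapshots is an assumption):

# Backup resource named in this report
kubectl get backup b-test2-99 -o yaml

# Volume snapshots created for the cluster (label selector assumed)
kubectl get volumesnapshot -l cnpg.io/cluster=cnpg-test2 -o yaml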
