
[Bug]: Volume snapshots WAL file Issues when spinning up replicas #4488

Open

ajrpayne opened this issue May 8, 2024 · 2 comments


ajrpayne commented May 8, 2024

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I am running a supported version of CloudNativePG.

Contact Details

No response

Version

1.23.0

What version of Kubernetes are you using?

1.27

What is your Kubernetes environment?

Cloud: Azure AKS

How did you install the operator?

YAML manifest

What happened?

Using CNPG 1.23.1 with the Postgres image 13.14-18. To create a replica (increasing the instance count by 1), I first create an online volume snapshot. Taking an online volume snapshot appears to cause an issue with the WAL files: the replica spins up from the volume snapshot, but then gets stuck processing the WAL file created after the backup finished.
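For context, the reproduction boils down to two steps against the manifests below. The commands are only a sketch of the workflow rather than the exact invocations used; the resource names come from the manifests, and the file name is hypothetical:

# 1. Take an online volume snapshot backup of the single-instance cluster
kubectl apply -f backup-b-test2-99.yaml   # hypothetical file holding the Backup manifest below

# 2. Once the backup completes, scale the cluster from 1 to 2 instances so the
#    new replica is seeded from the volume snapshot
kubectl patch cluster cnpg-test2 --type merge -p '{"spec":{"instances":2}}'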

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cnpg-test2
spec:
  imageName: cnpg:13.14-18
  imagePullPolicy: IfNotPresent
  instances: 1 # <- increased this to 2.

  postgresql:
    parameters:
      max_connections: "1000"
      shared_buffers: "512MB"
      effective_cache_size: "1536MB"
      maintenance_work_mem: "128MB"
      checkpoint_completion_target: "0.9"
      wal_buffers: "16MB"
      default_statistics_target: "100"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      work_mem: "262kB"
      huge_pages: "off"
      min_wal_size: "2GB"
      max_wal_size: "8GB"
      max_slot_wal_keep_size: "40GB"

  storage:
    storageClass: managed-csi
    size: 64Gi
  walStorage:
    storageClass: managed-csi
    size: 64Gi

  startDelay: 3600
  smartShutdownTimeout: 1800
  stopDelay: 3600
  switchoverDelay: 7200

  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi
      hugepages-2Mi: 768Mi

  primaryUpdateStrategy: unsupervised
  primaryUpdateMethod: restart

  backup:
    volumeSnapshot:
      className: csi-azuredisk-vsc
      walClassName: csi-azuredisk-vsc
    barmanObjectStore:
      s3Credentials:
        accessKeyId:
          name: cnpg-test-s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-test-s3-creds
          key: SECRET_ACCESS_KEY
      endpointURL: "https://s3.bucket.com"
      destinationPath: "s3://cnpg-test/"
      serverName: cnpg-test2
      wal:
        compression: snappy
        encryption: AES256
        maxParallel: 4
      data:
        compression: snappy
        encryption: AES256
        jobs: 4
    retentionPolicy: "30d"
    target: "prefer-standby"

  logLevel: info

  managed:
    roles:
      - name: app
        ensure: present
        passwordSecret:
          name: cnpg-test-app
        connectionLimit: -1
        inherit: true
        createdb: true
        login: true
      - name: user2
        ensure: present
        passwordSecret:
          name: cnpg-test-user2
        connectionLimit: -1
        inherit: true
        superuser: true
        createdb: true
        createrole: true
        login: true
      - name: user3
        ensure: present
        passwordSecret:
          name: cnpg-test-user3
        connectionLimit: -1
        inRoles:
          - pg_monitor
          - pg_read_all_stats
        inherit: true
        login: true
      - name: user4
        ensure: present
        passwordSecret:
          name: cnpg-test-user4
        connectionLimit: -1
        inherit: true
        login: true

  bootstrap:
    recovery:
      database: app
      owner: app
      secret:
        name: cnpg-test-app
      source: cnpg-test

  externalClusters:
    - name: cnpg-test
      barmanObjectStore:
        s3Credentials:
          accessKeyId:
            name: cnpg-test-s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-test-s3-creds
            key: SECRET_ACCESS_KEY
        endpointURL: "https://s3.bucket.com"
        destinationPath: "s3://cnpg-test/"
        serverName: cnpg-test
        wal:
          maxParallel: 4
---
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: b-test2-99
spec:
  cluster:
    name: cnpg-test2
  target: primary
  online: true
  onlineConfiguration:
    immediateCheckpoint: false
    waitForArchive: true
  method: volumeSnapshot

Relevant log output

Restored WAL file","logging_pod":"cnpg-test2-2","walName":"000000060000027D00000033"
Set end-of-wal-stream flag as one of the WAL files to be prefetched was not found
"WAL restore command completed (parallel)","logging_pod":"cnpg-test2-2","walName":"000000060000027D00000033"
restored log file \"000000060000027D00000033\
invalid resource manager ID 88 at 27D/CC0000A0
invalid resource manager ID 88 at 27D/CC0000A0
terminating walreceiver process due to administrator command
end-of-wal-stream flag found. Exiting with error once to let Postgres try switching to streaming replication
*repeats forever*
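For anyone triaging, the stuck replica can be inspected with something along these lines (a sketch: the cnpg kubectl plugin is assumed to be installed, and the pod name is taken from the log lines above):

# Overall cluster and replication status (requires the cnpg kubectl plugin)
kubectl cnpg status cnpg-test2

# Recovery progress inside the stuck replica pod named in the logs
kubectl exec -ti cnpg-test2-2 -- psql -c "SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"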

Code of Conduct

  • I agree to follow this project's Code of Conduct
@ajrpayne ajrpayne added the triage Pending triage label May 8, 2024

ajrpayne commented May 8, 2024

If I use barman for backups, the replica spins up using pg_basebackup and has no issues.

gbartolini (Contributor) commented

Remember that this is a standby, so the same PostgreSQL limitations apply here (see https://cloudnative-pg.io/documentation/current/backup/#backup-from-a-standby, in particular the blue box).

Is this server under workload? Have you tried running the backup a few minutes after the server is up? My fear is that you don't yet have the WAL file containing the checkpoint coming from the primary when you run that backup and create the replica.

Can you please share the backup and the volume snapshot resources too? Thanks.
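For reference, the requested resources can be gathered with something like the following (a sketch; the label selector on the snapshots is an assumption):

# Backup resource named in this report
kubectl get backup b-test2-99 -o yaml

# Volume snapshots created for the cluster (label selector assumed)
kubectl get volumesnapshot -l cnpg.io/cluster=cnpg-test2 -o yaml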
