Hi all!

We're looking at some of the use cases in which a Cloud Native PG cluster does not recover from issues by itself. As far as we can tell, the only way to get into such a position is to delete the PVCs associated with the cluster directly. Currently, that leaves us in one of two situations:
Deleting the primary PVC and force-killing the pod (no clean shutdown)

This one gets the operator into a reconciliation loop until I restart the other pods (a rough reproduction sketch follows the logs). The Cloud Native PG Controller keeps logging:
{"level":"info","ts":"2024-04-29T15:37:16Z","msg":"Current primary isn't healthy, initiating a failover","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pomerium-sessions","namespace":"pomerium"},"namespace":"pomerium","name":"pomerium-sessions","reconcileID":"6a2a198b-1bfa-43e9-bfde-57dc1baeb83e"}
{"level":"info","ts":"2024-04-29T15:37:16Z","msg":"pod status (1 of 2)","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pomerium-sessions","namespace":"pomerium"},"namespace":"pomerium","name":"pomerium-sessions","reconcileID":"6a2a198b-1bfa-43e9-bfde-57dc1baeb83e","name":"pomerium-sessions-4","currentLsn":"","receivedLsn":"0/B000060","replayLsn":"0/B000060","isPrimary":false,"isPodReady":true,"pendingRestart":false,"pendingRestartForDecrease":false,"statusCollectionError":null}
{"level":"info","ts":"2024-04-29T15:37:16Z","msg":"pod status (2 of 2)","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pomerium-sessions","namespace":"pomerium"},"namespace":"pomerium","name":"pomerium-sessions","reconcileID":"6a2a198b-1bfa-43e9-bfde-57dc1baeb83e","name":"pomerium-sessions-5","currentLsn":"","receivedLsn":"0/B000060","replayLsn":"0/B000060","isPrimary":false,"isPodReady":true,"pendingRestart":false,"pendingRestartForDecrease":false,"statusCollectionError":null}
{"level":"info","ts":"2024-04-29T15:37:16Z","msg":"Waiting for all WAL receivers to be down to elect a new primary","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pomerium-sessions","namespace":"pomerium"},"namespace":"pomerium","name":"pomerium-sessions","reconcileID":"6a2a198b-1bfa-43e9-bfde-57dc1baeb83e"}
Deleting all PVCs and force-killing all pods

This actually gets the cluster into a weird state: it tries to create new replicas (in this case pomerium-sessions-6) that refer to a PVC the operator has not yet created. It stays in this loop until we recreate the Cluster resource (so basically starting from scratch). How can we get out of this?
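For completeness, this is roughly what we do for the second scenario and the only way out we have found so far. Again just a sketch: the cnpg.io/cluster label selector and the cluster.yaml file name are assumptions on my side.

```sh
# Sketch of scenario 2: delete every PVC of the cluster and force-kill all
# pods. The label selector is an assumption (CloudNativePG labels its pods
# and PVCs with cnpg.io/cluster=<cluster-name>).
kubectl delete pvc -l cnpg.io/cluster=pomerium-sessions -n pomerium --wait=false
kubectl delete pod -l cnpg.io/cluster=pomerium-sessions -n pomerium \
  --force --grace-period=0

# Only recovery we know of: recreate the Cluster resource from scratch
# (cluster.yaml stands for whatever manifest defines it).
kubectl delete cluster.postgresql.cnpg.io pomerium-sessions -n pomerium
kubectl apply -f cluster.yaml -n pomerium
```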