[Bug]: "pod not available", "Cannot extract Pod status", "WAL file not found in the recovery object store" #4551
Is there an existing issue already for this bug?
I have read the troubleshooting guide
I am running a supported version of CloudNativePG
Contact Details
No response
Version
An older release (1.21.x)
What version of Kubernetes are you using?
1.29
What is your Kubernetes environment?
Self-managed: RKE
How did you install the operator?
Helm
What happened?
We have a Kubernetes cluster with 6 worker nodes, distributed across 3 zones with 2 nodes each.
The CNPG cluster consists of 3 instances and a pooler, created with the CNPG operator.
I then forcefully shut down the worker node hosting the primary instance of the Postgres cluster to see whether it heals itself properly.
Within seconds another instance was promoted to primary, and it seemed to work fine.
But I assumed that after a while the operator would spawn a new standby instance on the other worker in the same zone, which was still running. That never happened, even after waiting for 20 minutes, so the cluster never became healthy again on its own.
I then restarted the node that I had shut down before. After a while the missing Postgres instance pod got created again, but now it is stuck: `kubectl cnpg status` shows `pod not available`, while the pod logs show that it has restored itself from the archived WAL files in the S3 bucket, except that it tries to restore from a file that is newer than the latest one the backup status shows. But see for yourself.

Cluster status:
As you can see, it still thinks it is in the failing-over state, although the primary has been ready again for 20 hours.
You can also see that the latest WAL archive is `0000000D0000000000000039`.

Now let's have a look into the pod `keycloak-db-pgcluster-5` (I removed a bit of clutter from the logs so they are easier to read):

Why does it try to read `0000000D000000000000003A`? And why is `endOfWALStream` set to `false` on the last WAL file when it should be `true`? Or am I misinterpreting something here?
And here is the log of the postgres-operator, which repeats every 2 seconds:
We experience this problem quite often when we do maintenance on the Kubernetes nodes and need to restart them. It has already destroyed multiple clusters. The only way to fix this is to destroy the instance with `kubectl cnpg destroy keycloak-db-pgcluster 5` and wait for a fresh instance to come up. Afterwards we can delete the PVC and PV of the old instance, since it makes no sense to keep them.

We want to upgrade soon to the latest version, but after finding this recent bug I think this can still happen in newer versions: #4412
Cluster resource
Relevant log output
No response
Code of Conduct
Comments

I fixed the unhealthy state of the cluster by destroying the instance for now; the operator then created a new pod. However, that should not be the solution to this kind of error. The operator should be able to heal the cluster if one worker node goes down for whatever reason. I still hope it is just a misconfiguration, but I am quite sure I did nothing wrong here.