
Vault with raft storage failed to join after applying changes in manifest #1627

Open
revathyr13 opened this issue May 23, 2022 · 4 comments
Labels
lifecycle/stale Denotes an issue or PR that has become stale and will be auto-closed.

Comments

revathyr13 commented May 23, 2022

Hello Team,

We are facing auto-join issues with a Vault HA Raft cluster in Kubernetes.
vault.txt

Issue

We are deploying the Vault cluster in Kubernetes with Raft storage. The unseal keys are stored in our master Vault.
The manifest we use to deploy the Vault cluster is attached. On the first deployment everything looks good: all Vault pods come up without any issues.

NAME READY STATUS RESTARTS AGE
vault-0 3/3 Running 0 9m44s
vault-1 3/3 Running 0 5m32s
vault-2 3/3 Running 0 49s

But once we make any change in the manifest [say, changing the setting veleroEnabled: true to veleroEnabled: false] and re-apply it using kubectl apply -f vault.yaml, the pod vault-2 comes back up with the change applied and shows as ready, but vault-1 won't come up.

vault-0 3/3 Running 0 44m
vault-1 1/2 CrashLoopBackOff 12 40m
vault-2 2/2 Running 0 40m

While digging in, we noticed that vault-2 merely shows a ready status even though it has not joined any cluster:

$ kubectl exec -ti vault-2 sh
/ # export VAULT_ADDR='https://vault-2:8200'
/ # vault operator raft list-peers
No raft cluster configuration found

Because vault-2 came up with a ready status, the operator starts applying the changes to vault-1, which breaks the whole Raft cluster.
This shouldn't happen: vault-2 should report ready only after it has joined the existing cluster, which is not the case here.
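
For reference, this is the kind of check we run by hand to see whether each pod has actually joined the Raft cluster. This is only a rough sketch: the container name and the in-pod address are assumptions from our manifest, and list-peers may also need a VAULT_TOKEN in the pod environment.

# sketch: verify raft membership on every pod before trusting its ready state
for i in 0 1 2; do
  echo "--- vault-$i ---"
  kubectl exec vault-$i -c vault -- sh -c '
    export VAULT_ADDR=https://127.0.0.1:8200 VAULT_SKIP_VERIFY=true
    vault status                      # shows HA mode and whether the node is sealed
    vault operator raft list-peers    # an empty result or error here means the node never joined
  '
done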
Logs from vault-1

2022-05-23T06:28:40.757Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault:8200
2022-05-23T06:28:40.780Z [ERROR] core: failed to get raft challenge: leader_addr=https://vault:8200 error="could not retrieve raft bootstrap package"
2022-05-23T06:28:40.780Z [ERROR] core: failed to join raft cluster: error="timed out on raft join: %!w()"

Logs from vault-2

2022-05-23T06:31:20.243Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
2022-05-23T06:31:25.243Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""

We believe increasing the readiness probe limits or adding initialDelaySeconds on the operator side would help. As this appears to be an operator-side issue, could someone please have a look and assist?
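
To illustrate what we mean, this is the kind of change we have in mind. It is only a sketch, not something we have verified: the operator may reconcile the StatefulSet back to its own spec, and the container index is an assumption.

# hypothetical workaround: delay the vault container's readiness probe on the StatefulSet
kubectl patch statefulset vault --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds",
   "value": 30}
]'

Ideally the operator itself would report the pod ready only after the node has joined the Raft cluster, rather than relying on a time-based delay.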

Thank you

pzim commented Feb 10, 2023

@revathyr13 - did you ever figure out how to stop vault-1 from crash-looping? We are seeing something similar in one of our k8s clusters, where vault-1 is crash-looping even though the raft cluster appears healthy (from vault-2):

/ # vault operator raft list-peers
Node                                    Address         State       Voter
----                                    -------         -----       -----
bb7acae0-ae42-dc08-8b8c-f7183f14dc89    vault-0:8201    leader      true
a496141a-19be-a5be-8dac-e99c2174957d    vault-1:8201    follower    true
b6d750de-0da8-b158-d372-ee8ea33a251e    vault-2:8201    follower    true
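
In case it helps anyone comparing notes, these are the checks we ran against the crash-looping pod (a sketch only; the container name vault is an assumption from our setup):

kubectl describe pod vault-1                      # probe failures and restart reason
kubectl logs vault-1 -c vault --previous          # logs from the last crashed container
kubectl exec vault-1 -c vault -- sh -c 'VAULT_ADDR=https://127.0.0.1:8200 VAULT_SKIP_VERIFY=true vault status'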


Thank you for your contribution! This issue has been automatically marked as stale because it has no recent activity in the last 60 days. It will be closed in 20 days, if no further activity occurs. If this issue is still relevant, please leave a comment to let us know, and the stale label will be automatically removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR that has become stale and will be auto-closed. label Feb 18, 2024

This issue has been marked stale for 20 days, and is now closed due to inactivity. If the issue is still relevant, please re-open this issue or file a new one. Thank you!

@github-actions github-actions bot closed this as not planned Mar 10, 2024
@csatib02 csatib02 reopened this Mar 10, 2024
@csatib02 csatib02 removed the lifecycle/stale Denotes an issue or PR that has become stale and will be auto-closed. label Mar 10, 2024

Thank you for your contribution! This issue has been automatically marked as stale because it has no recent activity in the last 60 days. It will be closed in 20 days, if no further activity occurs. If this issue is still relevant, please leave a comment to let us know, and the stale label will be automatically removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR that has become stale and will be auto-closed. label May 12, 2024