
Vault with raft storage failed to join after applying changes in manifest #1627

Open
revathyr13 opened this issue May 23, 2022 · 4 comments
Labels
lifecycle/stale Denotes an issue or PR that has become stale and will be auto-closed.

Comments

revathyr13 commented May 23, 2022

Hello Team,

We are facing auto-join issues with a Vault HA Raft cluster in Kubernetes.
vault.txt

Issue

We are deploying the Vault cluster in Kubernetes with Raft storage. The unseal keys are stored in our master Vault.
The manifest we use to deploy the Vault cluster is attached. On the first deployment everything looks good: all Vault pods come up without any issues.

NAME READY STATUS RESTARTS AGE
vault-0 3/3 Running 0 9m44s
vault-1 3/3 Running 0 5m32s
vault-2 3/3 Running 0 49s

But once we make any change in the manifest [say, changing the setting veleroEnabled: true to veleroEnabled: false] and re-apply it using kubectl apply -f vault.yaml, the pod vault-2 comes back up with the change applied and shows as ready, but vault-1 won't come up.

vault-0 3/3 Running 0 44m
vault-1 1/2 CrashLoopBackOff 12 40m
vault-2 2/2 Running 0 40m

While digging in, we noticed that vault-2 merely shows a ready status even though it has not joined any cluster:

$ kubectl exec -ti vault-2 sh
/ # export VAULT_ADDR='https://vault-2:8200'
/ # vault operator raft list-peers
No raft cluster configuration found

Because vault-2 came up with a ready status, the operator starts applying the changes to vault-1, which breaks the whole Raft cluster.
This shouldn't happen: vault-2 should report ready only after it has joined the existing cluster, which is not the case here.
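
For reference, this is the kind of check we run by hand to see whether each pod has actually joined the Raft cluster. This is only a rough sketch: the container name and the in-pod address are assumptions from our manifest, and list-peers may also need a VAULT_TOKEN in the pod environment.

# sketch: verify raft membership on every pod before trusting its ready state
for i in 0 1 2; do
  echo "--- vault-$i ---"
  kubectl exec vault-$i -c vault -- sh -c '
    export VAULT_ADDR=https://127.0.0.1:8200 VAULT_SKIP_VERIFY=true
    vault status                      # shows HA mode and whether the node is sealed
    vault operator raft list-peers    # an empty result or error here means the node never joined
  '
done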
Logs from vault-1

2022-05-23T06:28:40.757Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault:8200
2022-05-23T06:28:40.780Z [ERROR] core: failed to get raft challenge: leader_addr=https://vault:8200 error="could not retrieve raft bootstrap package"
2022-05-23T06:28:40.780Z [ERROR] core: failed to join raft cluster: error="timed out on raft join: %!w()"

Logs from vault-2

2022-05-23T06:31:20.243Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""
2022-05-23T06:31:25.243Z [DEBUG] core: forwarding: error sending echo request to active node: error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing remote error: tls: internal error""

We believe increasing the readiness probe limits or adding initialDelaySeconds on the operator side would help. As this appears to be an operator-side issue, could someone please have a look and assist?
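
To illustrate what we mean, this is the kind of change we have in mind. It is only a sketch, not something we have verified: the operator may reconcile the StatefulSet back to its own spec, and the container index is an assumption.

# hypothetical workaround: delay the vault container's readiness probe on the StatefulSet
kubectl patch statefulset vault --type=json -p='[
  {"op": "add",
   "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds",
   "value": 30}
]'

Ideally the operator itself would report the pod ready only after the node has joined the Raft cluster, rather than relying on a time-based delay.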

Thank you

pzim commented Feb 10, 2023

@revathyr13 - did you ever figure out how to stop vault-1 from crash-looping? We are seeing something similar in one of our k8s clusters, where vault-1 is crash-looping even though the raft cluster appears healthy (from vault-2):

/ # vault operator raft list-peers
Node                                    Address         State       Voter
----                                    -------         -----       -----
bb7acae0-ae42-dc08-8b8c-f7183f14dc89    vault-0:8201    leader      true
a496141a-19be-a5be-8dac-e99c2174957d    vault-1:8201    follower    true
b6d750de-0da8-b158-d372-ee8ea33a251e    vault-2:8201    follower    true
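
In case it helps anyone comparing notes, these are the checks we ran against the crash-looping pod (a sketch only; the container name vault is an assumption from our setup):

kubectl describe pod vault-1                      # probe failures and restart reason
kubectl logs vault-1 -c vault --previous          # logs from the last crashed container
kubectl exec vault-1 -c vault -- sh -c 'VAULT_ADDR=https://127.0.0.1:8200 VAULT_SKIP_VERIFY=true vault status'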


Thank you for your contribution! This issue has been automatically marked as stale because it has no recent activity in the last 60 days. It will be closed in 20 days, if no further activity occurs. If this issue is still relevant, please leave a comment to let us know, and the stale label will be automatically removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR that has become stale and will be auto-closed. label Feb 18, 2024

This issue has been marked stale for 20 days, and is now closed due to inactivity. If the issue is still relevant, please re-open this issue or file a new one. Thank you!

@github-actions github-actions bot closed this as not planned Mar 10, 2024
@csatib02 csatib02 reopened this Mar 10, 2024
@csatib02 csatib02 removed the lifecycle/stale Denotes an issue or PR that has become stale and will be auto-closed. label Mar 10, 2024

Thank you for your contribution! This issue has been automatically marked as stale because it has no recent activity in the last 60 days. It will be closed in 20 days, if no further activity occurs. If this issue is still relevant, please leave a comment to let us know, and the stale label will be automatically removed.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR that has become stale and will be auto-closed. label May 12, 2024