Multi-cluster chaos is broken after chaos manager Pod gets recreated #4376
Comments
The chaos manager Pod in the base cluster doesn't communicate with the remote cluster's chaos manager Pod directly; it communicates with the remote cluster's apiserver. Can you check whether the RemoteCluster CRD exists in the target remote cluster, and whether the remote cluster's kubeconfig has changed?
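For reference, those two checks can be run with something like the sketch below; the kubectl contexts, Secret name, and file path are assumptions taken from the setup described in the reply that follows and may differ in other environments.

```sh
# On the remote cluster: confirm the RemoteCluster CRD is installed.
kubectl --context kind-cluster-1 api-resources | grep remotecluster

# On the base cluster: confirm the kubeconfig Secret still matches the file
# that was originally used to register the remote cluster.
kubectl --context kind-mgmt-cluster -n default get secret cluster-1-kubeconfig \
  -o jsonpath='{.data.kubeconfig}' | base64 -d | diff - .bin/cluster-1.kubeconfig
```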
Yep, here are the RemoteCluster CRs in my local environment. I'm spinning up three kind clusters (`kc1`/`kc2` point at the remote clusters, `k` at the management cluster):

```console
$ kc1 get remotecluster -A
No resources found
$ kc2 get remotecluster -A
No resources found
$ k get remotecluster -A -oyaml
apiVersion: v1
items:
- apiVersion: chaos-mesh.org/v1alpha1
  kind: RemoteCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"chaos-mesh.org/v1alpha1","kind":"RemoteCluster","metadata":{"annotations":{},"name":"cluster-1"},"spec":{"configOverride":{"chaosDaemon":{"hostNetwork":true,"privileged":true,"runtime":"containerd","socketPath":"/run/containerd/containerd.sock"},"controllerManager":{"leaderElection":{"enabled":false},"replicaCount":1},"dashboard":{"create":false}},"kubeConfig":{"secretRef":{"key":"kubeconfig","name":"cluster-1-kubeconfig","namespace":"default"}},"namespace":"chaos-mesh","version":"2.6.3"}}
    creationTimestamp: "2024-03-19T01:34:54Z"
    finalizers:
    - chaos-mesh/remotecluster-controllers
    generation: 2
    name: cluster-1
    resourceVersion: "49064"
    uid: 03c71898-53e9-4172-8b08-c13f1f02f166
  spec:
    configOverride:
      chaosDaemon:
        hostNetwork: true
        privileged: true
        runtime: containerd
        socketPath: /run/containerd/containerd.sock
      controllerManager:
        leaderElection:
          enabled: false
        replicaCount: 1
      dashboard:
        create: false
    kubeConfig:
      secretRef:
        key: kubeconfig
        name: cluster-1-kubeconfig
        namespace: default
    namespace: chaos-mesh
    version: 2.6.3
  status:
    currentVersion: 2.6.3
    observedGeneration: 2
- apiVersion: chaos-mesh.org/v1alpha1
  kind: RemoteCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"chaos-mesh.org/v1alpha1","kind":"RemoteCluster","metadata":{"annotations":{},"name":"cluster-2"},"spec":{"configOverride":{"chaosDaemon":{"hostNetwork":true,"privileged":true,"runtime":"containerd","socketPath":"/run/containerd/containerd.sock"},"controllerManager":{"leaderElection":{"enabled":false},"replicaCount":1},"dashboard":{"create":false}},"kubeConfig":{"secretRef":{"key":"kubeconfig","name":"cluster-2-kubeconfig","namespace":"default"}},"namespace":"chaos-mesh","version":"2.6.3"}}
    creationTimestamp: "2024-03-19T01:34:54Z"
    finalizers:
    - chaos-mesh/remotecluster-controllers
    generation: 2
    name: cluster-2
    resourceVersion: "49075"
    uid: 6bceffa1-1b78-4a01-a41b-cfb1c0d2588d
  spec:
    configOverride:
      chaosDaemon:
        hostNetwork: true
        privileged: true
        runtime: containerd
        socketPath: /run/containerd/containerd.sock
      controllerManager:
        leaderElection:
          enabled: false
        replicaCount: 1
      dashboard:
        create: false
    kubeConfig:
      secretRef:
        key: kubeconfig
        name: cluster-2-kubeconfig
        namespace: default
    namespace: chaos-mesh
    version: 2.6.3
  status:
    currentVersion: 2.6.3
    observedGeneration: 2
kind: List
metadata:
  resourceVersion: ""
```

And I verified that the CRD exists in those remote clusters:

```console
$ kc1 api-resources | grep remotecluster
remoteclusters   chaos-mesh.org/v1alpha1   false   RemoteCluster
$ kc2 api-resources | grep remotecluster
remoteclusters   chaos-mesh.org/v1alpha1   false   RemoteCluster
```
Neither of the remote clusters was updated when the base cluster's chaos manager Pod got kicked; i.e., there weren't any updates to the control plane components of those remote clusters, nor to the kubeconfig Secrets that live in the base cluster. Hope that makes sense, but let me know if you need any additional details.

As an aside, I used the following bash function to generate the kubeconfig files (and the corresponding Secrets) for those kind clusters:

```console
$ declare -f kind_write_kubeconfig_files
kind_write_kubeconfig_files () {
    context=${1:-kind-mgmt-cluster}
    kubectl config set-context $context
    for cluster in $(kind get clusters)
    do
        if [[ $cluster == "mgmt-cluster" ]]
        then
            continue
        fi
        external_ip=$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $cluster-control-plane)
        kind get kubeconfig --name $cluster | sed -E "s/server: https:\/\/[0-9\.]+:[0-9]+/server: https:\/\/$external_ip:6443/g" > .bin/$cluster.kubeconfig
        kubectl -n default create secret generic $cluster-kubeconfig --from-file=kubeconfig=.bin/$cluster.kubeconfig
    done
}
```

I was largely following the multi-cluster documentation and the steps outlined in issue #4150. I'm also able to consistently reproduce this, so let me know if more information is needed here. Lastly, I tried manually restarting the chaos manager Pod in the remote clusters to see whether there were any potential races between the base and remote clusters' Chaos Mesh deployments, but I didn't have any luck there either.
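For reference, recreating the base cluster's chaos manager Pod to trigger the breakage can be done with something like the sketch below; the `chaos-controller-manager` Deployment and `chaos-mesh` namespace are the Helm chart defaults, and the kubectl context is an assumption from the kind setup above.

```sh
# In the base (management) cluster: force the chaos manager Pod to be recreated.
kubectl --context kind-mgmt-cluster -n chaos-mesh rollout restart deployment/chaos-controller-manager
kubectl --context kind-mgmt-cluster -n chaos-mesh rollout status deployment/chaos-controller-manager

# After the new Pod is ready, chaos targeting the remote clusters (via the
# RemoteCluster CRs shown above) no longer takes effect.
```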
It seems to be the same problem as #4208.
@nioshield Awesome - thanks for the link! Okay, sounds like this is a known issue that folks have seen before then. |
Bug Report
What version of Kubernetes are you using?
Kubernetes 1.27
What version of Chaos Mesh are you using?
2.6.3
What did you do? / Minimal Reproducible Example
What did you expect to see?
The multi-cluster chaos functionality continues to work after the chaos manager Pod has been deleted, kicked, or otherwise recreated.
What did you see instead?
Multi-cluster chaos stops working once the base cluster's chaos manager Pod is recreated. I'm guessing there's an in-memory cache that isn't being rehydrated correctly in these edge cases. This was also reproducible when upgrading the Helm chart from 2.6.2 -> 2.6.3, since that rollout also replaces the chaos manager Pod.
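Roughly, the chart upgrade that also reproduced this looked like the sketch below; the release name, namespace, and `--reuse-values` flag are assumptions for a typical Helm-managed install rather than the exact commands from the report.

```sh
# Upgrading the Helm release rolls out a new chaos-controller-manager Pod,
# which is enough to trigger the same breakage. Release and namespace names
# are assumptions for a typical install.
helm repo update
helm upgrade chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --version 2.6.3 \
  --reuse-values
```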
Output of chaosctl