
Multi-cluster chaos is broken after chaos manager Pod gets recreated #4376

Open
timflannagan opened this issue Mar 19, 2024 · 4 comments

@timflannagan

Bug Report

What version of Kubernetes are you using?

Kubernetes 1.27

What version of Chaos Mesh are you using?

2.6.3

What did you do? / Minimal Reproducible Example

  • Create cluster-1 and cluster-2 RemoteCluster CRs that reference valid kubeconfig Secrets
  • Create a PodChaos resource that references the cluster-1 remote cluster (a minimal sketch follows these steps)
  • Verify that the PodChaos resource gets created in the cluster-1 cluster and the configured chaos experiment works as intended
  • Delete the chaos manager Pod in the "management", "base", etc. cluster (i.e. the cluster where those RemoteCluster CRs live) and re-create the same PodChaos resource that was previously working
  • Verify the following log message is present in the new chaos manager Pod:
2024-03-19T01:26:56.673Z	ERROR	remotechaos	remotechaos/controller.go:120	unable to handle chaos	{"error": "lookup cluster: cluster-1: controllers of cluster doesn't exist", "errorVerbose": "controllers of cluster doesn't exist\ngithub.com/chaos-mesh/chaos-mesh/controllers/multicluster/clusterregistry.init\n\t/runner/_work/chaos-mesh/chaos-mesh/controllers/multicluster/clusterregistry/error.go:22\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6329\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6306\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6306\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172\nlookup cluster: cluster-1\ngithub.com/chaos-mesh/chaos-mesh/controllers/multicluster/clusterregistry.(*RemoteClusterRegistry).WithClient\n\t/runner/_work/chaos-mesh/chaos-mesh/controllers/multicluster/clusterregistry/registry.go:125\ngithub.com/chaos-mesh/chaos-mesh/controllers/multicluster/remotechaos.(*Reconciler).Reconcile\n\t/runner/_work/chaos-mesh/chaos-mesh/controllers/multicluster/remotechaos/controller.go:54\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172"}
github.com/chaos-mesh/chaos-mesh/controllers/multicluster/remotechaos.(*Reconciler).Reconcile
	/runner/_work/chaos-mesh/chaos-mesh/controllers/multicluster/remotechaos/controller.go:120
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:235
  • Verify that the new PodChaos resource no longer works
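
For reference, here is a rough sketch of the PodChaos from step 2 and the manager Pod deletion from step 4. The metadata, selector values, and the controller-manager label selector are illustrative placeholders and may differ from your installation:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  remoteCluster: cluster-1
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: example-workload

$ kubectl apply -f pod-kill-example.yaml
$ kubectl -n chaos-mesh delete pod -l app.kubernetes.io/component=controller-manager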

What did you expect to see?

I expected the multi-cluster chaos functionality to still work after the manager container has been deleted/kicked/etc. I'm guessing there's an in-memory cache that isn't being rehydrated correctly in these edge cases.

What did you see instead?

The multi-cluster capabilities are not resilient to churn of the manager Pod: the previously working experiment fails with the error above once the Pod is recreated. This was also reproducible when upgrading the helm chart from 2.6.2 -> 2.6.3, since that also rolled out a new chaos manager pod.
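
For completeness, the upgrade that triggered the same failure was an ordinary chart bump along the lines of the following (the release name and namespace are from my local setup and may differ elsewhere):

$ helm upgrade chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --version 2.6.3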

Output of chaosctl

@cwen0
Member

cwen0 commented Mar 19, 2024

Delete the chaos manager Pod in the "management", "base", etc. cluster (i.e. the cluster where those RemoteCluster CRs live) and re-create the same PodChaos resource that was previously working

The master chaos manager pod doesn't communicate with the remote chaos manager pod directly; it communicates with the remote cluster's apiserver. Can you check whether the target RemoteCluster CR still exists and whether the remote cluster's kubeconfig has changed?
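
For example, something along these lines can confirm the stored kubeconfig still reaches the remote apiserver (substitute the Secret name and namespace from your RemoteCluster's kubeConfig.secretRef):

$ kubectl -n default get secret cluster-1-kubeconfig -o jsonpath='{.data.kubeconfig}' | base64 -d > /tmp/cluster-1.kubeconfig
$ kubectl --kubeconfig /tmp/cluster-1.kubeconfig -n chaos-mesh get pods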

@timflannagan
Author

timflannagan commented Mar 19, 2024

Can you check whether the target RemoteCluster CR still exists

Yep, here are the RemoteCluster CRs in my local environment. I'm spinning up three kind clusters, where the kc1 and kc2 aliases point to the kind-cluster-1 and kind-cluster-2 kube contexts. In this case, we have a base cluster named "kind-mgmt-cluster" which houses all the RemoteCluster CRs:

$ kc1 get remotecluster -A
No resources found
$ kc2 get remotecluster -A
No resources found
$ k get remotecluster -A -oyaml
apiVersion: v1
items:
- apiVersion: chaos-mesh.org/v1alpha1
  kind: RemoteCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"chaos-mesh.org/v1alpha1","kind":"RemoteCluster","metadata":{"annotations":{},"name":"cluster-1"},"spec":{"configOverride":{"chaosDaemon":{"hostNetwork":true,"privileged":true,"runtime":"containerd","socketPath":"/run/containerd/containerd.sock"},"controllerManager":{"leaderElection":{"enabled":false},"replicaCount":1},"dashboard":{"create":false}},"kubeConfig":{"secretRef":{"key":"kubeconfig","name":"cluster-1-kubeconfig","namespace":"default"}},"namespace":"chaos-mesh","version":"2.6.3"}}
    creationTimestamp: "2024-03-19T01:34:54Z"
    finalizers:
    - chaos-mesh/remotecluster-controllers
    generation: 2
    name: cluster-1
    resourceVersion: "49064"
    uid: 03c71898-53e9-4172-8b08-c13f1f02f166
  spec:
    configOverride:
      chaosDaemon:
        hostNetwork: true
        privileged: true
        runtime: containerd
        socketPath: /run/containerd/containerd.sock
      controllerManager:
        leaderElection:
          enabled: false
        replicaCount: 1
      dashboard:
        create: false
    kubeConfig:
      secretRef:
        key: kubeconfig
        name: cluster-1-kubeconfig
        namespace: default
    namespace: chaos-mesh
    version: 2.6.3
  status:
    currentVersion: 2.6.3
    observedGeneration: 2
- apiVersion: chaos-mesh.org/v1alpha1
  kind: RemoteCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"chaos-mesh.org/v1alpha1","kind":"RemoteCluster","metadata":{"annotations":{},"name":"cluster-2"},"spec":{"configOverride":{"chaosDaemon":{"hostNetwork":true,"privileged":true,"runtime":"containerd","socketPath":"/run/containerd/containerd.sock"},"controllerManager":{"leaderElection":{"enabled":false},"replicaCount":1},"dashboard":{"create":false}},"kubeConfig":{"secretRef":{"key":"kubeconfig","name":"cluster-2-kubeconfig","namespace":"default"}},"namespace":"chaos-mesh","version":"2.6.3"}}
    creationTimestamp: "2024-03-19T01:34:54Z"
    finalizers:
    - chaos-mesh/remotecluster-controllers
    generation: 2
    name: cluster-2
    resourceVersion: "49075"
    uid: 6bceffa1-1b78-4a01-a41b-cfb1c0d2588d
  spec:
    configOverride:
      chaosDaemon:
        hostNetwork: true
        privileged: true
        runtime: containerd
        socketPath: /run/containerd/containerd.sock
      controllerManager:
        leaderElection:
          enabled: false
        replicaCount: 1
      dashboard:
        create: false
    kubeConfig:
      secretRef:
        key: kubeconfig
        name: cluster-2-kubeconfig
        namespace: default
    namespace: chaos-mesh
    version: 2.6.3
  status:
    currentVersion: 2.6.3
    observedGeneration: 2
kind: List
metadata:
  resourceVersion: ""

And I verified the RemoteCluster CRD exists in those remote clusters:

$ kc1 api-resources | grep remotecluster
remoteclusters                                            chaos-mesh.org/v1alpha1                 false        RemoteCluster
$ kc2 api-resources | grep remotecluster
remoteclusters                                            chaos-mesh.org/v1alpha1                 false        RemoteCluster

and whether the remote cluster's kubeconfig has changed

Neither of the remote clusters was updated when the base cluster's chaos manager pod got kicked, i.e. there weren't any updates to the control plane components for those remote clusters, or to the kubeconfig Secrets that live in the base cluster. Hope that makes sense, but let me know if you need any additional details.
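
For anyone who wants to double-check that, comparing the Secrets' resourceVersion before and after deleting the manager Pod is a quick way to confirm they weren't rewritten (the Secret names are the ones from the RemoteCluster specs above):

$ k -n default get secret cluster-1-kubeconfig -o jsonpath='{.metadata.resourceVersion}{"\n"}'
$ k -n default get secret cluster-2-kubeconfig -o jsonpath='{.metadata.resourceVersion}{"\n"}'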

As an aside, I used the following bash function to generate the kubeconfig files for those kind clusters:

$ declare -f kind_write_kubeconfig_files
kind_write_kubeconfig_files () {
	context=${1:-kind-mgmt-cluster}
	# switch to the base/management cluster so the Secrets get created there
	kubectl config use-context $context
	for cluster in $(kind get clusters)
	do
		# the management cluster doesn't need a kubeconfig Secret for itself
		if [[ $cluster == "mgmt-cluster" ]]
		then
			continue
		fi
		# rewrite the server address to the kind node's docker network IP so a
		# chaos manager in another kind cluster can reach this cluster's apiserver
		external_ip=$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $cluster-control-plane)
		kind get kubeconfig --name $cluster | sed -E "s/server: https:\/\/[0-9\.]+:[0-9]+/server: https:\/\/$external_ip:6443/g" > .bin/$cluster.kubeconfig
		kubectl -n default create secret generic $cluster-kubeconfig --from-file=kubeconfig=.bin/$cluster.kubeconfig
	done
}
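
For completeness, the function gets invoked against the management context (its default argument) before the RemoteCluster CRs are applied, roughly:

$ kind_write_kubeconfig_files kind-mgmt-cluster
$ k -n default get secrets | grep kubeconfig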

I was largely following the multi-cluster documentation and the steps outlined in the #4150 issue. I'm also able to consistently reproduce this, so let me know if more information is needed here. Lastly, I tried manually restarting the chaos manager pod in the remote clusters to see whether there were any potential races between the base and remote clusters' chaos mesh deployments, but I didn't have any luck there either.

@nioshield
Contributor

Delete the chaos manager Pod in the "management", "base", etc. cluster (i.e. the cluster where those RemoteCluster CRs live) and re-create the same PodChaos resource that was previously working

It seems to be the same problem as #4208: after you delete the controller pod, the previous RemoteCluster cannot be re-registered.
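
If that's the mechanism (registration only happening when the RemoteCluster spec changes), a restarted controller sees nothing new to reconcile; comparing generation and observedGeneration on the CR (both are 2 in the output above) shows that state:

$ kubectl get remotecluster cluster-1 -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'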

@timflannagan
Author

@nioshield Awesome - thanks for the link! Okay, sounds like this is a known issue that folks have seen before then.
