
Multi-cluster chaos is broken after chaos manager Pod gets recreated #4376

Open
timflannagan opened this issue Mar 19, 2024 · 4 comments

@timflannagan

Bug Report

What version of Kubernetes are you using?

Kubernetes 1.27

What version of Chaos Mesh are you using?

2.6.3

What did you do? / Minimal Reproducible Example

  • Create cluster-1 and cluster-2 RemoteCluster CRs that reference valid kubeconfig Secrets
  • Create a PodChaos resource that references the cluster-1 remote cluster (a minimal sketch follows these steps)
  • Verify that the PodChaos resource gets created in the cluster-1 cluster and the configured chaos experiment works as intended
  • Delete the chaos manager Pod in the "management", "base", etc. cluster (i.e. the cluster where those RemoteCluster CRs live) and re-create the same PodChaos resource that was previously working
  • Verify the following log message is present in the new chaos manager Pod:
2024-03-19T01:26:56.673Z	ERROR	remotechaos	remotechaos/controller.go:120	unable to handle chaos	{"error": "lookup cluster: cluster-1: controllers of cluster doesn't exist", "errorVerbose": "controllers of cluster doesn't exist\ngithub.com/chaos-mesh/chaos-mesh/controllers/multicluster/clusterregistry.init\n\t/runner/_work/chaos-mesh/chaos-mesh/controllers/multicluster/clusterregistry/error.go:22\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6329\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6306\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6306\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172\nlookup cluster: cluster-1\ngithub.com/chaos-mesh/chaos-mesh/controllers/multicluster/clusterregistry.(*RemoteClusterRegistry).WithClient\n\t/runner/_work/chaos-mesh/chaos-mesh/controllers/multicluster/clusterregistry/registry.go:125\ngithub.com/chaos-mesh/chaos-mesh/controllers/multicluster/remotechaos.(*Reconciler).Reconcile\n\t/runner/_work/chaos-mesh/chaos-mesh/controllers/multicluster/remotechaos/controller.go:54\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172"}
github.com/chaos-mesh/chaos-mesh/controllers/multicluster/remotechaos.(*Reconciler).Reconcile
	/runner/_work/chaos-mesh/chaos-mesh/controllers/multicluster/remotechaos/controller.go:120
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/tmp/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.2/pkg/internal/controller/controller.go:235
  • Verify that the new PodChaos resource no longer works
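
For reference, here is a rough sketch of the PodChaos from step 2 and the manager Pod deletion from step 4. The metadata, selector values, and the controller-manager label selector are illustrative placeholders and may differ from your installation:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  remoteCluster: cluster-1
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: example-workload

$ kubectl apply -f pod-kill-example.yaml
$ kubectl -n chaos-mesh delete pod -l app.kubernetes.io/component=controller-manager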

What did you expect to see?

I expected the multi-cluster chaos functionality to still work after the manager container has been deleted/kicked/etc. I'm guessing there's an in-memory cache that isn't being rehydrated correctly in these edge cases.

What did you see instead?

The multi-cluster capabilities are not resilient to churn of the manager Pod: the previously working experiment fails with the error above once the Pod is recreated. This was also reproducible when upgrading the helm chart from 2.6.2 -> 2.6.3, since that also rolled out a new chaos manager pod.
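
For completeness, the upgrade that triggered the same failure was an ordinary chart bump along the lines of the following (the release name and namespace are from my local setup and may differ elsewhere):

$ helm upgrade chaos-mesh chaos-mesh/chaos-mesh -n chaos-mesh --version 2.6.3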

Output of chaosctl

@cwen0
Member

cwen0 commented Mar 19, 2024

Delete the chaos manager Pod in the "management", "base", etc. cluster (i.e. the cluster where those RemoteCluster CRs live) and re-create the same PodChaos resource that was previously working

The master chaos manager pod doesn't communicate with the remote chaos manager pod directly; it communicates with the remote cluster's apiserver. Can you check whether the target RemoteCluster CR still exists and whether the remote cluster's kubeconfig has changed?
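
For example, something along these lines can confirm the stored kubeconfig still reaches the remote apiserver (substitute the Secret name and namespace from your RemoteCluster's kubeConfig.secretRef):

$ kubectl -n default get secret cluster-1-kubeconfig -o jsonpath='{.data.kubeconfig}' | base64 -d > /tmp/cluster-1.kubeconfig
$ kubectl --kubeconfig /tmp/cluster-1.kubeconfig -n chaos-mesh get pods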

@timflannagan
Author

timflannagan commented Mar 19, 2024

Can you check whether the target RemoteCluster CR still exists

Yep, here are the RemoteCluster CRs in my local environment. I'm spinning up three kind clusters, where the kc1 and kc2 aliases point to the kind-cluster-1 and kind-cluster-2 kube contexts. In this case, we have a base cluster named "kind-mgmt-cluster" which houses all the RemoteCluster CRs:

$ kc1 get remotecluster -A
No resources found
$ kc2 get remotecluster -A
No resources found
$ k get remotecluster -A -oyaml
apiVersion: v1
items:
- apiVersion: chaos-mesh.org/v1alpha1
  kind: RemoteCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"chaos-mesh.org/v1alpha1","kind":"RemoteCluster","metadata":{"annotations":{},"name":"cluster-1"},"spec":{"configOverride":{"chaosDaemon":{"hostNetwork":true,"privileged":true,"runtime":"containerd","socketPath":"/run/containerd/containerd.sock"},"controllerManager":{"leaderElection":{"enabled":false},"replicaCount":1},"dashboard":{"create":false}},"kubeConfig":{"secretRef":{"key":"kubeconfig","name":"cluster-1-kubeconfig","namespace":"default"}},"namespace":"chaos-mesh","version":"2.6.3"}}
    creationTimestamp: "2024-03-19T01:34:54Z"
    finalizers:
    - chaos-mesh/remotecluster-controllers
    generation: 2
    name: cluster-1
    resourceVersion: "49064"
    uid: 03c71898-53e9-4172-8b08-c13f1f02f166
  spec:
    configOverride:
      chaosDaemon:
        hostNetwork: true
        privileged: true
        runtime: containerd
        socketPath: /run/containerd/containerd.sock
      controllerManager:
        leaderElection:
          enabled: false
        replicaCount: 1
      dashboard:
        create: false
    kubeConfig:
      secretRef:
        key: kubeconfig
        name: cluster-1-kubeconfig
        namespace: default
    namespace: chaos-mesh
    version: 2.6.3
  status:
    currentVersion: 2.6.3
    observedGeneration: 2
- apiVersion: chaos-mesh.org/v1alpha1
  kind: RemoteCluster
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"chaos-mesh.org/v1alpha1","kind":"RemoteCluster","metadata":{"annotations":{},"name":"cluster-2"},"spec":{"configOverride":{"chaosDaemon":{"hostNetwork":true,"privileged":true,"runtime":"containerd","socketPath":"/run/containerd/containerd.sock"},"controllerManager":{"leaderElection":{"enabled":false},"replicaCount":1},"dashboard":{"create":false}},"kubeConfig":{"secretRef":{"key":"kubeconfig","name":"cluster-2-kubeconfig","namespace":"default"}},"namespace":"chaos-mesh","version":"2.6.3"}}
    creationTimestamp: "2024-03-19T01:34:54Z"
    finalizers:
    - chaos-mesh/remotecluster-controllers
    generation: 2
    name: cluster-2
    resourceVersion: "49075"
    uid: 6bceffa1-1b78-4a01-a41b-cfb1c0d2588d
  spec:
    configOverride:
      chaosDaemon:
        hostNetwork: true
        privileged: true
        runtime: containerd
        socketPath: /run/containerd/containerd.sock
      controllerManager:
        leaderElection:
          enabled: false
        replicaCount: 1
      dashboard:
        create: false
    kubeConfig:
      secretRef:
        key: kubeconfig
        name: cluster-2-kubeconfig
        namespace: default
    namespace: chaos-mesh
    version: 2.6.3
  status:
    currentVersion: 2.6.3
    observedGeneration: 2
kind: List
metadata:
  resourceVersion: ""

And I verified the RemoteCluster CRD exists in those remote clusters:

$ kc1 api-resources | grep remotecluster
remoteclusters                                            chaos-mesh.org/v1alpha1                 false        RemoteCluster
$ kc2 api-resources | grep remotecluster
remoteclusters                                            chaos-mesh.org/v1alpha1                 false        RemoteCluster

and whether the remote cluster's kubeconfig has changed

Neither of the remote clusters was updated when the base cluster's chaos manager pod got kicked, i.e. there weren't any updates to the control plane components for those remote clusters, or to the kubeconfig Secrets that live in the base cluster. Hope that makes sense, but let me know if you need any additional details.
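
For anyone who wants to double-check that, comparing the Secrets' resourceVersion before and after deleting the manager Pod is a quick way to confirm they weren't rewritten (the Secret names are the ones from the RemoteCluster specs above):

$ k -n default get secret cluster-1-kubeconfig -o jsonpath='{.metadata.resourceVersion}{"\n"}'
$ k -n default get secret cluster-2-kubeconfig -o jsonpath='{.metadata.resourceVersion}{"\n"}'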

As an aside, I used the following bash function to generate the kubeconfig files for those kind clusters:

$ declare -f kind_write_kubeconfig_files
kind_write_kubeconfig_files () {
	context=${1:-kind-mgmt-cluster}
	# switch to the base/management cluster so the Secrets get created there
	kubectl config use-context $context
	for cluster in $(kind get clusters)
	do
		# the management cluster doesn't need a kubeconfig Secret for itself
		if [[ $cluster == "mgmt-cluster" ]]
		then
			continue
		fi
		# rewrite the server address to the kind node's docker network IP so a
		# chaos manager in another kind cluster can reach this cluster's apiserver
		external_ip=$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $cluster-control-plane)
		kind get kubeconfig --name $cluster | sed -E "s/server: https:\/\/[0-9\.]+:[0-9]+/server: https:\/\/$external_ip:6443/g" > .bin/$cluster.kubeconfig
		kubectl -n default create secret generic $cluster-kubeconfig --from-file=kubeconfig=.bin/$cluster.kubeconfig
	done
}
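
For completeness, the function gets invoked against the management context (its default argument) before the RemoteCluster CRs are applied, roughly:

$ kind_write_kubeconfig_files kind-mgmt-cluster
$ k -n default get secrets | grep kubeconfig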

I was largely following the multi-cluster documentation and the steps outlined in the #4150 issue. I'm also able to consistently reproduce this, so let me know if more information is needed here. Lastly, I tried manually restarting the chaos manager pod in the remote clusters to see whether there were any potential races between the base and remote clusters' chaos mesh deployments, but I didn't have any luck there either.

@nioshield
Contributor

Delete the chaos manager Pod in the "management", "base", etc. cluster (i.e. the cluster where those RemoteCluster CRs live) and re-create the same PodChaos resource that was previously working

It seems to be the same problem as #4208: after you delete the controller pod, the previous RemoteCluster cannot be re-registered.
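
If that's the mechanism (registration only happening when the RemoteCluster spec changes), a restarted controller sees nothing new to reconcile; comparing generation and observedGeneration on the CR (both are 2 in the output above) shows that state:

$ kubectl get remotecluster cluster-1 -o jsonpath='{.metadata.generation} {.status.observedGeneration}{"\n"}'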

@timflannagan
Author

@nioshield Awesome - thanks for the link! Okay, sounds like this is a known issue that folks have seen before then.
