Bug Report
Description
We are experiencing an issue where the Chaos Controller Manager pods are terminating frequently. The logs indicate that the problem is due to a "leader election lost" error. This issue leads to instability in our Chaos Mesh environment, affecting our chaos experiments.
Error Message
E0220 11:41:16.310098 1 leaderelection.go:367] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/chaos-mesh/leases/chaos-mesh": context deadline exceeded
I0220 11:41:16.310258 1 leaderelection.go:283] failed to renew lease chaos-mesh/chaos-mesh: timed out waiting for the condition
2024-02-20T11:41:16.310Z ERROR setup chaos-controller-manager/main.go:208 unable to start manager {"error": "leader election lost"}
main.Run
/home/runner/work/chaos-mesh/chaos-mesh/cmd/chaos-controller-manager/main.go:208
reflect.Value.call
/usr/local/go/src/reflect/value.go:584
reflect.Value.Call
/usr/local/go/src/reflect/value.go:368
go.uber.org/dig.defaultInvoker
/tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/container.go:238
go.uber.org/dig.(*Scope).Invoke
/tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/invoke.go:108
go.uber.org/dig.(*Container).Invoke
/tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/invoke.go:50
go.uber.org/fx.runInvoke
/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/invoke.go:108
go.uber.org/fx.(*module).executeInvoke
/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/module.go:246
go.uber.org/fx.(*module).executeInvokes
/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/module.go:232
go.uber.org/fx.New
/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/app.go:502
main.main
/home/runner/work/chaos-mesh/chaos-mesh/cmd/chaos-controller-manager/main.go:80
runtime.main
/usr/local/go/src/runtime/proc.go:250
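For context on where these two log lines come from: controller-runtime delegates leader election to client-go, and the whole mechanism is driven by three timing parameters. The following is a minimal, self-contained sketch of that loop using client-go's default timings (15s/10s/2s), not Chaos Mesh's actual code; the lease name and namespace come from the log above, while the POD_NAME identity source is a placeholder.

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The Lease object from the error log:
	// .../namespaces/chaos-mesh/leases/chaos-mesh
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Namespace: "chaos-mesh", Name: "chaos-mesh"},
		Client:    client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: os.Getenv("POD_NAME"), // placeholder identity source
		},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// How long a lease is valid before another candidate may take it over.
		LeaseDuration: 15 * time.Second,
		// Renewal API calls run under a context bounded by this deadline;
		// "Failed to update lock ... context deadline exceeded" means the API
		// server did not answer in time, and "failed to renew lease" follows
		// once the deadline is exhausted.
		RenewDeadline: 10 * time.Second,
		// How often acquisition/renewal is retried.
		RetryPeriod: 2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Controllers start here once leadership is acquired.
			},
			OnStoppedLeading: func() {
				// Reached when renewal fails; this is where a manager exits
				// with "leader election lost".
			},
		},
	})
}

So a "leader election lost" restart means renewal calls against the API server kept failing for a full RenewDeadline window, which usually points at API server latency or overload rather than at the controller itself.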
Pod Status
The kubectl get pod command shows multiple restarts for the Chaos Controller Manager pods, indicating frequent terminations:
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-d776d57c9-fpnth 1/1 Running 9 (4h37m ago) 6d14h
chaos-controller-manager-d776d57c9-s8x62 1/1 Running 9 (27h ago) 6d14h
chaos-controller-manager-d776d57c9-sqrg7 1/1 Running 10 (136m ago) 6d14h
...
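If it helps to track this over time, the same restart counts can be pulled programmatically; a small sketch, assuming the pods carry the app.kubernetes.io/component=controller-manager label (verify against your install with kubectl get pods --show-labels):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The label selector is an assumption about the Helm chart's labels.
	pods, err := client.CoreV1().Pods("chaos-mesh").List(context.Background(),
		metav1.ListOptions{LabelSelector: "app.kubernetes.io/component=controller-manager"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		for _, cs := range p.Status.ContainerStatuses {
			fmt.Printf("%s restarts=%d\n", p.Name, cs.RestartCount)
		}
	}
}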
Expected behaviour
Chaos Controller Manager pods should maintain stability without frequent restarts due to leader election issues.
Actual behaviour
Pods under the Chaos Controller Manager are frequently restarting due to a "leader election lost" error, leading to instability in chaos experiments.
Steps to Reproduce
Deploy Chaos Mesh in a Kubernetes cluster.
Deploy multiple chaos experiments and let them run; in our experience it takes at least a day or more for the problem to surface (one way to create an experiment is sketched after these steps).
Observe the logs and status of the Chaos Controller Manager pods over time.
Notice the frequent restarts and the associated error logs.
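For the second step, here is one way to create a simple experiment from Go, using the dynamic client so no Chaos Mesh packages are needed. The PodChaos fields follow the public chaos-mesh.org/v1alpha1 API, while the experiment name and target namespace are placeholders:

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)

	podChaos := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "chaos-mesh.org/v1alpha1",
		"kind":       "PodChaos",
		"metadata": map[string]interface{}{
			"name":      "pod-kill-example", // placeholder name
			"namespace": "default",          // placeholder namespace
		},
		"spec": map[string]interface{}{
			"action": "pod-kill",
			"mode":   "one",
			"selector": map[string]interface{}{
				// Placeholder target; point this at a namespace you can disrupt.
				"namespaces": []interface{}{"default"},
			},
		},
	}}

	gvr := schema.GroupVersionResource{
		Group: "chaos-mesh.org", Version: "v1alpha1", Resource: "podchaos",
	}
	_, err = dyn.Resource(gvr).Namespace("default").Create(
		context.Background(), podChaos, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}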
Environment
Chaos Mesh version: 2.6.3
Kubernetes version: v1.27.10
Container runtime: CRI-O

IIUC, as long as there is a live leader, the Chaos Mesh controller manager should work fine. controller-runtime performs leader election on top of Kubernetes objects (a ConfigMap or, as in your logs, a Lease), so you could inspect the corresponding object: it should hold the name of a pod that is still alive. As for when and why the leader is lost, I don't have a clear picture yet; maybe we can dig into it deeper together.
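To make that inspection concrete: the logs show the lock is the Lease chaos-mesh/chaos-mesh, so the current holder can be read with kubectl -n chaos-mesh get lease chaos-mesh -o yaml, or programmatically, as in this minimal sketch:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease name and namespace come from the error log:
	// .../namespaces/chaos-mesh/leases/chaos-mesh
	lease, err := client.CoordinationV1().Leases("chaos-mesh").Get(
		context.Background(), "chaos-mesh", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	if lease.Spec.HolderIdentity != nil {
		fmt.Println("current leader:", *lease.Spec.HolderIdentity)
	}
	if lease.Spec.RenewTime != nil {
		fmt.Println("last renewed:", lease.Spec.RenewTime.Time)
	}
}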