
Frequent Pod Restarts Due to Leader Election Loss in Chaos Controller Manager #4351

Open
see-quick opened this issue Feb 20, 2024 · 3 comments

@see-quick

Bug Report

Description

We are experiencing an issue where the Chaos Controller Manager pods are terminating frequently. The logs indicate that the problem is due to a "leader election lost" error. This issue leads to instability in our Chaos Mesh environment, affecting our chaos experiments.

Error Message

E0220 11:41:16.310098       1 leaderelection.go:367] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/chaos-mesh/leases/chaos-mesh": context deadline exceeded
I0220 11:41:16.310258       1 leaderelection.go:283] failed to renew lease chaos-mesh/chaos-mesh: timed out waiting for the condition
2024-02-20T11:41:16.310Z	ERROR	setup	chaos-controller-manager/main.go:208	unable to start manager	{"error": "leader election lost"}
main.Run
	/home/runner/work/chaos-mesh/chaos-mesh/cmd/chaos-controller-manager/main.go:208
reflect.Value.call
	/usr/local/go/src/reflect/value.go:584
reflect.Value.Call
	/usr/local/go/src/reflect/value.go:368
go.uber.org/dig.defaultInvoker
	/tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/container.go:238
go.uber.org/dig.(*Scope).Invoke
	/tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/invoke.go:108
go.uber.org/dig.(*Container).Invoke
	/tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/invoke.go:50
go.uber.org/fx.runInvoke
	/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/invoke.go:108
go.uber.org/fx.(*module).executeInvoke
	/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/module.go:246
go.uber.org/fx.(*module).executeInvokes
	/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/module.go:232
go.uber.org/fx.New
	/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/app.go:502
main.main
	/home/runner/work/chaos-mesh/chaos-mesh/cmd/chaos-controller-manager/main.go:80
runtime.main
	/usr/local/go/src/runtime/proc.go:250
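
The failing call in the log above is the renewal PUT against the coordination.k8s.io/v1 Lease chaos-mesh/chaos-mesh, which hits its context deadline before the API server answers. A rough sanity check (just a sketch; output will vary per cluster) is to time a read of the same Lease and ask the API server for its own health report:

time kubectl -n chaos-mesh get lease chaos-mesh
kubectl get --raw='/readyz?verbose'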

Pod Status

The kubectl get pod command shows multiple restarts for the Chaos Controller Manager pods, indicating frequent terminations:

NAME                                       READY   STATUS    RESTARTS        AGE
chaos-controller-manager-d776d57c9-fpnth   1/1     Running   9 (4h37m ago)   6d14h
chaos-controller-manager-d776d57c9-s8x62   1/1     Running   9 (27h ago)     6d14h
chaos-controller-manager-d776d57c9-sqrg7   1/1     Running   10 (136m ago)   6d14h
...

Expected behaviour

Chaos Controller Manager pods should maintain stability without frequent restarts due to leader election issues.

Actual behaviour

Pods under the Chaos Controller Manager are frequently restarting due to a "leader election lost" error, leading to instability in chaos experiments.

Steps to Reproduce

  1. Deploy Chaos Mesh in a Kubernetes cluster.
  2. Deploy multiple chaos experiments over a longer period; in our experience it takes at least a day or more for the problem to appear.
  3. Observe the logs and status of the Chaos Controller Manager pods over time (see the sketch after this list).
  4. Notice the frequent restarts and the associated error logs.
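
A minimal sketch for steps 3 and 4 (the grep patterns are just what we search for, not official labels or selectors) is to watch the pods and, after a restart, pull the logs of the previous container instance:

kubectl -n chaos-mesh get pods -w | grep controller
kubectl -n chaos-mesh logs <chaos-controller-manager-pod> --previous | grep -i leader

Here <chaos-controller-manager-pod> is a placeholder for one of the pod names shown above.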

Environment

Chaos Mesh version: 2.6.3
Kubernetes version: v1.27.10
Container runtime: CRI-O

@STRRL (Member) commented Mar 5, 2024

We build the leader election functionality on top of controller-runtime: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/leaderelection

And IIUC, as long as there is a live leader, the Chaos Mesh controller manager should work fine. controller-runtime runs leader election through Kubernetes objects (a ConfigMap or a Lease, depending on the configured resource lock), so you could inspect the corresponding object; it should hold the name of a live pod as the current leader.

As for when and why the leader is lost, I don't have a clear picture yet. Maybe we could dive into it deeper together.
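
As a concrete starting point (a sketch, assuming the chaos-mesh namespace and lease name shown in the error message above), you can inspect the Lease object directly; spec.holderIdentity names the pod currently holding leadership and spec.renewTime shows when it last renewed:

kubectl -n chaos-mesh get lease chaos-mesh -o yaml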

@see-quick (Author)

> As for when and why the leader is lost, I don't have a clear picture yet. Maybe we could dive into it deeper together.

Yeah sure.

oc get pod
NAME                                        READY   STATUS        RESTARTS       AGE
chaos-controller-manager-5bc74d7948-bmsdp   1/1     Running       21 (12m ago)   2d8h
chaos-controller-manager-5bc74d7948-l6bk9   1/1     Running       21 (29m ago)   2d8h
chaos-controller-manager-5bc74d7948-mt8cx   1/1     Running       19 (29m ago)   2d8h

It happens pretty frequently, so we can look at it.

@dekhtyarev

Hi!
We have the same problem:

$ kubectl -n chaos-mesh get pods | grep controller
chaos-controller-manager-7c5cd68cc9-khbpr   1/1     Running   4 (124m ago)    21h
chaos-controller-manager-7c5cd68cc9-n5qct   1/1     Running   12 (4h4m ago)   40h
chaos-controller-manager-7c5cd68cc9-w7tjn   1/1     Running   12 (64m ago)    38h

The Chaos Controller Manager logs are the same as in the first message.

@STRRL, could you plan some research into this problem?
