
Frequent Pod Restarts Due to Leader Election Loss in Chaos Controller Manager #4351

Open
see-quick opened this issue Feb 20, 2024 · 3 comments

@see-quick

Bug Report

Description

We are experiencing an issue where the Chaos Controller Manager pods are terminating frequently. The logs indicate that the problem is due to a "leader election lost" error. This issue leads to instability in our Chaos Mesh environment, affecting our chaos experiments.

Error Message

E0220 11:41:16.310098       1 leaderelection.go:367] Failed to update lock: Put "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/chaos-mesh/leases/chaos-mesh": context deadline exceeded
I0220 11:41:16.310258       1 leaderelection.go:283] failed to renew lease chaos-mesh/chaos-mesh: timed out waiting for the condition
2024-02-20T11:41:16.310Z	ERROR	setup	chaos-controller-manager/main.go:208	unable to start manager	{"error": "leader election lost"}
main.Run
	/home/runner/work/chaos-mesh/chaos-mesh/cmd/chaos-controller-manager/main.go:208
reflect.Value.call
	/usr/local/go/src/reflect/value.go:584
reflect.Value.Call
	/usr/local/go/src/reflect/value.go:368
go.uber.org/dig.defaultInvoker
	/tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/container.go:238
go.uber.org/dig.(*Scope).Invoke
	/tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/invoke.go:108
go.uber.org/dig.(*Container).Invoke
	/tmp/go/pkg/mod/go.uber.org/dig@v1.16.1/invoke.go:50
go.uber.org/fx.runInvoke
	/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/invoke.go:108
go.uber.org/fx.(*module).executeInvoke
	/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/module.go:246
go.uber.org/fx.(*module).executeInvokes
	/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/module.go:232
go.uber.org/fx.New
	/tmp/go/pkg/mod/go.uber.org/fx@v1.19.2/app.go:502
main.main
	/home/runner/work/chaos-mesh/chaos-mesh/cmd/chaos-controller-manager/main.go:80
runtime.main
	/usr/local/go/src/runtime/proc.go:250
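
The failing call in the log above is the renewal PUT against the coordination.k8s.io/v1 Lease chaos-mesh/chaos-mesh, which hits its context deadline before the API server answers. A rough sanity check (just a sketch; output will vary per cluster) is to time a read of the same Lease and ask the API server for its own health report:

time kubectl -n chaos-mesh get lease chaos-mesh
kubectl get --raw='/readyz?verbose'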

Pod Status

The kubectl get pod command shows multiple restarts for the Chaos Controller Manager pods, indicating frequent terminations:

NAME                                       READY   STATUS    RESTARTS        AGE
chaos-controller-manager-d776d57c9-fpnth   1/1     Running   9 (4h37m ago)   6d14h
chaos-controller-manager-d776d57c9-s8x62   1/1     Running   9 (27h ago)     6d14h
chaos-controller-manager-d776d57c9-sqrg7   1/1     Running   10 (136m ago)   6d14h
...

Expected behaviour

Chaos Controller Manager pods should maintain stability without frequent restarts due to leader election issues.

Actual behaviour

Pods under the Chaos Controller Manager are frequently restarting due to a "leader election lost" error, leading to instability in chaos experiments.

Steps to Reproduce

  1. Deploy Chaos Mesh in a Kubernetes cluster.
  2. Deploy multiple chaos experiments over a longer period; in our experience it takes at least a day or more for the problem to appear.
  3. Observe the logs and status of the Chaos Controller Manager pods over time (see the sketch after this list).
  4. Notice the frequent restarts and the associated error logs.
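
A minimal sketch for steps 3 and 4 (the grep patterns are just what we search for, not official labels or selectors) is to watch the pods and, after a restart, pull the logs of the previous container instance:

kubectl -n chaos-mesh get pods -w | grep controller
kubectl -n chaos-mesh logs <chaos-controller-manager-pod> --previous | grep -i leader

Here <chaos-controller-manager-pod> is a placeholder for one of the pod names shown above.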

Environment

Chaos Mesh version: 2.6.3
Kubernetes version: v1.27.10
Container runtime: CRI-O

@STRRL (Member) commented Mar 5, 2024

We build the leader election functionality on top of controller-runtime: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/leaderelection

And IIUC, as long as there is a live leader, the Chaos Mesh controller manager should work fine. controller-runtime runs leader election through Kubernetes objects (a ConfigMap or a Lease, depending on the configured resource lock), so you could inspect the corresponding object; it should hold the name of a live pod as the current leader.

As for when and why the leader is lost, I don't have a clear picture yet. Maybe we could dive into it deeper together.
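
As a concrete starting point (a sketch, assuming the chaos-mesh namespace and lease name shown in the error message above), you can inspect the Lease object directly; spec.holderIdentity names the pod currently holding leadership and spec.renewTime shows when it last renewed:

kubectl -n chaos-mesh get lease chaos-mesh -o yaml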

@see-quick (Author)

> As for when and why the leader is lost, I don't have a clear picture yet. Maybe we could dive into it deeper together.

Yeah sure.

oc get pod
NAME                                        READY   STATUS        RESTARTS       AGE
chaos-controller-manager-5bc74d7948-bmsdp   1/1     Running       21 (12m ago)   2d8h
chaos-controller-manager-5bc74d7948-l6bk9   1/1     Running       21 (29m ago)   2d8h
chaos-controller-manager-5bc74d7948-mt8cx   1/1     Running       19 (29m ago)   2d8h

It happens pretty frequently, so we can look at it.

@dekhtyarev

Hi!
We have the same problem:

$ kubectl -n chaos-mesh get pods | grep controller
chaos-controller-manager-7c5cd68cc9-khbpr   1/1     Running   4 (124m ago)    21h
chaos-controller-manager-7c5cd68cc9-n5qct   1/1     Running   12 (4h4m ago)   40h
chaos-controller-manager-7c5cd68cc9-w7tjn   1/1     Running   12 (64m ago)    38h

The Chaos Controller Manager logs are the same as in the first message.

@STRRL, could you plan some research into this problem?
