
Add option to configure leader election #6374

Open

usiegl00 opened this issue Apr 23, 2024 · 3 comments

Labels
kind/question Categorizes an issue as a user question.

Comments

@usiegl00

If the k8s apiserver stops responding, contour assumes it has lost the leader election and stops itself, while any other contour pods are unable to acquire the lease.
Here are the logs of the contour leader that crashes:

time="2024-04-23T03:29:58Z" level=error msg="error retrieving resource lock contour-system/contour-system-contour-contour: Get \"https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/contour-system/leases/contour-system-contour-contour\": context deadline exceeded - error from a previous attempt: EOF" caller="leaderelection.go:332" context=kubernetes error="<nil>"
time="2024-04-23T03:29:58Z" level=info msg="failed to renew lease contour-system/contour-system-contour-contour: timed out waiting for the condition" caller="leaderelection.go:285" context=kubernetes
time="2024-04-23T03:30:03Z" level=error msg="Failed to release lock: Put \"https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/contour-system/leases/contour-system-contour-contour\": EOF" caller="leaderelection.go:308" context=kubernetes error="<nil>"
time="2024-04-23T03:30:03Z" level=fatal msg="Contour server failed" error="leader election lost"

Here are the logs of the non-leaders:

time="2024-04-23T03:31:28Z" level=error msg="error retrieving resource lock contour-system/contour-system-contour-contour: Get \"https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/contour-system/leases/contour-system-contour-contour\": dial tcp 10.0.0.1:443: connect: connection refused" caller="leaderelection.go:332" context=kubernetes error="<nil>"
time="2024-04-23T03:31:32Z" level=error msg="error retrieving resource lock contour-system/contour-system-contour-contour: Get \"https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/contour-system/leases/contour-system-contour-contour\": dial tcp 10.0.0.1:443: connect: connection refused" caller="leaderelection.go:332" context=kubernetes error="<nil>"
time="2024-04-23T03:31:34Z" level=error msg="error retrieving resource lock contour-system/contour-system-contour-contour: Get \"https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/contour-system/leases/contour-system-contour-contour\": dial tcp 10.0.0.1:443: connect: connection refused" caller="leaderelection.go:332" context=kubernetes error="<nil>"
time="2024-04-23T03:31:37Z" level=error msg="error retrieving resource lock contour-system/contour-system-contour-contour: Get \"https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/contour-system/leases/contour-system-contour-contour\": dial tcp 10.0.0.1:443: connect: connection refused" caller="leaderelection.go:332" context=kubernetes error="<nil>"
time="2024-04-23T03:31:41Z" level=info msg="Stopping and waiting for non leader election runnables" caller="internal.go:516" context=kubernetes
time="2024-04-23T03:31:41Z" level=info msg="stopped event handler" context=contourEventHandler
time="2024-04-23T03:31:41Z" level=error msg="terminated HTTP server with error" context=metricsvc error="http: Server closed"
time="2024-04-23T03:31:41Z" level=error msg="error received after stop sequence was engaged" caller="internal.go:490" context=kubernetes error="http: Server closed"
time="2024-04-23T03:31:41Z" level=error msg="terminated HTTP server with error" context=debugsvc error="http: Server closed"
time="2024-04-23T03:31:41Z" level=error msg="error received after stop sequence was engaged" caller="internal.go:490" context=kubernetes error="http: Server closed"
time="2024-04-23T03:31:41Z" level=info msg="stopped xDS server" address="0.0.0.0:8001" context=xds
time="2024-04-23T03:31:41Z" level=info msg="Stopping and waiting for leader election runnables" caller="internal.go:520" context=kubernetes
time="2024-04-23T03:31:41Z" level=info msg="started status update handler" context=StatusUpdateHandler
time="2024-04-23T03:31:41Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address=
time="2024-04-23T03:31:41Z" level=info msg="stopped status update handler" context=StatusUpdateHandler
time="2024-04-23T03:31:41Z" level=info msg="Stopping and waiting for caches" caller="internal.go:526" context=kubernetes
time="2024-04-23T03:31:41Z" level=info msg="Stopping and waiting for webhooks" caller="internal.go:530" context=kubernetes
time="2024-04-23T03:31:41Z" level=info msg="Stopping and waiting for HTTP servers" caller="internal.go:533" context=kubernetes
time="2024-04-23T03:31:41Z" level=info msg="Wait completed, proceeding to shutdown the manager" caller="internal.go:537" context=kubernetes

Eventually, all contour pods enter a CrashLoopBackOff and envoy is no longer able to serve traffic.
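
For reference, the lease object named in these logs can be inspected directly against the apiserver (namespace and name are taken from the log lines above; adjust them if your install differs):

kubectl get lease contour-system-contour-contour -n contour-system -o yaml

While the apiserver is unreachable this request fails the same way the renewal attempts do, which is what ends in the fatal "leader election lost" exit.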

Having the option to disable leader election would help decouple contour from transient failures of the control plane and improve reliability.

@usiegl00 usiegl00 added kind/feature Categorizes issue or PR as related to a new feature. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Apr 23, 2024

Hey @usiegl00! Thanks for opening your first issue. We appreciate your contribution and welcome you to our community! We are glad to have you here and to have your input on Contour. You can also join us on our mailing list and in our channel in the Kubernetes Slack Workspace.

@sunjayBhatia
Member

I would say this is actually intended behavior. The leader facility is used to ensure that status updates to resources are only ever written by a single contour instance at a time, so there are no conflicts in reported status. All contour instances continue to serve identical xDS configuration to the connected Envoys regardless of their individual leader status.

At your own risk, take a look at the existing --disable-leader-election flag: https://projectcontour.io/docs/1.28/configuration/#serve-flags
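
A minimal sketch of wiring that flag in, assuming your contour pods run from a Deployment (the container name and existing args below are placeholders for whatever your manifest or chart already sets):

      containers:
      - name: contour            # name/image as in your existing manifest
        args:
        - serve
        # ...keep the rest of your existing serve flags here, and add:
        - --disable-leader-election

As noted, this is at your own risk: the leader lock is what keeps status writes single-writer.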

@sunjayBhatia sunjayBhatia added kind/question Categorizes an issue as a user question. and removed kind/feature Categorizes issue or PR as related to a new feature. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels May 1, 2024
@usiegl00
Author

usiegl00 commented May 2, 2024

Thank you for helping me with this issue.

Setting --disable-leader-election did not prevent contour from crashing when it was disconnected from the apiserver, and envoy subsequently stopped serving as well.

Here are the new contour logs:

2024-05-02T01:46:58.231619836Z stderr F time="2024-05-02T01:46:58Z" level=info msg="started HTTP server" address="127.0.0.1:6060" context=debugsvc
2024-05-02T01:46:58.232177747Z stderr F time="2024-05-02T01:46:58Z" level=info msg="started HTTP server" address="0.0.0.0:8000" context=metricsvc
2024-05-02T01:46:58.232211774Z stderr F time="2024-05-02T01:46:58Z" level=info msg="waiting for the initial dag to be built" context=xds
2024-05-02T01:46:58.232225091Z stderr F time="2024-05-02T01:46:58Z" level=info msg="started status update handler" context=StatusUpdateHandler
2024-05-02T01:46:58.232373753Z stderr F time="2024-05-02T01:46:58Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address=192.168.0.1
2024-05-02T01:46:58.333708921Z stderr F time="2024-05-02T01:46:58Z" level=info msg="performing delayed update" context=contourEventHandler last_update=102.148509ms outstanding=13
2024-05-02T01:46:58.434050577Z stderr F time="2024-05-02T01:46:58Z" level=info msg="the initial dag is built" context=xds
2024-05-02T01:46:58.435308924Z stderr F time="2024-05-02T01:46:58Z" level=info msg="started xDS server type: \"contour\"" address="0.0.0.0:8001" context=xds
2024-05-02T01:56:27.112494395Z stderr F time="2024-05-02T01:56:27Z" level=info msg="Stopping and waiting for non leader election runnables" caller="internal.go:516" context=kubernetes
2024-05-02T01:56:27.113164221Z stderr F time="2024-05-02T01:56:27Z" level=info msg="stopped event handler" context=contourEventHandler
2024-05-02T01:56:27.113273139Z stderr F time="2024-05-02T01:56:27Z" level=error msg="terminated HTTP server with error" context=metricsvc error="http: Server closed"
2024-05-02T01:56:27.113391957Z stderr F time="2024-05-02T01:56:27Z" level=error msg="error received after stop sequence was engaged" caller="internal.go:490" context=kubernetes error="http: Server closed"
2024-05-02T01:56:27.113495485Z stderr F time="2024-05-02T01:56:27Z" level=error msg="terminated HTTP server with error" context=debugsvc error="http: Server closed"
2024-05-02T01:56:27.113534012Z stderr F time="2024-05-02T01:56:27Z" level=error msg="error received after stop sequence was engaged" caller="internal.go:490" context=kubernetes error="http: Server closed"
2024-05-02T01:56:27.113785216Z stderr F time="2024-05-02T01:56:27Z" level=info msg="stopped xDS server" address="0.0.0.0:8001" context=xds
2024-05-02T01:56:27.113842337Z stderr F time="2024-05-02T01:56:27Z" level=info msg="Stopping and waiting for leader election runnables" caller="internal.go:520" context=kubernetes
2024-05-02T01:56:27.114009706Z stderr F time="2024-05-02T01:56:27Z" level=info msg="stopped status update handler" context=StatusUpdateHandler
2024-05-02T01:56:27.114057653Z stderr F time="2024-05-02T01:56:27Z" level=info msg="Stopping and waiting for caches" caller="internal.go:526" context=kubernetes
2024-05-02T01:56:27.11482561Z stderr F time="2024-05-02T01:56:27Z" level=info msg="Stopping and waiting for webhooks" caller="internal.go:530" context=kubernetes
2024-05-02T01:56:27.114853454Z stderr F time="2024-05-02T01:56:27Z" level=info msg="Stopping and waiting for HTTP servers" caller="internal.go:533" context=kubernetes
2024-05-02T01:56:27.114863261Z stderr F time="2024-05-02T01:56:27Z" level=info msg="Wait completed, proceeding to shutdown the manager" caller="internal.go:537" context=kubernetes

And these envoy messages appear when it loses its connection to contour and stops serving requests:

2024-05-02T02:12:10.529184825Z stderr F [2024-05-02 02:12:10.529][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:193] StreamRuntime gRPC config stream to contour closed since 942s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection timeout
2024-05-02T02:12:10.529209759Z stderr F [2024-05-02 02:12:10.529][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:193] StreamSecrets gRPC config stream to contour closed since 942s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection timeout
2024-05-02T02:12:14.674745043Z stderr F [2024-05-02 02:12:14.674][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:193] StreamRoutes gRPC config stream to contour closed since 947s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111
2024-05-02T02:12:16.766855046Z stderr F [2024-05-02 02:12:16.766][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:193] StreamClusters gRPC config stream to contour closed since 949s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111
2024-05-02T02:12:18.159171337Z stderr F [2024-05-02 02:12:18.158][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:193] StreamListeners gRPC config stream to contour closed since 950s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111

Is there any flag for envoy to keep serving existing traffic upon losing connection to contour?
