Add max connection limit #50413

Open
daimaxiaxie wants to merge 19 commits into master

Conversation

@daimaxiaxie commented Apr 12, 2024

Please provide a description of this PR:

Control-plane load imbalance (caused, for example, by instance failure or network disconnection) can make a single instance take a burst of connections and become overloaded, ending in an avalanche where all instances restart and then crash. #50412

This PR therefore adds an upper limit on the number of connections a single control-plane instance will accept, to prevent overload.

The env var PILOT_MAX_CONNECTION limits the number of incoming ADS connections. If set to 0 or unset, the feature is disabled.
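A minimal sketch of how such a cap could be wired, assembled from the diff fragments quoted later in this thread. PILOT_MAX_CONNECTION is the env var described above; the interceptor wiring and the remaining names are illustrative, not the exact code in this PR:

```go
package main

import (
	"os"
	"strconv"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	grpcstatus "google.golang.org/grpc/status"
)

var (
	// 0 or unset disables the limit, matching the described semantics.
	connectionLimit, _ = strconv.ParseInt(os.Getenv("PILOT_MAX_CONNECTION"), 10, 64)
	connectionCounter  atomic.Int64
)

// limitConnections rejects new ADS streams with ResourceExhausted once the
// cap is reached, and releases the slot when the stream ends.
func limitConnections(srv any, ss grpc.ServerStream,
	info *grpc.StreamServerInfo, handler grpc.StreamHandler,
) error {
	current := connectionCounter.Add(1)
	defer connectionCounter.Add(-1)
	if connectionLimit > 0 && connectionLimit < current {
		return grpcstatus.Errorf(codes.ResourceExhausted, "connection limit exceeded")
	}
	return handler(srv, ss)
}

func main() {
	srv := grpc.NewServer(grpc.ChainStreamInterceptor(limitConnections))
	_ = srv // register the XDS services and call Serve() in real code
}
```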

@istio-policy-bot

😊 Welcome @daimaxiaxie! This is either your first contribution to the istio/istio repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

@istio-testing added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) Apr 12, 2024
@istio-testing (Collaborator)

Hi @daimaxiaxie. Thanks for your PR.

I'm waiting for an Istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zirain (Member) commented Apr 12, 2024

/ok-to-test

@istio-testing added the ok-to-test label (Set this label to allow normal testing to take place for a PR not submitted by an Istio org member.) and removed the needs-ok-to-test label Apr 12, 2024
@wulianglongrd (Member)

I don't think there is much value in limiting the total number of connections. If there are too many connections, it will only affect the speed of configuration push.

If you just want to prevent the impact of a sudden increase in traffic, you can use PILOT_MAX_REQUESTS_PER_SECOND to limit the request rate.
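For contrast with the hard cap this PR proposes, here is a minimal sketch of what a request-rate limit does, using golang.org/x/time/rate (the limiter value is illustrative, and this is not Istio's actual PILOT_MAX_REQUESTS_PER_SECOND wiring): it smooths bursts but never bounds the steady-state connection total.

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
)

func main() {
	// Admit at most ~25 new connections per second (illustrative value).
	limiter := rate.NewLimiter(rate.Limit(25), 1)

	for i := 0; i < 1000; i++ {
		// Each incoming connection waits for a token: bursts are smoothed,
		// but all 1000 connections are eventually admitted, so the total
		// held by one instance is never bounded.
		_ = limiter.Wait(context.Background())
	}
}
```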

@wulianglongrd (Member)

Also, can we use DiscoveryServer.adsClients to count the number of connections?

```diff
@@ -49,10 +49,17 @@ var (
		"pilot_info",
		"Pilot version and build information.",
	)

	connectionTotal = monitoring.NewGauge(
```
Member:

duplicate with

```go
xdsClients = monitoring.NewGauge(
```

Author:

I think their meanings are different: xdsClients represents the number of XDS streams, while connectionTotal represents the number of gRPC sessions.

What do you think?
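A sketch of that distinction, following the gauge shape visible in this diff (the connectionTotal name and description are this PR's addition as quoted above; treat the exact strings as illustrative):

```go
package xds

import "istio.io/istio/pkg/monitoring"

var (
	// One per ADS stream: updated inside the stream handler once the
	// stream is established.
	xdsClients = monitoring.NewGauge(
		"pilot_xds",
		"Number of endpoints connected to this pilot using XDS.",
	)

	// One per gRPC session: updated at connection level, so it also counts
	// sessions that have not (yet) opened an ADS stream.
	connectionTotal = monitoring.NewGauge(
		"pilot_total_connections",
		"Total number of gRPC sessions connected to this pilot.",
	)
)
```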

@hzxuzhonghu (Member)

It makes some sense: imbalanced connections could leave one pilot overloaded, costing much more memory/CPU than the others.

```go
defer func() {
	connectionTotal.RecordInt(s.connectionCounter.Add(-1))
}()
if features.ConnectionLimit > 0 && int64(features.ConnectionLimit) < current {
```
Member:

Use DiscoveryServer.adsClientCount for that.

Author:

DiscoveryServer.adsClients changes in the stream handler (initConnection), so it is not synchronized with gRPC's dial. I think this will cause some problems.

Member:

Do we need such a precise limit? gRPC's dial time should be very short, and PILOT_MAX_REQUESTS_PER_SECOND can limit burst connections. This PR is just meant to limit the continued increase in connections, right?

Author:

> This PR is just to limit the continued increase in connections, right?

Yes. There is a gap between initConnection and gRPC's dial; if we use DiscoveryServer.adsClients, this feature can be bypassed in extreme cases. I feel it's better to be more precise. What do you think about that? Thanks.
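One known way to sidestep the gap discussed here is to enforce the cap at accept time, before gRPC ever sees the connection. A minimal sketch using golang.org/x/net/netutil (not what this PR does; the port and limit are illustrative). The trade-off: LimitListener blocks excess Accepts until a slot frees up rather than rejecting them with a status code.

```go
package main

import (
	"net"

	"golang.org/x/net/netutil"
	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":15010") // illustrative XDS port
	if err != nil {
		panic(err)
	}
	// Cap concurrent connections at 700; Accept blocks beyond that, so
	// there is no race window between dial and the stream handler.
	limited := netutil.LimitListener(lis, 700)

	srv := grpc.NewServer() // register the ADS services here in real code
	_ = srv.Serve(limited)
}
```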

@daimaxiaxie (Author)

> I don't think there is much value in limiting the total number of connections. If there are too many connections, it will only affect the speed of configuration push.

Too many connections lead to a heavy memory load, even OOM. @wulianglongrd

And this is different from limiting a sudden increase in traffic: it sets an overall service cap for a control-plane instance.

@wulianglongrd (Member)

Makes sense. Addendum: there is also an existing counter here, which backs the pilot_xds metric.

```go
xdsClientTracker = make(map[string]float64)
```

@daimaxiaxie (Author)

> Also, can we use DiscoveryServer.adsClients to count the number of connections?

No. DiscoveryServer.adsClients changes in the stream handler; it is not synchronized with gRPC's dial time.

```go
	connectionTotal.RecordInt(s.connectionCounter.Add(-1))
}()
if features.ConnectionLimit > 0 && int64(features.ConnectionLimit) < current {
	return grpcstatus.Errorf(codes.ResourceExhausted, "connection limit exceeded")
```
Member:

This is incredibly unsafe. Unless you have a fixed number of istiod pods and clients, having hard limits on connections without autoscaling tied to this limit is a recipe for an outage where all connections are denied. It also makes a DoS attack trivial.

Contributor:

+1. Curious how much memory you have allocated for Istiod, how many total clients you have, and how many clients connected to a single Istiod caused it to get OOMKilled?

Author:

In our test cluster, each instance is limited to 16 CPU / 40 GB (10+ replicas, a fixed number). Each instance usually maintains more than 700 connections, and memory usage is 22-26 GB. When the cluster is thrashing (during some failures), a single instance can easily reach the memory limit. Our production cluster has many more clients.

So I think: when the resources of an instance are fixed, the upper limit of clients it can serve is also fixed.
This feature (PR) is disabled by default.
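Back-of-envelope from the figures above (illustrative arithmetic, not from the PR): 24 GB over 700 connections is roughly 34 MB per connection, so a 40 GB instance saturates around 40 GB / 34 MB ≈ 1,200 connections; a PILOT_MAX_CONNECTION value somewhat below that would leave headroom.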

@istio-policy-bot

🧭 This issue or pull request has been automatically marked as stale because it has not had activity from an Istio team member since 2024-04-12. It will be closed on 2024-05-27 unless an Istio team member takes action. Please see this wiki page for more information. Thank you for your contributions.

Created by the issue and PR lifecycle manager.

@istio-policy-bot added the lifecycle/stale label (Indicates a PR or issue hasn't been manipulated by an Istio team member for a while) May 13, 2024
Labels
- area/user experience
- lifecycle/stale: Indicates a PR or issue hasn't been manipulated by an Istio team member for a while
- ok-to-test: Set this label to allow normal testing to take place for a PR not submitted by an Istio org member.
- size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

8 participants