Add max connection limit #50413
base: master
Conversation
😊 Welcome @daimaxiaxie! This is either your first contribution to the Istio istio repo, or it's been a while since you've contributed. You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines. Thanks for contributing! Courtesy of your friendly welcome wagon.
Hi @daimaxiaxie. Thanks for your PR. I'm waiting for an istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
I don't think there is much value in limiting the total number of connections. If there are too many connections, it will only affect the speed of configuration pushes.
Also, can we use …
@@ -49,10 +49,17 @@ var (
	"pilot_info",
	"Pilot version and build information.",
)

	connectionTotal = monitoring.NewGauge(
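For context, a minimal sketch of how the full gauge declaration might read, following the two-argument pattern of pilot_info above; the metric name and help text here are illustrative assumptions, not the PR's actual values:

// Hypothetical completion of the declaration above (assumed name/description).
connectionTotal = monitoring.NewGauge(
	"pilot_grpc_connections",
	"Current number of gRPC connections to this pilot instance.",
)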
duplicate with istio/pilot/pkg/xds/monitoring.go, line 75 in 3e427fc:
xdsClients = monitoring.NewGauge(
I think their meanings are different: xdsClients represents how many xDS streams there are, while connectionTotal represents the number of gRPC sessions. What do you think?
It makes some sense; imbalanced connections could leave one pilot overloaded, costing much more memory/CPU than the others.
defer func() {
	connectionTotal.RecordInt(s.connectionCounter.Add(-1))
}()
if features.ConnectionLimit > 0 && int64(features.ConnectionLimit) < current {
Use DiscoveryServer.adsClientCount instead.
DiscoveryServer.adsClients changes in the stream handler (initConnection), so it is not synchronized with gRPC's dial. I think this will cause some problems.
Do we need such a precise limit? gRPC's dial time should be very short, and PILOT_MAX_REQUESTS_PER_SECOND can limit burst connections. This PR is just to limit the continued increase in connections, right?
This PR is just to limit the continued increase in connections, right?

Yes. There is a gap between initConnection and grpc.dial. If we use DiscoveryServer.adsClients, this feature will be invalid in extreme cases. I feel it's better to be more precise. What do you think about that? Thanks.
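To illustrate the gap being discussed: a gRPC server can observe connections at accept time, before any stream handler such as initConnection runs, via a stats.Handler. A minimal sketch of that idea follows; the connCounter type and its names are hypothetical, and this is not what the PR does:

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc/stats"
)

// connCounter is a hypothetical stats.Handler that tracks live gRPC
// connections as they begin and end, independent of any stream handler.
type connCounter struct {
	active atomic.Int64
}

func (c *connCounter) HandleConn(_ context.Context, s stats.ConnStats) {
	switch s.(type) {
	case *stats.ConnBegin:
		c.active.Add(1)
	case *stats.ConnEnd:
		c.active.Add(-1)
	}
}

func (c *connCounter) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }
func (c *connCounter) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context   { return ctx }
func (c *connCounter) HandleRPC(context.Context, stats.RPCStats)                         {}

// Wired up with: grpc.NewServer(grpc.StatsHandler(&connCounter{}))

Note that a stats.Handler can count connections but cannot reject them, which is presumably why the PR enforces the limit inside the stream handler instead.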
Too many connections lead to a heavy memory load, even OOM. @wulianglongrd This is different from a sudden increase in traffic: it sets an overall service cap for a control-plane instance.
Makes sense. Addendum: there is also an existing counter in istio/pilot/pkg/xds/monitoring.go (line 80 in 3e427fc).

No, …
	connectionTotal.RecordInt(s.connectionCounter.Add(-1))
}()
if features.ConnectionLimit > 0 && int64(features.ConnectionLimit) < current {
	return grpcstatus.Errorf(codes.ResourceExhausted, "connection limit exceeded")
This is incredibly unsafe. Unless you have a fixed number of istiod pods and clients, having hard limits on connections without autoscaling tied to this limit is a recipe for an outage where all connections are denied. It also makes a DoS attack trivial.
+1. Curious how much memory you have allocated for Istiod, how many total clients you have, and how many clients connected to a single Istiod caused it to get OOMKilled?
In our test cluster, each instance is limited to 16 CPU / 40 GB (10+ replicas, a fixed number). Each instance usually maintains more than 700 connections, and memory usage is 22-26 GB. When the cluster is thrashing (during some failures), it is easy for a single instance to reach the memory limit. Our production cluster has many more clients.

So I think: when the resources of an instance are fixed, the upper limit of clients it can serve is also fixed.

This feature (PR) is disabled by default.
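For reference, a minimal standalone sketch of the limiting pattern shown in the diff above; the limiter type and its names are hypothetical, while the PR applies the same idea inside Istiod's stream handler:

import (
	"sync/atomic"

	"google.golang.org/grpc/codes"
	grpcstatus "google.golang.org/grpc/status"
)

// limiter caps the number of concurrently active streams.
type limiter struct {
	active atomic.Int64
	limit  int64 // 0 disables the cap, matching the PR's default-off behavior
}

// acquire registers a new stream, returning ResourceExhausted past the cap.
// On success the caller must defer release().
func (l *limiter) acquire() error {
	current := l.active.Add(1)
	if l.limit > 0 && current > l.limit {
		l.active.Add(-1)
		return grpcstatus.Errorf(codes.ResourceExhausted, "connection limit exceeded")
	}
	return nil
}

// release unregisters a stream when its handler returns.
func (l *limiter) release() { l.active.Add(-1) }

Returning codes.ResourceExhausted signals a retryable condition, which presumably lets clients back off and reconnect to another, less-loaded instance.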
🧭 This issue or pull request has been automatically marked as stale because it has not had activity from an Istio team member since 2024-04-12. It will be closed on 2024-05-27 unless an Istio team member takes action. Please see this wiki page for more information. Thank you for your contributions. Created by the issue and PR lifecycle manager.
Please provide a description of this PR:
Control-plane load imbalance, caused for example by instance failure or network disconnection, can make a single instance absorb a burst of connections and become overloaded, ending in an avalanche where all instances start and then crash. #50412

Therefore, this PR adds an upper limit on connections for a single control-plane instance to prevent overload.