Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Outlier detection support in DestinationRule for grpc proxyless #3000

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

aditya-32
Copy link

@aditya-32 aditya-32 commented Nov 24, 2023

Closes #3001

@aditya-32 aditya-32 requested a review from a team as a code owner November 24, 2023 15:30
@istio-policy-bot
Copy link

😊 Welcome @aditya-32! This is either your first contribution to the Istio api repo, or it's been
a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines
by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

Copy link

linux-foundation-easycla bot commented Nov 24, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-ok-to-test labels Nov 24, 2023
@istio-testing
Copy link
Collaborator

Hi @aditya-32. Thanks for your PR.

I'm waiting for a istio member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aditya-32
Copy link
Author

/easycla

@aditya-32 aditya-32 force-pushed the outlierDetectionSupportForGrpcProxyless branch 2 times, most recently from b9abfe4 to 73bbc00 Compare November 26, 2023 05:12
@aditya-32 aditya-32 force-pushed the outlierDetectionSupportForGrpcProxyless branch from 73bbc00 to 0b22a81 Compare November 26, 2023 05:13
Comment on lines 952 to 956
FailurePercentageEjection failure_percentage_ejection = 10;

// SuccessRateEjection Algorithm for Grpc-xds proxyless for deciding
// if the host is an outlier or not.
SuccessRateEjection success_rate_ejection = 11;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is only for grpc not sidecar.
How does it collaborate with the existing configs like Consecutive_5Xx? Does it have a limitation only some of these can be set.

Copy link
Author

@aditya-32 aditya-32 Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently istio provides outlierDetection which is supported by envoy proxy only. But these FailurePercent and SuccessRate are not supported for both grpc-xds and envoy.

Does it have a limitation only some of these can be set?
regarding this , in order to trigger outlierDetection we need to set properties if either of these algorithm

reference for previous discussion:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does envoy not support the same ones?

Copy link
Author

@aditya-32 aditya-32 Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Envoy has support envoy outlier detection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok but they do apply to envoy as well, right? if so, we shouldn't say its "for Grpc-xds proxyless", as its not. Its for any proxy type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot get the answer, can you reply positively

Copy link
Author

@aditya-32 aditya-32 Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hzxuzhonghu envoy proto for outlierDetection has those parameters for FailurePercent and SuccessRate algorithm, but currently we don't have a way to configure it from istio. Only way to configure those for application using envoy-proxy is through EnvoyFilter. But our use case is for grpc proxyless application(no envoy sidecar proxy) , and istio does not support a way to configure those Algorithm parameters. Also currently there is no way to configure outlierDetection for grpc-proxyless application.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howardjohn @hzxuzhonghu let me answer these questions on @aditya-32 behalf.

Q. Does envoy support these parameters?
Yes

Q. How does it collaborate with the existing configs like Consecutive_5XX?

From Success Rate and Failure Percentage

There is no mention of how they are able to co-relate with the consecutive_5xx, only mention are of the configurations like split_external_local_origin_errors and few of the configurations.

In this scenario, I believe that it is for the envoy / grpc to handle how they would want to process the configuration and istio should help with translation and leave the client and the user to understand what combination works, because there will always be few internal implementations which are not documented.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my suggestion is to make them explicit, what does each option control and how will they collaborate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hzxuzhonghu can you please re-review, made the changes

// interval duration above) to perform failure percentage-based ejection for this address. If the
// volume is lower than this setting, failure percentage-based ejection will not be performed for
// this host.
google.protobuf.UInt32Value request_volume = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it is a sliding window or not

Copy link
Author

@aditya-32 aditya-32 Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the request for the host should be atleast request_volume count within the interval period in order to be eligible for outlierDetection

Comment on lines 952 to 956
FailurePercentageEjection failure_percentage_ejection = 10;

// SuccessRateEjection Algorithm for Grpc-xds proxyless for deciding
// if the host is an outlier or not.
SuccessRateEjection success_rate_ejection = 11;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does envoy not support the same ones?

// The % chance that an address will be actually ejected when an outlier status is detected through
// failure percentage statistics. This setting can be used to disable ejection or to ramp it up
// slowly.
google.protobuf.UInt32Value enforcement_percentage = 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? I am struggling to see the use case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enforcement_percentage this is like what is the probability for a host to be removed from LB pool when detected as outlier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what it is, but I am not clear why we need to offer it or why a user would want to set it. "It exists in the Envoy API" is not a reason why, either -- Envoy's API has different design principals than Istio

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howardjohn I think I understand the use-case here, lets say that there are 500 servers and 50 of them started showing ejection behaviour. Then the outlier detection algorithm will eject all these 50 servers and distribute its load on rest of the 450 servers, this might be a big traffic shift for other servers to handle.

On the contrary, having this setting to lets say 50 will eliminate 25 and then 12 and so on, such that the traffic is ramped up in corresponding intervals.

Although none of the documentations (both enovy / grpc-xds) clearly mentions in which scenario this is applicable. But from what I see it may help prevent issues with temporary spikes and help smoothen out the curve.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well - the documentation on this PR is also pretty unclear. My concern is that reading either doc - I wouldn't know how to use it, and most likely people will use it wrong or not take advantage if it is actually needed - on an already over-complex system.

I understand there is a need - and envoy and proxyless may support this - but if we go down the path of supporting each and every envoy field we may be better off with just adopting Envoy API ( which we already do in EnvoyFilter - with well known problems ).

// The minimum number of addresses in order to perform failure percentage-based ejection.
// If the total number of addresses is less than this value, failure percentage-based
// ejection will not be performed.
google.protobuf.UInt32Value minimum_hosts = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this in the context of Istio? How do we expect users to set it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howardjohn for this and the remaining fields, I am a little confused when we say in the context of istio, I would like to better understand how do we decide which properties do we need to translate to data plane like envoy / grpc-xds, because istio is itself might not be doing any action here excpet for providing the configurations and if a user would like to configure these things for their setup, how would one ensure that they are made available to them via the destination rule otherwise?

I would say these are these parameters are more or less tuning parameters and may not be left as well left to default values (which is 5), because they would otherwise not work for the cases where the service is having less than 5 hosts. They are better tuned as per the traffic patterns and the use cases of the user.

I would have given better answer if I would have used it on staging / production setup and tried and tested it, but to test it we would as well be needing these features in istio.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally users try it with EnvoyFilter - adding it to a beta API just to test it out and figure out that only 1 or 2 users understand how to use it - but we have to maintain it for a long time is not ideal.

I know John has a different opinion on new CRDs - but given that Ambient will have particular troubles with DestinationRule ( per-node ztunnel won't apply them - and waypoints can't respect the full semantics ) - this sounds like something that would be much better off in a separate CRD, like CircuitBreaking, with the new attachment model.

DR is in particular tricky because it can be used both client side and server side. It is unreasonable to expect clients to know how many servers are running ( and all of them to update when this change). With auto-scaling - I doubt even the producer can know this. Some of those settings would be far better left to Istiod to compute.

// interval duration above) to perform failure percentage-based ejection for this address. If the
// volume is lower than this setting, failure percentage-based ejection will not be performed for
// this host.
google.protobuf.UInt32Value request_volume = 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this in the context of Istio? How do we expect users to set it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@howardjohn make sense will check and update regarding the fields priority and usecase!

Copy link
Contributor

@costinm costinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least this will need a LOT more docs and examples. Adding an API that even Istio reviewers have trouble understanding leaves little chance to regular users.

I don't disagree that some of it may be useful - but also not clear if this is something the user should configure or Istiod should set, knowing the current number of endpoints.

Another concern is that I don't understand how this relates to traffic shifting and versioning - if I deploy v2 and all the new endpoints are outliers, how would the setting affect this ?

In any case - at the very least the new fields should be annotated with whatever 'experimental' we use to indicate they are NOT part of the beta API and are very experimental. Also a LOT of docs on how this will work with ambient ( we still share many APIs, and for new stuff we should document if it applies or not, where it applies and how the workload selector vs other composition rules are changed)

// The % chance that an address will be actually ejected when an outlier status is detected through
// failure percentage statistics. This setting can be used to disable ejection or to ramp it up
// slowly.
google.protobuf.UInt32Value enforcement_percentage = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well - the documentation on this PR is also pretty unclear. My concern is that reading either doc - I wouldn't know how to use it, and most likely people will use it wrong or not take advantage if it is actually needed - on an already over-complex system.

I understand there is a need - and envoy and proxyless may support this - but if we go down the path of supporting each and every envoy field we may be better off with just adopting Envoy API ( which we already do in EnvoyFilter - with well known problems ).

// The minimum number of addresses in order to perform failure percentage-based ejection.
// If the total number of addresses is less than this value, failure percentage-based
// ejection will not be performed.
google.protobuf.UInt32Value minimum_hosts = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally users try it with EnvoyFilter - adding it to a beta API just to test it out and figure out that only 1 or 2 users understand how to use it - but we have to maintain it for a long time is not ideal.

I know John has a different opinion on new CRDs - but given that Ambient will have particular troubles with DestinationRule ( per-node ztunnel won't apply them - and waypoints can't respect the full semantics ) - this sounds like something that would be much better off in a separate CRD, like CircuitBreaking, with the new attachment model.

DR is in particular tricky because it can be used both client side and server side. It is unreasonable to expect clients to know how many servers are running ( and all of them to update when this change). With auto-scaling - I doubt even the producer can know this. Some of those settings would be far better left to Istiod to compute.

@@ -972,7 +1037,7 @@ message OutlierDetection {
// for connections to upstream database cluster.
//
// {{<tabset category-name="example">}}
// {{<tab name="v1alpha3" category-value="v1alpha3">}}
// {{<tab name="v1alpha3" category-value="v1lpha3">}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is accidental.

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Dec 17, 2023
@istio-testing
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@akamac
Copy link

akamac commented Jan 12, 2024

I'd love to see it implemented as we are adopting proxyless setup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-ok-to-test needs-rebase Indicates a PR needs to be rebased before being merged size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support of OutlierDetection in DestinationRule for Grpc-Proxyless
8 participants