Describe the bug
When a canary for a deployment serving high RPS is promoted to primary, the new primary fails because it doesn't have enough replicas to handle the load.
To Reproduce
Apply a Canary object for a high-RPS deployment with stepWeight: 3 and interval: 1m (3% every minute)
Configure the HPA with a fairly large gap between min and max replicas, e.g. minReplicas: "3", maxReplicas: "30"
Observe the deployment
Expected behavior
The deployment should succeed
Actual behavior
The canary traffic shift completes successfully in around half an hour, since it moves 3% every minute.
The canary deployment is scaled up by the HPA as more traffic is shifted to it.
At the same time, the primary deployment is scaled down by the HPA as it receives less traffic.
Canary is promoted to primary.
The new primary fails because it can't handle the load.
The canary is stuck in the Finalising state.
This is because the new primary deployment's replica count is only set when the HPA reference is nil. With an HPA configured, the new primary's replica count ends up at the HPA's minimum, and since this is a small value, it cannot handle the load.
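The sequence above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Flagger's or the HPA's actual logic: the total load (2500 RPS) and per-pod capacity (100 RPS) are hypothetical numbers, the HPA is idealised as converging instantly to ceil(load / per-pod capacity) clamped to its min and max, and the canary weight is assumed to climb all the way to 100, as the roughly half-hour shift in this report suggests.

```python
import math

def hpa_replicas(rps, rps_per_pod, min_replicas, max_replicas):
    """Replica count an idealised HPA would settle on for a given load."""
    wanted = math.ceil(rps / rps_per_pod)
    return max(min_replicas, min(max_replicas, wanted))

def simulate_shift(total_rps, rps_per_pod, min_replicas, max_replicas, step_weight):
    """Yield (canary_weight, canary_replicas, primary_replicas) for each step."""
    weight = 0
    while weight < 100:
        weight = min(100, weight + step_weight)
        canary_rps = total_rps * weight / 100
        primary_rps = total_rps - canary_rps
        yield (
            weight,
            hpa_replicas(canary_rps, rps_per_pod, min_replicas, max_replicas),
            hpa_replicas(primary_rps, rps_per_pod, min_replicas, max_replicas),
        )

# Hypothetical workload: 2500 RPS in total, each pod handles about 100 RPS.
steps = list(simulate_shift(total_rps=2500, rps_per_pod=100,
                            min_replicas=3, max_replicas=30, step_weight=3))
final_weight, canary_replicas, primary_replicas = steps[-1]

# Capacity of the primary at the moment all traffic is switched back to it.
primary_capacity_rps = primary_replicas * 100
print(len(steps), final_weight, canary_replicas, primary_replicas, primary_capacity_rps)
```

Under these assumptions the shift takes 34 one-minute steps, the canary ends at 25 replicas, and the primary has shrunk to the HPA minimum of 3, so it can absorb only about 300 of the 2500 RPS that arrive at promotion.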
Additional context
Flagger version: 1.34
Kubernetes version: 1.27
Service Mesh provider: istio
Ingress provider: istio
Workarounds
Adjust stepWeightPromotion so that promotion does a partial traffic shift - Since this was already done as part of the canary analysis, it seems redundant.
Don't use a low value for the HPA min - This isn't ideal for workloads whose non-peak traffic is low, since it wastes resources.
If stepWeightPromotion: 100 (or via another variable like promotionReplicas), primary replicas should be set to the canary's replicas - This seems logical, but I'm not sure how the HPA will react.
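The last suggestion can be sketched as a small decision function. It is a hypothetical illustration of the proposed rule, not Flagger code: the function name and parameters are made up for this example, and the branch without an HPA simply mirrors the behaviour described in this report (replicas are copied only when there is no HPA reference).

```python
def promoted_primary_replicas(canary_replicas, primary_replicas,
                              has_hpa, step_weight_promotion):
    """Hypothetical rule for the replica count of the new primary."""
    if not has_hpa:
        # Behaviour described in the report: without an HPA reference the
        # primary's replica count is set from the canary.
        return canary_replicas
    if step_weight_promotion >= 100:
        # Proposed change: all traffic moves in one step, so start the
        # primary at the size the canary needed to serve that traffic.
        return canary_replicas
    # Gradual promotion: leave the count alone and let the HPA scale the
    # primary as traffic shifts back step by step.
    return primary_replicas

print(promoted_primary_replicas(25, 3, has_hpa=True, step_weight_promotion=100))
```

Whether the HPA would then scale the primary straight back down before the traffic arrives is exactly the open question raised above; the sketch does not model that.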
shysank changed the title from "Thundering herd fails primary during finalising" to "Thundering herd causes canary to be stuck finalising" on Feb 21, 2024
shysank changed the title from "Thundering herd causes canary to be stuck finalising" to "Thundering herd causes canary to be stuck in finalising" on Feb 21, 2024
@stefanprodan Thanks for the response. stepWeightPromotion is what we're planning to use. Unfortunately, a side effect is that deployment time doubles for the same strategy without much benefit: we already know that the new build works, and even if it didn't, we cannot roll back to an older deployment at this point.
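A rough back-of-the-envelope check of the "deployment time doubles" point, as a sketch only: it assumes the canary weight goes from 0 to 100, that each weight step takes exactly one interval, and it ignores every other phase of a rollout. The function name is hypothetical.

```python
import math

def rollout_minutes(step_weight, step_weight_promotion, interval_minutes=1):
    """Approximate minutes spent shifting traffic to the canary and back."""
    canary_steps = math.ceil(100 / step_weight)
    promotion_steps = math.ceil(100 / step_weight_promotion)
    return (canary_steps + promotion_steps) * interval_minutes

one_shot = rollout_minutes(step_weight=3, step_weight_promotion=100)
gradual = rollout_minutes(step_weight=3, step_weight_promotion=3)
print(one_shot, gradual)
```

With stepWeight: 3 this gives about 35 minutes when promotion happens in one step and about 68 minutes when promotion also moves 3% per minute.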
Would it make sense to set the new primary's replicas to the same value as the canary's when stepWeightPromotion is 100? Or to add another variable like primaryReplicas? Happy to work on a patch if either of these makes sense.