Expected Behavior
Under sustained load to the same service for 15-20 minutes, the service should remain operational with minimal fluctuation in pod/replica count.
(The service in question is extremely simple and does very little work and does not incur any errors of its own)
Actual Behavior
Under sustained load as part of a soak load test (~6,000 RPS to a trivial POST endpoint for ~30 minutes) we observe the following:
The replica count grows to accommodate the incoming requests, as expected
The replica count remains mostly stable for 10-15 minutes and all traffic is served correctly
Slowly, the autoscaler starts to cut back on 'desired pods' for the service, for no apparent reason; the traffic is still the same as before
This eventually results in a kind of chain reaction: the number of replicas becomes too low to serve the volume of traffic, the remaining replicas can't cope (one of the pods' queue-proxies actually OOMs and exits), and the autoscaler enters panic mode and boosts the replica count to a suddenly much higher value (in our tests this went from 2-3 pods to 54)
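For context, the panic transition described above can be sketched roughly as follows. This is a minimal illustration of the documented panic-threshold semantics, not the actual autoscaler code, and all names here are made up:

```python
# Default panic-threshold-percentage from the autoscaler ConfigMap.
PANIC_THRESHOLD_PCT = 200.0

def should_panic(desired_panic_pods: float, ready_pods: float) -> bool:
    """Panic when the short panic-window demand reaches the threshold
    (by default, twice what the current ready pods can absorb)."""
    if ready_pods == 0:
        return True
    return desired_panic_pods / ready_pods >= PANIC_THRESHOLD_PCT / 100.0

# e.g. 2 remaining ready pods while the panic window wants 7:
# 7 / 2 = 3.5x >= 2.0x, so the autoscaler panics.
```

This is why a delayed scale-down can flip straight into panic: once enough pods are removed, the same steady traffic suddenly looks like a burst relative to the shrunken replica set.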
Exact time points for above graphs:
Test starts at 10:00
The Desired and Wanted pod counts are cut from 4 to 3 by 10:16:21.122
The Excess Burst Capacity is logged as deficient (-19) in the autoscaler at 10:16:25.078
One of the replicas' queue proxies OOMs and exits at 10:16:26 (having had a steep increase in memory from 10:16:21 onward when the pod count was cut)
The Actual Pods drops to 2 (observed at 10:16:28)
Panic mode is entered at 10:16:29.078 with an observed excess capacity of -230, and the wanted pod count is set to 7
The malfunction then continues, with the wanted pod count increasing over time as the observed EBC fails to improve
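The negative EBC values in the timeline can be understood with a rough sketch of the excess-burst-capacity calculation, based on the documented formula; the function and numbers here are illustrative, not Knative's code:

```python
import math

def excess_burst_capacity(ready_pods: int,
                          per_pod_capacity: float,
                          target_burst_capacity: float,
                          observed_panic_concurrency: float) -> int:
    """EBC = capacity of the ready pods, minus the configured burst
    headroom, minus the concurrency currently observed. A negative
    value means the revision can no longer absorb the configured
    burst, as in the -19 and -230 readings above."""
    return int(math.floor(ready_pods * per_pod_capacity
                          - target_burst_capacity
                          - observed_panic_concurrency))

# Illustrative numbers only: 3 pods of capacity 100 against a burst
# target of 211 and 120 in-flight requests leaves a deficit of -31,
# while a 4th pod would leave headroom of +69.
```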
Steps to Reproduce the Problem
We are running Knative 1.9.2 with net-istio. Aside from boosting the resources for the activator/autoscaler, we are using the default configuration for all components, except for the following settings in the autoscaler ConfigMap:
max-scale-down-rate: "1.05"
scale-down-delay: 5m
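For what it's worth, our max-scale-down-rate setting is consistent with the 4 to 3 cut in the timeline above; a minimal sketch of the documented per-step scale-down bound (illustrative names, not the autoscaler's code):

```python
import math

def scale_down_limit(ready_pods: int, max_scale_down_rate: float) -> int:
    """Per the docs, a single scaling decision may not drop the pod
    count below ready_pods / max-scale-down-rate."""
    return int(math.floor(ready_pods / max_scale_down_rate))

# With max-scale-down-rate = 1.05, 4 ready pods can be cut to at most
# floor(4 / 1.05) = 3 in one step -- matching the cut at 10:16:21.
```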
The service is also scaled from zero.
We have seen this type of behaviour (replica/pod explosions) several times previously, although this is the first controlled test where we have reportable metrics and timestamps we can share
@DavidR91 hi, I will try to reproduce. Could you also paste/attach your logs from the autoscaler side with debug enabled?
I am looking for statements like: "Delaying scale to 0, staying at X".
Attached an autoscaler log with debug enabled. This is a less dramatic scale-down than the one mentioned above, but I think it is still a valid repro: export.csv
The log starts at the point just after a load spike switches into a load 'soak' for 30 minutes at ~6k RPS (although the full 30 minutes are not included).
Notably, ~7:51:40 is the point just after a pod is removed, where request durations spike upward as a result (this coincides with a scale from 8 to 7 in the log):
(Charts are in UTC so 8:51 is the relevant time below):