
Unwanted scale-down during sustained load triggers an eventual explosion in replicas #15121

Open
DavidR91 opened this issue Apr 15, 2024 · 2 comments
Labels: area/autoscale, kind/bug

Comments


In what area(s)?

/area autoscale

What version of Knative?

1.9.2

Expected Behavior

Under sustained load to the same service for 15-20 minutes, the service should remain operational with minimal fluctuation in pod/replica count.

(The service in question is extremely simple: it does very little work and does not produce any errors of its own.)

Actual Behavior

Under sustained load as part of a soak load test (~6,000 RPS to a trivial POST endpoint for ~30 minutes), we observe the following:

  • The replica count grows to accommodate the incoming requests, as expected

  • The replica count remains mostly stable for 10-15 minutes and all traffic is served correctly

  • Slowly, the autoscaler starts to cut back the 'desired pods' for the service, for seemingly no reason; the traffic is still the same as before

  • This eventually triggers a chain reaction: the number of replicas is now too low to serve the volume of traffic, the remaining replicas cannot cope (one of the pods' queue-proxies actually OOMs and exits), and the autoscaler enters panic mode and boosts the replica count to a suddenly high value (in our tests this went from 2-3 pods to 54; see the sketch of the panic decision just after this list)
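For context on the panic step, here is a minimal sketch of how we understand the KPA's stable/panic decision from the Knative autoscaling docs. The function and the numbers in main are purely illustrative assumptions, not the actual knative.dev/serving code or our test data:

```go
package main

import (
	"fmt"
	"math"
)

// desiredPods is a rough sketch of the KPA decision as we understand it:
// demand over the stable (60s) and panic (6s) windows is divided by the
// per-pod concurrency target, and panic mode is entered when the
// panic-window demand reaches the panic threshold (default 200%) of the
// currently ready pods. While panicking the autoscaler only scales up.
func desiredPods(stableConcurrency, panicConcurrency, targetPerPod float64, readyPods int) (int, bool) {
	stableDesired := math.Ceil(stableConcurrency / targetPerPod)
	panicDesired := math.Ceil(panicConcurrency / targetPerPod)

	const panicThreshold = 2.0 // 200%, the documented default
	if panicDesired >= panicThreshold*float64(readyPods) {
		// Never scale below the current ready count while panicking.
		return int(math.Max(panicDesired, float64(readyPods))), true
	}
	return int(stableDesired), false
}

func main() {
	// Hypothetical numbers: 3 ready pods, a target of 100 concurrent
	// requests per pod, and a burst observed over the panic window.
	got, panicking := desiredPods(250, 700, 100, 3)
	fmt.Println(got, panicking) // 7 true
}
```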

[Image: autoscaler / replica count graphs]

Exact time points for the above graphs:

  • Test starts at 10:00
  • The Desired and Wanted pods are cut from 4 to 3 by 10:16:21.122
  • The Excess Burst Capacity is logged as deficient (-19) in the autoscaler at 10:16:25.078
  • One of the replicas' queue proxies OOMs and exits at 10:16:26 (having had a steep increase in memory from 10:16:21 onward when the pod count was cut)
  • The Actual Pods drops to 2 (observed at 10:16:28)
  • Panic mode is entered at 10:16:29.078 with an observed excess capacity of -230 and the wanted pods is set to 7
  • Malfunction then occurs, with the wanted pods continuing to increase over time as the observed EBC does not improve (see the sketch of the EBC calculation after this list)
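Since several of the log lines above are driven by the excess burst capacity value, here is a minimal sketch of the EBC calculation as we understand it from the Knative docs: ready pods × per-pod capacity, minus the target-burst-capacity, minus the concurrency observed over the panic window. The function name and the numbers in main are hypothetical, not taken from our test:

```go
package main

import (
	"fmt"
	"math"
)

// excessBurstCapacity is a rough sketch, not the actual autoscaler code.
// A negative result is what the autoscaler logs as "deficient" capacity,
// and it keeps the activator in the request path as a buffer.
func excessBurstCapacity(readyPods int, capacityPerPod, targetBurstCapacity, observedPanicConcurrency float64) float64 {
	totalCapacity := float64(readyPods) * capacityPerPod
	return math.Floor(totalCapacity - targetBurstCapacity - observedPanicConcurrency)
}

func main() {
	// Hypothetical: 3 pods absorbing 100 concurrent requests each, the
	// default target-burst-capacity of 211, and ~300 in-flight requests.
	fmt.Println(excessBurstCapacity(3, 100, 211, 300)) // -211
}
```

Losing even one pod shrinks the total capacity by a full pod's worth, which would be consistent with the EBC dropping sharply right after the cut from 4 to 3 above.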

Steps to Reproduce the Problem

We are running Knative 1.9.2 with net-istio. Apart from boosting the resources for the activator/autoscaler, we are using the default configuration for all components, with the exception of the following settings in the autoscaler ConfigMap:

```yaml
max-scale-down-rate: "1.05"
scale-down-delay: 5m
```

The service is also scaled from zero.
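For reference, here is a minimal sketch of how we understand max-scale-down-rate to limit each scaling decision: with 1.05 the pod count can shrink by at most ~5% per decision (roughly one pod at a time at these counts), while scale-down-delay additionally holds the highest recent recommendation for 5 minutes before any reduction is applied. Names are illustrative, not the actual autoscaler code:

```go
package main

import (
	"fmt"
	"math"
)

// clampScaleDown applies the scale-down rate limit as we understand it:
// the autoscaler may not pick fewer than readyPods / maxScaleDownRate
// pods in a single decision.
func clampScaleDown(desired, readyPods int, maxScaleDownRate float64) int {
	floor := int(math.Floor(float64(readyPods) / maxScaleDownRate))
	if desired < floor {
		return floor
	}
	return desired
}

func main() {
	// Hypothetical: 8 ready pods and a momentary recommendation of 5.
	fmt.Println(clampScaleDown(5, 8, 1.05)) // prints 7: at most one pod removed per step
}
```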

We have seen this type of behaviour (replica/pod explosions) several times previously, although this is the first controlled test where we have reportable metrics and timestamps we can share.

DavidR91 added the kind/bug label on Apr 15, 2024

skonto commented Apr 26, 2024

@DavidR91 hi, I will try to reproduce this. Could you also paste/attach your logs from the autoscaler side with debug enabled?
I am looking for statements like: "Delaying scale to 0, staying at X".


DavidR91 commented May 1, 2024

Attached an autoscaler log with debug enabled. This is a less dramatic scale-down than the one described above, but I think it is still a valid repro:
export.csv

The log starts at the point just after a load spike switches into a load 'soak' for 30 minutes at ~6k RPS (although the full 30 minutes are not included).

Notably, ~7:51:40 is the point just after a pod is removed, where request durations spike upward as a result (this coincides with a scale-down from 8 to 7 in the log):

(Charts are in UTC so 8:51 is the relevant time below):

[Image: autoscaler chart]

Load test:
[Image: load-test chart]
