
Unwanted scale-down during sustained load triggers an eventual explosion in replicas #15121

Open
DavidR91 opened this issue Apr 15, 2024 · 2 comments
Labels: area/autoscale, kind/bug

Comments


In what area(s)?

/area autoscale

What version of Knative?

1.9.2

Expected Behavior

Under sustained load to the same service for 15-20 minutes, the service should remain operational with minimal fluctuation in pod/replica count.

(The service in question is extremely simple: it does very little work and does not produce any errors of its own.)

Actual Behavior

Under sustained load as part of a soak load test (~6,000 RPS to a trivial POST endpoint for ~30 minutes), we observe the following:

  • The replica count grows to accommodate the incoming requests, as expected

  • The replica count remains mostly stable for 10-15 minutes and all traffic is served correctly

  • Slowly, the autoscaler starts to cut back the 'desired pods' for the service, for seemingly no reason; the traffic is still the same as before

  • This eventually triggers a chain reaction: the number of replicas is now too low to serve the volume of traffic, the remaining replicas cannot cope (one of the pods' queue-proxies actually OOMs and exits), and the autoscaler enters panic mode and boosts the replica count to a suddenly high value (in our tests this went from 2-3 pods to 54; see the sketch of the panic decision just after this list)
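For context on the panic step, here is a minimal sketch of how we understand the KPA's stable/panic decision from the Knative autoscaling docs. The function and the numbers in main are purely illustrative assumptions, not the actual knative.dev/serving code or our test data:

```go
package main

import (
	"fmt"
	"math"
)

// desiredPods is a rough sketch of the KPA decision as we understand it:
// demand over the stable (60s) and panic (6s) windows is divided by the
// per-pod concurrency target, and panic mode is entered when the
// panic-window demand reaches the panic threshold (default 200%) of the
// currently ready pods. While panicking the autoscaler only scales up.
func desiredPods(stableConcurrency, panicConcurrency, targetPerPod float64, readyPods int) (int, bool) {
	stableDesired := math.Ceil(stableConcurrency / targetPerPod)
	panicDesired := math.Ceil(panicConcurrency / targetPerPod)

	const panicThreshold = 2.0 // 200%, the documented default
	if panicDesired >= panicThreshold*float64(readyPods) {
		// Never scale below the current ready count while panicking.
		return int(math.Max(panicDesired, float64(readyPods))), true
	}
	return int(stableDesired), false
}

func main() {
	// Hypothetical numbers: 3 ready pods, a target of 100 concurrent
	// requests per pod, and a burst observed over the panic window.
	got, panicking := desiredPods(250, 700, 100, 3)
	fmt.Println(got, panicking) // 7 true
}
```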

[Image: autoscaler / replica count graphs]

Exact time points for the above graphs:

  • Test starts at 10:00
  • The Desired and Wanted pods are cut from 4 to 3 by 10:16:21.122
  • The Excess Burst Capacity is logged as deficient (-19) in the autoscaler at 10:16:25.078
  • One of the replicas' queue proxies OOMs and exits at 10:16:26 (having had a steep increase in memory from 10:16:21 onward when the pod count was cut)
  • The Actual Pods drops to 2 (observed at 10:16:28)
  • Panic mode is entered at 10:16:29.078 with an observed excess capacity of -230 and the wanted pods is set to 7
  • Malfunction then occurs, with the wanted pods continuing to increase over time as the observed EBC does not improve (see the sketch of the EBC calculation after this list)
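Since several of the log lines above are driven by the excess burst capacity value, here is a minimal sketch of the EBC calculation as we understand it from the Knative docs: ready pods × per-pod capacity, minus the target-burst-capacity, minus the concurrency observed over the panic window. The function name and the numbers in main are hypothetical, not taken from our test:

```go
package main

import (
	"fmt"
	"math"
)

// excessBurstCapacity is a rough sketch, not the actual autoscaler code.
// A negative result is what the autoscaler logs as "deficient" capacity,
// and it keeps the activator in the request path as a buffer.
func excessBurstCapacity(readyPods int, capacityPerPod, targetBurstCapacity, observedPanicConcurrency float64) float64 {
	totalCapacity := float64(readyPods) * capacityPerPod
	return math.Floor(totalCapacity - targetBurstCapacity - observedPanicConcurrency)
}

func main() {
	// Hypothetical: 3 pods absorbing 100 concurrent requests each, the
	// default target-burst-capacity of 211, and ~300 in-flight requests.
	fmt.Println(excessBurstCapacity(3, 100, 211, 300)) // -211
}
```

Losing even one pod shrinks the total capacity by a full pod's worth, which would be consistent with the EBC dropping sharply right after the cut from 4 to 3 above.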

Steps to Reproduce the Problem

We are running Knative 1.9.2 with net-istio. Apart from boosting the resources for the activator/autoscaler, we are using the default configuration for all components, with the exception of the following settings in the autoscaler ConfigMap:

```yaml
max-scale-down-rate: "1.05"
scale-down-delay: 5m
```

The service is also scaled from zero.
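For reference, here is a minimal sketch of how we understand max-scale-down-rate to limit each scaling decision: with 1.05 the pod count can shrink by at most ~5% per decision (roughly one pod at a time at these counts), while scale-down-delay additionally holds the highest recent recommendation for 5 minutes before any reduction is applied. Names are illustrative, not the actual autoscaler code:

```go
package main

import (
	"fmt"
	"math"
)

// clampScaleDown applies the scale-down rate limit as we understand it:
// the autoscaler may not pick fewer than readyPods / maxScaleDownRate
// pods in a single decision.
func clampScaleDown(desired, readyPods int, maxScaleDownRate float64) int {
	floor := int(math.Floor(float64(readyPods) / maxScaleDownRate))
	if desired < floor {
		return floor
	}
	return desired
}

func main() {
	// Hypothetical: 8 ready pods and a momentary recommendation of 5.
	fmt.Println(clampScaleDown(5, 8, 1.05)) // prints 7: at most one pod removed per step
}
```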

We have seen this type of behaviour (replica/pod explosions) several times previously, although this is the first controlled test where we have reportable metrics and timestamps we can share.

DavidR91 added the kind/bug label on Apr 15, 2024

skonto commented Apr 26, 2024

@DavidR91 hi, I will try to reproduce this. Could you also paste/attach your logs from the autoscaler side with debug enabled?
I am looking for statements like: "Delaying scale to 0, staying at X".


DavidR91 commented May 1, 2024

Attached an autoscaler log with debug enabled. This is a less dramatic scale-down than the one described above, but I think it is still a valid repro:
export.csv

The log starts at the point just after a load spike switches into a load 'soak' for 30 minutes at ~6k RPS (although the full 30 minutes are not included).

Notably, ~7:51:40 is the point just after a pod is removed, where request durations spike upward as a result (this coincides with a scale-down from 8 to 7 in the log):

(Charts are in UTC so 8:51 is the relevant time below):

[Image: autoscaler chart]

Load test:
[Image: load-test chart]
