Hi community, we have potentially skewed, low-latency traffic targeting a CPU-bound Knative service. With concurrency-based autoscaling, we see high p90+ latency. After we manually increase min-scale to an overprovisioned level, p90+ latency returns to normal. We suspect this indicates the autoscaler is underprovisioning, and we want to understand why and explore potential solutions.
Hypothetical traffic pattern & example service settings:
We receive one request every 10ms. In addition, at the start of each second, we receive 10 extra requests in parallel.
The service is CPU-bound and can only process one request at a time (i.e. containerConcurrency=1); additional requests have to wait in a queue. Each request takes 10ms to process.
Expected behavior: the autoscaler scales the service up to 11 pods (or higher, considering the target utilization percentage).
Actual behavior: the autoscaler underprovisions the service, and p90+ latency is high.
We studied the autoscaler logic for the concurrency-based metric, and here is our understanding (please correct us if we are wrong): what the autoscaler actually tracks is AverageConcurrency. Using the hypothetical traffic above, for each second:
```go
// https://github.com/knative/serving/blob/main/vendor/knative.dev/networking/pkg/http/stats/request.go#L96-L104
func (s *RequestStats) compute(now time.Time) {
	if durationSinceChange := now.Sub(s.lastChange); durationSinceChange > 0 {
		durationSecs := durationSinceChange.Seconds()
		s.secondsInUse += durationSecs // this will be 1 second after accumulation
		// this will be 11*0.01 + 10*0.01 + ... + 2*0.01 + (1*0.01)*90 = 65*0.01 + 90*0.01 = 1.55
		s.computedConcurrency += s.concurrency * durationSecs
		s.computedProxiedConcurrency += s.proxiedConcurrency * durationSecs
		s.lastChange = now
	}
}
```

```go
// https://github.com/knative/serving/blob/main/vendor/knative.dev/networking/pkg/http/stats/request.go#L144-L147
if s.secondsInUse > 0 {
	report.AverageConcurrency = s.computedConcurrency / s.secondsInUse // this will be 1.55
	report.AverageProxiedConcurrency = s.computedProxiedConcurrency / s.secondsInUse
}
```
With AverageConcurrency=1.55, it looks like the autoscaler will scale up to only 2 pods, even though we have a peak concurrency of 11. In other words, the autoscaler underprovisions from the perspective of peak concurrency (though the result certainly makes sense for average concurrency).
Questions:
Is our above understanding correct?
I understand that average concurrency is a good default in most cases, but I am curious whether there is any way to make autoscaling more reactive to such low-latency, uneven traffic. Ideally there would be a per-service/per-revision toggle to tune the sensitivity of the concurrency metric; for example, if both average and peak concurrency were reported, a configurable ratio between them could tune autoscaling sensitivity.
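As a partial workaround (not the peak-aware toggle asked about), Knative does expose per-revision autoscaling annotations that trade steady-state efficiency for reactiveness, such as lowering the target utilization percentage or shortening the stable window. A sketch, with illustrative values and a hypothetical service name:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example # hypothetical
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        # Treat each pod as "full" at 60% of containerConcurrency,
        # effectively overprovisioning relative to average concurrency.
        autoscaling.knative.dev/target-utilization-percentage: "60"
        # Shorten the stable window (default 60s) so short bursts
        # weigh more heavily in the averaged metric.
        autoscaling.knative.dev/window: "6s"
```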
TIA for any insights and help!