Background: Linux CFS Throttling

In short, CFS CPU throttling works by:

- defining an on-CPU time quota over a given wallclock period
- once a cgroup exceeds its quota within a period, it will not be scheduled at all until the next period
- so the more CPU-thirsty your workload is, the more mindful you have to be about staying within its time quota, or you'll suffer potentially large latency artifacts due to being descheduled
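To make those two knobs concrete, here's a minimal sketch of setting them directly (assuming a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup and root privileges; the "demo" cgroup name is made up; under cgroup v1 the equivalent files are cpu.cfs_quota_us and cpu.cfs_period_us):

```go
// Minimal sketch: set a CFS quota/period pair on a cgroup v2 cgroup.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	cg := "/sys/fs/cgroup/demo" // illustrative cgroup name
	if err := os.MkdirAll(cg, 0o755); err != nil {
		panic(err)
	}
	// cpu.max takes "<quota> <period>" in microseconds: here 400ms of
	// on-CPU time per 1s wallclock period (i.e. at most 0.4 cores).
	quota := fmt.Sprintf("%d %d", 400_000, 1_000_000)
	if err := os.WriteFile(filepath.Join(cg, "cpu.max"), []byte(quota), 0o644); err != nil {
		panic(err)
	}
}
```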
Background: Aurae

Aurae uses CFS throttling to enforce cell CPU time quotas, similarly to Docker, as described in the various background articles above. However, it's made the interesting choice to hide the CFS throttling period, exposing only a max time quota field in its API. Furthermore, Aurae has hardcoded the CFS period to be 1s, which is 10x its typical default value of 100ms.
Problem: Large Latency Artifacts
The primary problem with how Aurae's CPU quotas currently work is large latency artifacts:
- once a cell has spent its max CPU time, none of its running processes will be scheduled for the rest of an entire second
- let's use the common 400ms setting used in many of the examples, and assume a mundane 100% CPU utilization (1 OS CPU core): after running for 400ms, the cell then pauses for 600ms
- the problem only gets worse the more parallel a cell's workload is: if the cell has 4 hot/running kernel tasks (threads, processes, VMs, whatever), then the cell will run for 100ms and then pause for 900ms (the sketch after this list spells out the arithmetic)
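To spell that arithmetic out: with quota Q, period P, and N tasks each burning a full core, the quota is exhausted after Q/N of wallclock time, leaving the cell descheduled for the remaining P - Q/N of the period. A small Go sketch:

```go
package main

import (
	"fmt"
	"time"
)

// worstCasePause: with quota q of on-CPU time per period p, n saturating
// tasks burn through the quota after q/n of wallclock time, so the cell
// then sits descheduled for the remaining p - q/n of the period.
func worstCasePause(q, p time.Duration, n int) time.Duration {
	runFor := q / time.Duration(n)
	if runFor >= p {
		return 0 // the cell can't exhaust its quota within one period
	}
	return p - runFor
}

func main() {
	fmt.Println(worstCasePause(400*time.Millisecond, time.Second, 1)) // 600ms
	fmt.Println(worstCasePause(400*time.Millisecond, time.Second, 4)) // 900ms
}
```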
See the example section below for code and excerpt data exploring this effect.
In the case of a request processing service, these are SLO-breaking levels of latency.
In fact, the typical 100ms CFS period is already material to such things.
Having even larger latency artifacts, now measured in the 600ms-900ms range,
might even be bad enough to affect things like health checks and cluster
traffic routing systems.
Proposal: at least expose CFS period; maybe lower its default
At the very least, in my view, Aurae needs to expose the CFS period alongside max CPU time.
I'm less convinced about lowering the default:
- to me, Aurae's choice to directly expose a CFS time quota, rather than a more abstracted percentage or logical core count, is a laudable one
- to that end, I feel that users (especially ones for whom a latency SLO is a thing) should have to directly answer a question like "When my cell saturates its CPU quota, what is the time penalty we're willing to pay for that?"
- further, I'd prefer to see every cell CPU allocation specify both parts; but if we're not going to formally require that, having a default period so large as to be unacceptable for anyone's SLO would also do, since in practice it'll force everyone to specify a period eventually (a purely illustrative sketch of such a two-part allocation follows this list)
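To illustrate the shape I'm arguing for (these are not Aurae's actual API types; the names CellCPU, MaxMicros, and PeriodMicros are made up for this sketch), a cell CPU allocation that names both halves of the CFS setting:

```go
package main

import "fmt"

// CellCPU is a hypothetical request shape, not Aurae's real API: it pairs
// the CFS quota with the CFS period, so a user who sets one is forced to
// confront the other (and thus the worst-case pause it implies).
type CellCPU struct {
	MaxMicros    int64 // CFS quota: on-CPU time allowed per period
	PeriodMicros int64 // CFS period: wallclock window the quota refreshes over
}

func main() {
	cpu := CellCPU{MaxMicros: 400_000, PeriodMicros: 100_000}
	fmt.Printf("cell may burn up to %.1f cores' worth of CPU time per period\n",
		float64(cpu.MaxMicros)/float64(cpu.PeriodMicros))
}
```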
Example Code / Data
To confirm my own recollection of how this all works, and to allow easy reproduction by others, I've written some example programs in #406:

- 2 example programs that burn as much CPU as they can, as a saturation test:
  - the Node.JS one can only burn (a bit more than) 1 CPU core, due to how Node/JavaScript currently work
  - the Go version can burn as many cores as you've got; indeed, one must specify GOMAXPROCS when running it, since the Go runtime is still not container aware without a 3rd party library
- the supporting auraescript runner is a bit hacky, but gets the job done to demonstrate CPU quota saturation today; see its // NOTE comment after its cells.start call for instructions.
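For a sense of the approach those programs take, here's a minimal burn-and-measure sketch (this is not the actual #406 code, just the general shape, and the 10ms reporting threshold is an arbitrary choice): saturate every core, and report whenever a fixed 100ms heartbeat arrives late.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Saturate every schedulable core with a busy-spinning goroutine.
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		go func() {
			for {
			}
		}()
	}

	// Heartbeat: without throttling this loop wakes roughly every 100ms;
	// once the cell's CFS quota is exhausted, wakeups overshoot by however
	// long the cell spends descheduled.
	const expected = 100 * time.Millisecond
	last := time.Now()
	for {
		time.Sleep(expected)
		now := time.Now()
		if actual := now.Sub(last); actual > expected+10*time.Millisecond {
			fmt.Printf("lag: actual=%v expected=%v\n", actual, expected)
		}
		last = now
	}
}
```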
Example Excerpt: Node.JS burning about 1 CPU core within a 400ms/1_000ms quota

After running for around 30 seconds, the node example program experiences 600-700ms latency excursions:

Corresponding kernel stats:
Example Excerpt: Go burning 4 CPU cores within a 2_000ms/1_000ms quota
Here's a similar result from running an analogous Go program for around 30 seconds:
```
2023/02/14 15:58:51 [lag report] min:27.89854ms max:472.131789ms box:[ 99.880519ms 99.992654ms 100.151136ms ] hi:101.992654ms hiOutliers:12 12.0%
2023/02/14 15:58:51 {Start:2023-02-14 15:58:39.214184729 -0500 EST m=+20.401723712 End:2023-02-14 15:58:39.485175472 -0500 EST m=+20.672714508 Actual:270.990796ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:49.014233509 -0500 EST m=+30.201772457 End:2023-02-14 15:58:49.485037021 -0500 EST m=+30.672576095 Actual:470.803638ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:45.014041338 -0500 EST m=+26.201580281 End:2023-02-14 15:58:45.485087655 -0500 EST m=+26.672626688 Actual:471.046407ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:51.01440381 -0500 EST m=+32.201942761 End:2023-02-14 15:58:51.485465242 -0500 EST m=+32.673004332 Actual:471.061571ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:43.014046723 -0500 EST m=+24.201585667 End:2023-02-14 15:58:43.48512526 -0500 EST m=+24.672664308 Actual:471.078641ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:44.01403297 -0500 EST m=+25.201571914 End:2023-02-14 15:58:44.485149938 -0500 EST m=+25.672688966 Actual:471.117052ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:48.01406235 -0500 EST m=+29.201601293 End:2023-02-14 15:58:48.48522296 -0500 EST m=+29.672762000 Actual:471.160707ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:50.014036425 -0500 EST m=+31.201575373 End:2023-02-14 15:58:50.485200964 -0500 EST m=+31.672739934 Actual:471.164561ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:40.014035733 -0500 EST m=+21.201574682 End:2023-02-14 15:58:40.485247383 -0500 EST m=+21.672786397 Actual:471.211715ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:42.014066621 -0500 EST m=+23.201605565 End:2023-02-14 15:58:42.485391197 -0500 EST m=+23.672930199 Actual:471.324634ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:46.014037924 -0500 EST m=+27.201576868 End:2023-02-14 15:58:46.485420931 -0500 EST m=+27.672959951 Actual:471.383083ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:47.014097749 -0500 EST m=+28.201636694 End:2023-02-14 15:58:47.486229431 -0500 EST m=+28.673768483 Actual:472.131789ms Expected:100ms}
```
Here the actual lag encountered is a little lower, since the CPU quota is a little less oversubscribed (4 saturating cores against a 2s-per-1s quota leaves 500ms of runtime per period, versus 400ms in the Node case). Also, the low end of the box stat may seem surprising, but it is an artifact of how a constant-interval Go ticker behaves after encountering runtime lag; in other words, after coming out of a pause, it delivers a couple of ticks in rapid succession.
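That rapid-succession tick behavior is easy to see in isolation (no cgroups involved; the 350ms sleep below stands in for a throttling pause). time.Ticker buffers a single tick and drops the rest, so after a stall one tick is received immediately and the next arrives on the original schedule, yielding one long measured interval followed by one well under 100ms:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	t := time.NewTicker(100 * time.Millisecond)
	defer t.Stop()
	last := time.Now()
	for i := 0; i < 8; i++ {
		<-t.C // the tick buffered during the stall is received immediately
		now := time.Now()
		fmt.Printf("interval %d: %v\n", i, now.Sub(last))
		last = now
		if i == 3 {
			time.Sleep(350 * time.Millisecond) // stand-in for a CFS throttling pause
		}
	}
}
```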
Corresponding kernel stats:

Example Excerpt: Go burning 8 CPU cores within a 400ms/1_000ms quota

For a final extreme example, here's an even more over-subscribed Go example:

Here there aren't any "outliers" under a classic boxplot analysis, because the 75%-ile is so heavily skewed, up around 980ms.

---

This was completely my decision to hide the max, thinking it would make it simpler to reason about for end users; but you're right that it also limits flexibility, and more throttling comes at a significant cost.