
Concerns About CPU Quota Implementation #407

Open · jcorbin opened this issue Feb 14, 2023 · 2 comments

Comments

@jcorbin (Contributor) commented Feb 14, 2023

Background: Linux CFS Throttling

In short, CFS CPU throttling works by:

  • defining an on-CPU time quota over a given wallclock period
  • once a cgroup exceeds its quota within a period, it will not be scheduled at
    all until the next period
  • so the more CPU thirsty your workload is, the more you have to be mindful
    about staying within its time quota, or suffer potentially large latency
    artifacts due to being descheduled

In long: see the Linux kernel's CFS bandwidth control documentation, plus the LinkedIn (2016), Indeed, and Dan Luu write-ups on CFS throttling.
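
To make that concrete, here's a minimal sketch (my own illustration: it assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup, root privileges, and a made-up group name) of the two kernel files involved: cpu.max takes "<quota_usec> <period_usec>", and cpu.stat reports how often and for how long the group was throttled:

// Minimal sketch, not production code: set a CFS quota/period on a cgroup v2
// group and read back its throttling counters.
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    cg := "/sys/fs/cgroup/demo-cell" // hypothetical cgroup path

    if err := os.MkdirAll(cg, 0o755); err != nil {
        panic(err)
    }

    // 50ms of on-CPU time per 100ms period, i.e. half a core; both values are
    // microseconds. The kernel default is "max 100000" (unlimited, 100ms period).
    if err := os.WriteFile(filepath.Join(cg, "cpu.max"), []byte("50000 100000"), 0o644); err != nil {
        panic(err)
    }

    // cpu.stat's nr_throttled / throttled_usec lines (seen in the excerpts
    // further below) count how many periods hit the quota and how long the
    // group sat descheduled as a result.
    stat, err := os.ReadFile(filepath.Join(cg, "cpu.stat"))
    if err != nil {
        panic(err)
    }
    fmt.Print(string(stat))
}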

Background: Aurae

Aurae uses CFS throttling to enforce cell CPU time quotas, similarly to Docker
as described in the various background articles above.

However, it has made the interesting choice to hide the CFS throttling period,
exposing only a max time quota field in its API. Furthermore, Aurae has
hardcoded the CFS period to 1s, which is 10x its typical default value of
100ms.
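
Concretely, that means a cell created with a 400ms max ends up with a cgroup v2 cpu.max along these lines (the path mirrors the cpu.stat excerpts below; under the kernel's default period the second field would be 100000):

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.max
400000 1000000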

Problem: Large Latency Artifacts

The primary problem with how Aurae's CPU quotas currently work is large latency artifacts:

  • once a cell has spent its max CPU time, none of its running processes will be
    scheduled for the rest of an entire second
  • let's use the common 400ms setting used in many of the examples, and assume a
    mundane 100% CPU utilization (1 OS CPU core): after running for 400ms, the
    cell then pauses for 600ms
  • the problem only gets worse the more parallel a cell's workload is: if the
    cell has 4 hot/running kernel tasks (threads, processes, VMs, whatever) then
    the cell will run for 100ms then pause for 900ms

See the example section below for code and excerpt data exploring this effect.
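
To sanity-check that arithmetic, here's a tiny back-of-the-envelope sketch (my own illustration, not code from #406) of the worst-case pause for a fully saturated cell:

// Back-of-the-envelope model of the worst-case CFS pause: a saturated cell
// burns through its quota in quota/parallelism of wallclock time, then sits
// descheduled until the next period boundary.
package main

import (
    "fmt"
    "time"
)

func worstCasePause(quota, period time.Duration, parallelism int) time.Duration {
    running := quota / time.Duration(parallelism) // wallclock time until the quota is spent
    if running >= period {
        return 0 // the quota outlasts the period; throttling never kicks in
    }
    return period - running
}

func main() {
    quota, period := 400*time.Millisecond, time.Second // Aurae's hardcoded 1s period
    for _, p := range []int{1, 4, 8} {
        fmt.Printf("%d hot task(s): runs %v, then pauses %v\n",
            p, quota/time.Duration(p), worstCasePause(quota, period, p))
    }
}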

In the case of a request-processing service, these are SLO-breaking levels of latency.
In fact, the typical 100ms CFS period is already material to such concerns.

Having even larger latency artifacts, now measured in the 600ms-900ms range,
might even be bad enough to affect things like health checks and cluster
traffic routing systems.

Proposal: at least expose the CFS period; maybe lower its default

At the very least, in my view, Aurae needs to expose the CFS period alongside
the max CPU time.

I'm less convinced about lowering the default:

  • to me, Aurae's choice to directly expose a CFS time quota, rather than a
    more abstracted percentage or logical core count, is a laudable one
  • to that end, I feel that users (especially ones for whom a latency SLO is a
    thing) should have to directly answer a question like "When my cell saturates
    its CPU quota, what is the time penalty we're willing to pay for that?"
  • to that end, I'd prefer to see every cell CPU allocation specify both
    parts; but if we're not going to formally require it, having a default period
    so large as to be unacceptable for anyone's SLO would also do, since in
    practice it'll force everyone to specify a period eventually
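
For illustration only (hypothetical names, not Aurae's actual API), a cell CPU allocation that makes callers answer both questions might look something like:

// Hypothetical shape for a cell CPU allocation exposing both CFS knobs; the
// type and field names here are illustrative, not Aurae's actual API.
package cellspec

import "time"

type CPUSpec struct {
    // Max is the on-CPU time the cell may spend per Period before CFS
    // deschedules it until the next period boundary.
    Max time.Duration

    // Period is the CFS enforcement window. The worst-case pause a saturated
    // cell can see is roughly Period minus the wallclock time it takes to
    // spend Max.
    Period time.Duration
}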

Example Code / Data

To confirm my own recollection of how this all works, and allow easy
reproduction by others, I've written some example programs in #406:

  • two example programs that burn as much CPU as they can, as a saturation test
    • the Node.JS one can only burn (a bit more than) 1 CPU core due to how
      Node/JavaScript work currently
    • the Go version can burn as many cores as you've got; indeed, one must
      specify GOMAXPROCS when running it, since the Go runtime is still not
      container aware without a 3rd-party library
  • the supporting auraescript runner is a bit hacky, but gets the job done to
    demonstrate CPU quota saturation today; see its // NOTE comment after its
    cells.start call for instructions.
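
For reference, a minimal stand-in for the Go burner looks roughly like this (my own sketch, not the actual code from #406):

// Minimal sketch of a CPU saturation test: burn GOMAXPROCS cores while a
// ticker measures how late each 100ms tick arrives, which is where the
// throttling pauses show up. Not the actual #406 program.
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    // Burn every core the runtime will give us. GOMAXPROCS must be set
    // explicitly when running under a CPU quota, since the Go runtime is not
    // cgroup aware on its own.
    for i := 0; i < runtime.GOMAXPROCS(0); i++ {
        go func() {
            for {
            } // spin forever, burning one core
        }()
    }

    const interval = 100 * time.Millisecond
    prev := time.Now()
    for range time.Tick(interval) {
        now := time.Now()
        if lag := now.Sub(prev) - interval; lag > 10*time.Millisecond {
            fmt.Printf("tick arrived %v late\n", lag)
        }
        prev = now
    }
}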

Example Excerpt: Node.JS burning about 1 CPU core within a 400ms/1_000ms quota

After running for around 30 seconds, the node example program experiences 600-700ms latency excursions:

[lag report] min:100 max:699 box:[ 101 101 276 ] hi:626 hiOutliers:24 2.42%
  ┌─────────┬───────┬────────┬──────────┐
  │ (index) │ start │ actual │ expected │
  ├─────────┼───────┼────────┼──────────┤
  │    0    │ 1815  │  697   │   100    │
  │    1    │ 5815  │  697   │   100    │
  │    2    │ 8815  │  697   │   100    │
  │    3    │ 9815  │  697   │   100    │
  │    4    │ 10815 │  697   │   100    │
  │    5    │ 12815 │  697   │   100    │
  │    6    │ 13815 │  697   │   100    │
  │    7    │ 17815 │  697   │   100    │
  │    8    │ 18815 │  697   │   100    │
  │    9    │ 21815 │  697   │   100    │
  │   10    │  815  │  698   │   100    │
  │   11    │ 2814  │  698   │   100    │
  │   12    │ 4814  │  698   │   100    │
  │   13    │ 7814  │  698   │   100    │
  │   14    │ 14814 │  698   │   100    │
  │   15    │ 16814 │  698   │   100    │
  │   16    │ 19814 │  698   │   100    │
  │   17    │ 22814 │  698   │   100    │
  │   18    │ 23814 │  698   │   100    │
  │   19    │ 3814  │  699   │   100    │
  │   20    │ 6813  │  699   │   100    │
  │   21    │ 11813 │  699   │   100    │
  │   22    │ 15813 │  699   │   100    │
  │   23    │ 20813 │  699   │   100    │
  └─────────┴───────┴────────┴──────────┘

Corresponding kernel stats:

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.stat
usage_usec 12794354
user_usec 12704213
system_usec 90141
core_sched.force_idle_usec 0
nr_periods 31
nr_throttled 31
throttled_usec 18312824
nr_bursts 0
burst_usec 0
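
In other words, every one of the 31 one-second periods in this ~30s window hit the 400ms cap (nr_throttled equals nr_periods), and throttled_usec / nr_periods ≈ 591ms of forced idle per period, which lines up with the ~600-700ms excursions in the lag report above.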

Example Excerpt: Go burning 4 CPU cores within a 2_000ms/1_000ms quota

Here's a similar result from running an analogous Go program for around 30 seconds:

2023/02/14 15:58:51 [lag report] min:27.89854ms max:472.131789ms box:[ 99.880519ms 99.992654ms 100.151136ms ] hi:101.992654ms hiOutliers:12 12.0%
2023/02/14 15:58:51 {Start:2023-02-14 15:58:39.214184729 -0500 EST m=+20.401723712 End:2023-02-14 15:58:39.485175472 -0500 EST m=+20.672714508 Actual:270.990796ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:49.014233509 -0500 EST m=+30.201772457 End:2023-02-14 15:58:49.485037021 -0500 EST m=+30.672576095 Actual:470.803638ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:45.014041338 -0500 EST m=+26.201580281 End:2023-02-14 15:58:45.485087655 -0500 EST m=+26.672626688 Actual:471.046407ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:51.01440381 -0500 EST m=+32.201942761 End:2023-02-14 15:58:51.485465242 -0500 EST m=+32.673004332 Actual:471.061571ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:43.014046723 -0500 EST m=+24.201585667 End:2023-02-14 15:58:43.48512526 -0500 EST m=+24.672664308 Actual:471.078641ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:44.01403297 -0500 EST m=+25.201571914 End:2023-02-14 15:58:44.485149938 -0500 EST m=+25.672688966 Actual:471.117052ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:48.01406235 -0500 EST m=+29.201601293 End:2023-02-14 15:58:48.48522296 -0500 EST m=+29.672762000 Actual:471.160707ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:50.014036425 -0500 EST m=+31.201575373 End:2023-02-14 15:58:50.485200964 -0500 EST m=+31.672739934 Actual:471.164561ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:40.014035733 -0500 EST m=+21.201574682 End:2023-02-14 15:58:40.485247383 -0500 EST m=+21.672786397 Actual:471.211715ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:42.014066621 -0500 EST m=+23.201605565 End:2023-02-14 15:58:42.485391197 -0500 EST m=+23.672930199 Actual:471.324634ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:46.014037924 -0500 EST m=+27.201576868 End:2023-02-14 15:58:46.485420931 -0500 EST m=+27.672959951 Actual:471.383083ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:47.014097749 -0500 EST m=+28.201636694 End:2023-02-14 15:58:47.486229431 -0500 EST m=+28.673768483 Actual:472.131789ms Expected:100ms}

Here the actual delays encountered are a little lower, since the CPU quota is a
little less oversubscribed. The low end of the box stat may also seem
surprising, but it is an artifact of how a constant-interval Go ticker behaves
after encountering runtime lag: coming out of a pause, it delivers a couple of
ticks in rapid succession.

Corresponding kernel stats:

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.stat
usage_usec 65863014
user_usec 60727707
system_usec 5135306
core_sched.force_idle_usec 0
nr_periods 32
nr_throttled 32
throttled_usec 32261496
nr_bursts 0
burst_usec 0

Example Excerpt: Go burning 8 CPU cores within a 400ms/1_000ms quota

For a final, extreme case, here's an even more over-subscribed Go run:

2023/02/14 16:03:14 [lag report] min:17.276279ms max:982.717708ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:15 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:16 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:17 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:18 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:19 [lag report] min:17.276279ms max:982.906476ms box:[ 17.482207ms 100.660104ms 982.523742ms ] no outliers withing threshold:2ms

Here there aren't any "outliers" under a classic boxplot analysis, because the 75%-ile is so heavily skewed up around 980ms.

@dmah42 (Contributor) commented Feb 15, 2023

this was completely my decision to hide the period, thinking it would make it simpler for end users to reason about, but you're right that it also limits flexibility, and more throttling comes at a significant cost.

so let's just expose the period :)

@bnm3k commented Mar 31, 2023

Just a heads up: the links to the Linux docs and the LinkedIn (2016), Indeed, and Dan Luu articles are missing.
