
Concerns About CPU Quota Implementation #407

Open · jcorbin opened this issue Feb 14, 2023 · 2 comments

Comments

@jcorbin (Contributor) commented Feb 14, 2023

Background: Linux CFS Throttling

In short, CFS CPU throttling works by:

  • defining an on-CPU time quota over a given wallclock period
  • once a cgroup exceeds its quota within a period, it will not be scheduled at
    all until the next period
  • so the more CPU thirsty your workload is, the more you have to be mindful
    about staying within its time quota, or suffer potentially large latency
    artifacts due to being descheduled

In long: see the Linux kernel's CFS bandwidth control documentation, plus the LinkedIn (2016), Indeed, and Dan Luu write-ups on CFS throttling.
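
To make that concrete, here's a minimal sketch (my own illustration: it assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup, root privileges, and a made-up group name) of the two kernel files involved: cpu.max takes "<quota_usec> <period_usec>", and cpu.stat reports how often and for how long the group was throttled:

// Minimal sketch, not production code: set a CFS quota/period on a cgroup v2
// group and read back its throttling counters.
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    cg := "/sys/fs/cgroup/demo-cell" // hypothetical cgroup path

    if err := os.MkdirAll(cg, 0o755); err != nil {
        panic(err)
    }

    // 50ms of on-CPU time per 100ms period, i.e. half a core; both values are
    // microseconds. The kernel default is "max 100000" (unlimited, 100ms period).
    if err := os.WriteFile(filepath.Join(cg, "cpu.max"), []byte("50000 100000"), 0o644); err != nil {
        panic(err)
    }

    // cpu.stat's nr_throttled / throttled_usec lines (seen in the excerpts
    // further below) count how many periods hit the quota and how long the
    // group sat descheduled as a result.
    stat, err := os.ReadFile(filepath.Join(cg, "cpu.stat"))
    if err != nil {
        panic(err)
    }
    fmt.Print(string(stat))
}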

Background: Aurae

Aurae uses CFS throttling to enforce cell CPU time quotas, similarly to Docker
as described in the various background articles above.

However, it has made the interesting choice to hide the CFS throttling period,
exposing only a max time quota field in its API. Furthermore, Aurae has
hardcoded the CFS period to 1s, which is 10x its typical default value of
100ms.
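
Concretely, that means a cell created with a 400ms max ends up with a cgroup v2 cpu.max along these lines (the path mirrors the cpu.stat excerpts below; under the kernel's default period the second field would be 100000):

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.max
400000 1000000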

Problem: Large Latency Artifacts

The primary problem with how Aurae's CPU quotas currently work is large latency artifacts:

  • once a cell has spent its max CPU time, none of its running processes will be
    scheduled for the rest of an entire second
  • let's use the common 400ms setting used in many of the examples, and assume a
    mundane 100% CPU utilization (1 OS CPU core): after running for 400ms, the
    cell then pauses for 600ms
  • the problem only gets worse the more parallel a cell's workload is: if the
    cell has 4 hot/running kernel tasks (threads, processes, VMs, whatever) then
    the cell will run for 100ms then pause for 900ms

See the example section below for code and excerpt data exploring this effect.
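
To sanity-check that arithmetic, here's a tiny back-of-the-envelope sketch (my own illustration, not code from #406) of the worst-case pause for a fully saturated cell:

// Back-of-the-envelope model of the worst-case CFS pause: a saturated cell
// burns through its quota in quota/parallelism of wallclock time, then sits
// descheduled until the next period boundary.
package main

import (
    "fmt"
    "time"
)

func worstCasePause(quota, period time.Duration, parallelism int) time.Duration {
    running := quota / time.Duration(parallelism) // wallclock time until the quota is spent
    if running >= period {
        return 0 // the quota outlasts the period; throttling never kicks in
    }
    return period - running
}

func main() {
    quota, period := 400*time.Millisecond, time.Second // Aurae's hardcoded 1s period
    for _, p := range []int{1, 4, 8} {
        fmt.Printf("%d hot task(s): runs %v, then pauses %v\n",
            p, quota/time.Duration(p), worstCasePause(quota, period, p))
    }
}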

In the case of a request-processing service, these are SLO-breaking levels of latency.
In fact, the typical 100ms CFS period is already material to such concerns.

Having even larger latency artifacts, now measured in the 600ms-900ms range,
might even be bad enough to affect things like health checks and cluster
traffic routing systems.

Proposal: at least expose the CFS period; maybe lower its default

At the very least, in my view, Aurae needs to expose the CFS period alongside
the max CPU time.

I'm less convinced about lowering the default:

  • to me, Aurae's choice to directly expose a CFS time quota, rather than a
    more abstracted percentage or logical core count, is a laudable one
  • to that end, I feel that users (especially ones for whom a latency SLO is a
    thing) should have to directly answer a question like "When my cell saturates
    its CPU quota, what is the time penalty we're willing to pay for that?"
  • to that end, I'd prefer to see every cell CPU allocation specify both
    parts; but if we're not going to formally require it, having a default period
    so large as to be unacceptable for anyone's SLO would also do, since in
    practice it'll force everyone to specify a period eventually
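
For illustration only (hypothetical names, not Aurae's actual API), a cell CPU allocation that makes callers answer both questions might look something like:

// Hypothetical shape for a cell CPU allocation exposing both CFS knobs; the
// type and field names here are illustrative, not Aurae's actual API.
package cellspec

import "time"

type CPUSpec struct {
    // Max is the on-CPU time the cell may spend per Period before CFS
    // deschedules it until the next period boundary.
    Max time.Duration

    // Period is the CFS enforcement window. The worst-case pause a saturated
    // cell can see is roughly Period minus the wallclock time it takes to
    // spend Max.
    Period time.Duration
}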

Example Code / Data

To confirm my own recollection of how this all works, and allow easy
reproduction by others, I've written some example programs in #406:

  • two example programs that burn as much CPU as they can, as a saturation test
    • the Node.JS one can only burn (a bit more than) 1 CPU core due to how
      Node/JavaScript work currently
    • the Go version can burn as many cores as you've got; indeed, one must
      specify GOMAXPROCS when running it, since the Go runtime is still not
      container aware without a 3rd-party library
  • the supporting auraescript runner is a bit hacky, but gets the job done to
    demonstrate CPU quota saturation today; see its // NOTE comment after its
    cells.start call for instructions.
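
For reference, a minimal stand-in for the Go burner looks roughly like this (my own sketch, not the actual code from #406):

// Minimal sketch of a CPU saturation test: burn GOMAXPROCS cores while a
// ticker measures how late each 100ms tick arrives, which is where the
// throttling pauses show up. Not the actual #406 program.
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    // Burn every core the runtime will give us. GOMAXPROCS must be set
    // explicitly when running under a CPU quota, since the Go runtime is not
    // cgroup aware on its own.
    for i := 0; i < runtime.GOMAXPROCS(0); i++ {
        go func() {
            for {
            } // spin forever, burning one core
        }()
    }

    const interval = 100 * time.Millisecond
    prev := time.Now()
    for range time.Tick(interval) {
        now := time.Now()
        if lag := now.Sub(prev) - interval; lag > 10*time.Millisecond {
            fmt.Printf("tick arrived %v late\n", lag)
        }
        prev = now
    }
}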

Example Excerpt: Node.JS burning about 1 CPU core within a 400ms/1_000ms quota

After running for around 30 seconds, the node example program experiences 600-700ms latency excursions:

[lag report] min:100 max:699 box:[ 101 101 276 ] hi:626 hiOutliers:24 2.42%
  ┌─────────┬───────┬────────┬──────────┐
  │ (index) │ start │ actual │ expected │
  ├─────────┼───────┼────────┼──────────┤
  │    0    │ 1815  │  697   │   100    │
  │    1    │ 5815  │  697   │   100    │
  │    2    │ 8815  │  697   │   100    │
  │    3    │ 9815  │  697   │   100    │
  │    4    │ 10815 │  697   │   100    │
  │    5    │ 12815 │  697   │   100    │
  │    6    │ 13815 │  697   │   100    │
  │    7    │ 17815 │  697   │   100    │
  │    8    │ 18815 │  697   │   100    │
  │    9    │ 21815 │  697   │   100    │
  │   10    │  815  │  698   │   100    │
  │   11    │ 2814  │  698   │   100    │
  │   12    │ 4814  │  698   │   100    │
  │   13    │ 7814  │  698   │   100    │
  │   14    │ 14814 │  698   │   100    │
  │   15    │ 16814 │  698   │   100    │
  │   16    │ 19814 │  698   │   100    │
  │   17    │ 22814 │  698   │   100    │
  │   18    │ 23814 │  698   │   100    │
  │   19    │ 3814  │  699   │   100    │
  │   20    │ 6813  │  699   │   100    │
  │   21    │ 11813 │  699   │   100    │
  │   22    │ 15813 │  699   │   100    │
  │   23    │ 20813 │  699   │   100    │
  └─────────┴───────┴────────┴──────────┘

Corresponding kernel stats:

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.stat
usage_usec 12794354
user_usec 12704213
system_usec 90141
core_sched.force_idle_usec 0
nr_periods 31
nr_throttled 31
throttled_usec 18312824
nr_bursts 0
burst_usec 0
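
In other words, every one of the 31 one-second periods in this ~30s window hit the 400ms cap (nr_throttled equals nr_periods), and throttled_usec / nr_periods ≈ 591ms of forced idle per period, which lines up with the ~600-700ms excursions in the lag report above.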

Example Excerpt: Go burning 4 CPU cores within a 2_000ms/1_000ms quota

Here's a similar result from running an analogous Go program for around 30 seconds:

2023/02/14 15:58:51 [lag report] min:27.89854ms max:472.131789ms box:[ 99.880519ms 99.992654ms 100.151136ms ] hi:101.992654ms hiOutliers:12 12.0%
2023/02/14 15:58:51 {Start:2023-02-14 15:58:39.214184729 -0500 EST m=+20.401723712 End:2023-02-14 15:58:39.485175472 -0500 EST m=+20.672714508 Actual:270.990796ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:49.014233509 -0500 EST m=+30.201772457 End:2023-02-14 15:58:49.485037021 -0500 EST m=+30.672576095 Actual:470.803638ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:45.014041338 -0500 EST m=+26.201580281 End:2023-02-14 15:58:45.485087655 -0500 EST m=+26.672626688 Actual:471.046407ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:51.01440381 -0500 EST m=+32.201942761 End:2023-02-14 15:58:51.485465242 -0500 EST m=+32.673004332 Actual:471.061571ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:43.014046723 -0500 EST m=+24.201585667 End:2023-02-14 15:58:43.48512526 -0500 EST m=+24.672664308 Actual:471.078641ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:44.01403297 -0500 EST m=+25.201571914 End:2023-02-14 15:58:44.485149938 -0500 EST m=+25.672688966 Actual:471.117052ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:48.01406235 -0500 EST m=+29.201601293 End:2023-02-14 15:58:48.48522296 -0500 EST m=+29.672762000 Actual:471.160707ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:50.014036425 -0500 EST m=+31.201575373 End:2023-02-14 15:58:50.485200964 -0500 EST m=+31.672739934 Actual:471.164561ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:40.014035733 -0500 EST m=+21.201574682 End:2023-02-14 15:58:40.485247383 -0500 EST m=+21.672786397 Actual:471.211715ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:42.014066621 -0500 EST m=+23.201605565 End:2023-02-14 15:58:42.485391197 -0500 EST m=+23.672930199 Actual:471.324634ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:46.014037924 -0500 EST m=+27.201576868 End:2023-02-14 15:58:46.485420931 -0500 EST m=+27.672959951 Actual:471.383083ms Expected:100ms}
2023/02/14 15:58:51 {Start:2023-02-14 15:58:47.014097749 -0500 EST m=+28.201636694 End:2023-02-14 15:58:47.486229431 -0500 EST m=+28.673768483 Actual:472.131789ms Expected:100ms}

Here the actual delays encountered are a little lower, since the CPU quota is a
little less oversubscribed. The low end of the box stat may also seem
surprising, but it is an artifact of how a constant-interval Go ticker behaves
after encountering runtime lag: coming out of a pause, it delivers a couple of
ticks in rapid succession.

Corresponding kernel stats:

$ cat /sys/fs/cgroup/cpu-burn-room/cpu.stat
usage_usec 65863014
user_usec 60727707
system_usec 5135306
core_sched.force_idle_usec 0
nr_periods 32
nr_throttled 32
throttled_usec 32261496
nr_bursts 0
burst_usec 0

Example Excerpt: Go burning 8 CPU cores within a 400ms/1_000ms quota

For a final, extreme case, here's an even more over-subscribed Go run:

2023/02/14 16:03:14 [lag report] min:17.276279ms max:982.717708ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:15 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:16 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:17 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:18 [lag report] min:17.276279ms max:982.719185ms box:[ 17.490314ms 100.660104ms 982.505146ms ] no outliers withing threshold:2ms
2023/02/14 16:03:19 [lag report] min:17.276279ms max:982.906476ms box:[ 17.482207ms 100.660104ms 982.523742ms ] no outliers withing threshold:2ms

Here there aren't any "outliers" under a classic boxplot analysis, because the 75%-ile is so heavily skewed up around 980ms.

@dmah42 (Contributor) commented Feb 15, 2023

this was completely my decision to hide the period, thinking it would make it simpler for end users to reason about, but you're right that it also limits flexibility, and more throttling comes at a significant cost.

so let's just expose the period :)

@bnm3k commented Mar 31, 2023

Just a heads up: the links to the Linux docs and the LinkedIn (2016), Indeed, and Dan Luu articles are missing.
