(Formerly, #5899 tracked throttling for both the read and the write path. We moved the write-path throttle here to reduce scope and be able to close #5899.)
Motivation
Walingest is compute-intensive and affects shared resources (TX to S3, global rate limits to S3, etc.). A single tenant shouldn't be able to exhaust them.
We should still have back-pressure as a defense-in-depth mechanism, but that's separate from throttling.
DoD
Pageserver artificially caps the per-tenant throughput on the write path (=ingest).
I.e., to all upstream Neon components, this cap will appear to be the maximum ingest throughput attainable per tenant per pageserver.
Like with #5899, the limit will be chosen such that a TBD (small single-digit) number of tenants can run at the limit. Discovery of the limit values is done through gradual rollout, conservative experimentation, and informed by benchmarks.
There is enough observability to clearly disambiguate slowness induced by throttling from slowness caused by an otherwise-slow pageserver. This disambiguation must be at per-tenant (better: per-timeline) granularity.
The limits are on-by-default and cannot be permanently overridden on a per-tenant basis.
I.e., the implementation need not be suited for productization as a "performance tier" or "QoS" feature.
TBD: specify how exactly the backpressure is propagated to SKs and Computes. The current "max lag" is insufficient; it's a hard limit.
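A per-tenant throughput cap like the one described above is commonly implemented as a token bucket. The sketch below is illustrative only: the struct name, units, and limit values are assumptions for this example, not the pageserver's actual API.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-tenant (per-shard) ingest throttle.
/// Capacity is the burst allowance; refill_per_sec is the sustained cap.
struct IngestThrottle {
    capacity: f64,        // bytes of burst
    refill_per_sec: f64,  // bytes per second, sustained
    tokens: f64,
    last_refill: Instant,
}

impl IngestThrottle {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self {
            capacity,
            refill_per_sec,
            tokens: capacity,
            last_refill: Instant::now(),
        }
    }

    /// Deducts `bytes` from the budget and returns how long the caller
    /// must wait before proceeding (Duration::ZERO if within budget).
    fn acquire(&mut self, bytes: f64) -> Duration {
        // Refill proportionally to elapsed time, capped at capacity.
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;

        if self.tokens >= bytes {
            self.tokens -= bytes;
            Duration::ZERO
        } else {
            // Not enough budget: report the time until the deficit refills.
            let deficit = bytes - self.tokens;
            self.tokens = 0.0;
            Duration::from_secs_f64(deficit / self.refill_per_sec)
        }
    }
}
```

A caller receiving a non-zero duration would sleep before applying the WAL batch; accumulating that slept time in a per-tenant (or per-timeline) metric is one way to satisfy the observability requirement of distinguishing throttle-induced slowness from a genuinely slow pageserver.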
Interactions
Sharding: with sharding, the throttling happens per shard instead of per tenant. Exactly like in #5899.
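Keying the throttle state by shard rather than by tenant could look like the following sketch. The identifier types and field names are hypothetical; the real pageserver has its own tenant/shard identity types.

```rust
use std::collections::HashMap;

// Hypothetical identifiers for illustration only.
type TenantId = u64;
type ShardNumber = u8;

/// Each shard gets an independent budget, so a tenant split into N shards
/// (typically placed on different pageservers) is capped per shard, not
/// in aggregate across shards.
struct WriteThrottles {
    limit_bytes_per_sec: u64,
    budgets: HashMap<(TenantId, ShardNumber), u64>,
}

impl WriteThrottles {
    /// Looks up (or lazily creates) the budget for one shard.
    fn budget_for(&mut self, tenant: TenantId, shard: ShardNumber) -> &mut u64 {
        let limit = self.limit_bytes_per_sec;
        self.budgets.entry((tenant, shard)).or_insert(limit)
    }
}
```

Spending from one shard's budget leaves sibling shards of the same tenant untouched, matching the per-shard semantics described for #5899.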
High-Level Plan
Write Path