Run domains in a threadpool #1218

gvsg-rs · 2024-04-11T18:36:41Z

Previously, we used tokio::task::block_in_place to run domains, which blocks the thread it is running on until the task completes. This prevents the executor from using that thread to make progress on any other tasks, which wastes executor capacity. CL-1268 replaced the block_in_place strategy by spawning a native OS thread for each domain, which is how the system works today. Thread-per-domain is highly resource-inefficient and has also led to us hitting the upper limit (4096) on the number of threads with active tracing spans.

A quote from that CL:

The thing is spawn_blocking spawns a *blocking* task on a *blocking thread*, whereas our Replica is actually asynchronous, so it would not work at all. Moreover the blocking tasks run in a thread pool that has a limited size, and we don't know a priori how high to set it. It defaults to 512, but there is no reason for us not to have more domains, and once we run out, spawning stops.
Spawn blocking performs even worse than block_in_place BTW.

Generally speaking, blocking, I/O-bound work is very well-suited to tokio's built-in blocking threadpool. Further, it is now possible to configure the size of the blocking threadpool using the max_blocking_threads method on the runtime builder. We should re-investigate the performance of tokio's spawn_blocking method in the context of domains.
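As a rough illustration of the configuration mentioned above, here is a minimal sketch that builds a runtime with a capped blocking pool and submits more blocking tasks than there are blocking threads. It assumes the tokio crate with the "full" feature set. The function run_domain_blocking, the cap of 64, and the task count of 1,000 are hypothetical placeholders, not values from Readyset.

```rust
use std::time::Duration;
use tokio::runtime::Builder;

// Hypothetical stand-in for the blocking portion of a domain's work.
fn run_domain_blocking(domain_index: usize) -> usize {
    std::thread::sleep(Duration::from_millis(10));
    domain_index
}

fn main() -> std::io::Result<()> {
    let runtime = Builder::new_multi_thread()
        .enable_all()
        // Assumed cap for illustration; tokio's default is 512.
        .max_blocking_threads(64)
        .build()?;

    let completed = runtime.block_on(async {
        // Submit far more blocking tasks than there are blocking threads.
        let handles: Vec<_> = (0..1_000)
            .map(|i| tokio::task::spawn_blocking(move || run_domain_blocking(i)))
            .collect();

        let mut completed = 0usize;
        for handle in handles {
            handle.await.expect("blocking task panicked");
            completed += 1;
        }
        completed
    });

    println!("completed {completed} blocking tasks");
    Ok(())
}
```

Note that the tasks in this sketch are short-lived, so tasks submitted after the cap is reached wait in tokio's queue and run as threads free up. A long-running domain would hold its blocking thread for its whole lifetime, so whether this model suits domains is exactly what would need to be measured.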

For work that is CPU-bound, we should consider using the rayon crate, which is typically the go-to tool for spawning blocking CPU-bound tasks.
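To make the pool idea concrete without pulling in a dependency, here is a minimal sketch of a fixed-size worker pool built only on the standard library. It is not rayon and not how Readyset runs domains. It only shows the general shape of running many units of work on a bounded number of OS threads. The names process_domain and run_domains_on_pool are hypothetical, and the sum-of-squares workload is a placeholder for CPU-bound work.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical CPU-bound work for one domain: sum of squares 0..=domain_index.
pub fn process_domain(domain_index: usize) -> u64 {
    (0..=domain_index as u64).map(|n| n * n).sum()
}

/// Runs `domain_count` work items on exactly `worker_count` OS threads and
/// returns the results indexed by domain.
pub fn run_domains_on_pool(domain_count: usize, worker_count: usize) -> Vec<u64> {
    assert!(worker_count > 0, "the pool needs at least one worker");

    // Shared job queue: workers take turns receiving from one channel.
    let (job_tx, job_rx) = mpsc::channel::<usize>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (result_tx, result_rx) = mpsc::channel::<(usize, u64)>();

    let workers: Vec<_> = (0..worker_count)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let result_tx = result_tx.clone();
            thread::spawn(move || loop {
                // The lock guard is released at the end of this statement,
                // so the work below runs without holding the lock.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok(domain_index) => {
                        let value = process_domain(domain_index);
                        result_tx.send((domain_index, value)).unwrap();
                    }
                    // The job channel is closed and drained: shut down.
                    Err(_) => break,
                }
            })
        })
        .collect();
    // Drop the original sender so the result channel closes once all workers exit.
    drop(result_tx);

    for domain_index in 0..domain_count {
        job_tx.send(domain_index).unwrap();
    }
    // Closing the job channel tells the workers that no more work is coming.
    drop(job_tx);

    let mut results = vec![0u64; domain_count];
    for (domain_index, value) in result_rx {
        results[domain_index] = value;
    }
    for worker in workers {
        worker.join().unwrap();
    }
    results
}
```

Calling run_domains_on_pool(5000, 8) processes 5,000 work items while only ever creating 8 threads, so the thread count stays fixed no matter how many domains exist. A production pool such as rayon adds work stealing and panic propagation, which this sketch omits.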


altmannmarcelo · 2024-04-11T18:43:16Z

If we run against a setup that has a high number of tables, we can exceed the limit of threads and Readyset will panic:

Apr 11 17:20:35 ip-10-0-5-246 readyset[2686075]: Thread count overflowed the configured max count. Thread index = 4097, max threads = 4096.
Apr 11 17:20:35 ip-10-0-5-246 readyset[2686075]: thread 'Domain 1493.0.0' panicked at readyset-dataflow/src/domain/mod.rs:732:10:
Apr 11 17:20:35 ip-10-0-5-246 readyset[2686075]: called `Result::unwrap()` on an `Err` value: JoinError::Panic(Id(6925), ...)

A workaround is to limit the number of tables by either:

Limiting the number of databases in the --upstream-db-url
Limiting the number of tables via --replication-tables or --replication-tables-ignore

gvsg-rs assigned ethan-readyset Apr 11, 2024
