Use thread pool for DeleteDanglingObjectStorageFilesStep #216

kmichel-aiven · 2024-04-21T12:43:39Z

CPU-intensive steps running in the main thread block all async operations.

This includes other concurrent requests, leaving astacus completely unresponsive.

Start moving some steps entirely into a thread pool to avoid that.

We were already using the thread pool a lot anyway, since our object storage library
is sync, but now instead of entering/leaving the pool for every small operation, we
run an entire step as a single task.
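To illustrate the difference between the two shapes, here is a minimal sketch. It is not the astacus code: `delete_object`, the `dangling/` prefix and the step functions are hypothetical, and the standard library's `asyncio.to_thread` stands in for `run_in_threadpool` so the example has no third-party dependency.

```python
import asyncio


def delete_object(name: str) -> str:
    """Hypothetical stand-in for a blocking object storage call."""
    return f"deleted {name}"


async def step_per_operation(names: list[str]) -> list[str]:
    # Old shape: enter and leave the thread pool for every small operation.
    # The sorting and filtering between calls still run on the event loop
    # thread, so a very long list keeps the loop busy.
    results = []
    for name in sorted(names):
        if name.startswith("dangling/"):
            results.append(await asyncio.to_thread(delete_object, name))
    return results


def step_sync(names: list[str]) -> list[str]:
    # New shape: plain blocking code with no awaits.
    return [
        delete_object(name)
        for name in sorted(names)
        if name.startswith("dangling/")
    ]


async def step_as_single_task(names: list[str]) -> list[str]:
    # The whole step, including the iteration, runs in one worker thread.
    return await asyncio.to_thread(step_sync, names)
```

Both variants return the same result; only the second keeps the iteration itself off the event loop thread.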

[DDB-951]

The sync variant will be directly used in some steps. [DDB-951]

This implements `run_step` by running `run_sync_step` in a thread pool; this can be used for CPU-intensive steps (which are probably all steps, given a large-enough backup). [DDB-951]
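A rough sketch of what such a `SyncStep[T]` variant could look like follows. The method names `run_step` and `run_sync_step` come from the commit message; everything else is an assumption: the single `context` parameter is a simplification, `CountFilesStep` is a made-up example step, and `asyncio.to_thread` is used in place of `run_in_threadpool` to keep the snippet standard-library only.

```python
import asyncio
from typing import Generic, TypeVar

T = TypeVar("T")


class Step(Generic[T]):
    """Simplified async step interface (signature is an assumption)."""

    async def run_step(self, context: dict) -> T:
        raise NotImplementedError


class SyncStep(Step[T]):
    """A step written as blocking code and executed in a worker thread."""

    def run_sync_step(self, context: dict) -> T:
        raise NotImplementedError

    async def run_step(self, context: dict) -> T:
        # The entire sync step becomes a single thread pool task, so the
        # event loop stays free to serve other requests in the meantime.
        return await asyncio.to_thread(self.run_sync_step, context)


class CountFilesStep(SyncStep[int]):
    """Hypothetical step that iterates over a long list of files."""

    def run_sync_step(self, context: dict) -> int:
        return sum(1 for _ in context["files"])
```

Callers keep awaiting `run_step` as before; only the step author switches to writing `run_sync_step`.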

The main goal is to migrate `DeleteDanglingObjectStorageFilesStep` to run in a separate thread instead of keeping astacus stuck while the step is iterating through very long lists of object storage files. `RestoreObjectStorageFilesStep` was also migrated because it used the same APIs as `DeleteDanglingObjectStorageFilesStep`, to avoid keeping two implementations of everything.

There isn't anything async in this module. [DDB-951]

alanfranz · 2024-05-01T18:16:58Z

astacus/coordinator/plugins/base.py

@@ -17,6 +17,7 @@
 from astacus.coordinator.manifest import download_backup_manifest, download_backup_min_manifest
 from collections import Counter
 from collections.abc import Sequence, Set
+from starlette.concurrency import run_in_threadpool


I see no starlette dependency in either astacus or aiven-core. I can imagine there's a transitive dependency, but if we start using it directly we should declare a direct dependency.

It's already imported in two other places:

astacus/astacus/common/asyncstorage.py

Line 9 in cc4f337

from starlette.concurrency import run_in_threadpool

astacus/astacus/common/cassandra/client.py

Line 19 in cc4f337

from starlette.concurrency import run_in_threadpool

But yes, I'll add it to our direct dependencies 👍

You're right, it was already there, so it wasn't that urgent.

alanfranz

I need to take a further look, there're quite a few changes - I should be able to do that in a couple of hours. BTW:

> since our object storage library is sync

This seems quite a shortcoming and I wonder whether this should be addressed somehow.

kmichel-aiven · 2024-05-01T18:42:30Z

> > since our object storage library is sync
>
> This seems quite a shortcoming and I wonder whether this should be addressed somehow.

It's also true of the zookeeper client, cassandra client, etc. Approximately everything except astacus client/server HTTP is sync code wrapped in threads and async awaited.

Given that "doing CPU-intensive work" happens in many places and is hard to diagnose (problems only happen with large backups), I'm leaning towards addressing it by progressively moving everything to threads and getting rid of async in all of astacus, not by trying to find async libraries for everything. We don't really need the theoretical efficiency of an astacus able to answer thousands of concurrent queries.

(By "moving to threads", I mean "one query per thread" + "a few pools for things that are worth parallelizing and currently handled with asyncio.gather", not total chaos and tons of locks.)
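As a sketch of the "few pools" idea, a fan-out currently written with `asyncio.gather` could become a plain `ThreadPoolExecutor.map` call in sync code. The names `fetch_size` and `total_size` and the default pool size are hypothetical and only serve the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_size(key: str) -> int:
    """Hypothetical blocking call, e.g. a metadata lookup for one object."""
    return len(key)


def total_size(keys: list[str], max_workers: int = 4) -> int:
    # Sync counterpart of `sum(await asyncio.gather(*[...]))`: the pool
    # runs the blocking calls in parallel and map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(fetch_size, keys))
```

The calling code stays linear and needs no locks, since each task only reads its own argument and returns a value.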

It was already an indirect dependency and already imported directly in a few places, but we should make that explicit.

alanfranz · 2024-05-01T19:06:58Z

> I'm leaning towards addressing it by progressively moving everything to threads and getting rid of async in all of astacus,

+1 on this. Async applications are usually efficient when everything is async, with only the occasional reliance on threads. If it's "oh my God it's full of threads" it's probably just easier to do everything in sync, rather than wasting time on making the two "colors" interoperate.

alanfranz

LGTM

kmichel-aiven added 4 commits April 21, 2024 14:08

extract sync code from download_backup_manifest

8fd1a21

The sync variant will be directly used in some steps. [DDB-951]

add SyncStep[T] variant of Step[T]

972fb26

This implements `run_step` by running `run_sync_step` in a thread pool; this can be used for CPU-intensive steps (which are probably all steps, given a large-enough backup). [DDB-951]

clickhouse: rename async_object_storage to object_storage

8345bcd

There isn't anything async in this module. [DDB-951]

kmichel-aiven marked this pull request as ready for review April 21, 2024 13:09

kmichel-aiven requested a review from a team April 30, 2024 09:48

alanfranz reviewed May 1, 2024

add starlette as a direct dependency

dfd86af

It was already an indirect dependency and already imported directly in a few places, but we should make that explicit.

kmichel-aiven requested review from alanfranz and a team May 14, 2024 14:20

alanfranz approved these changes May 17, 2024

alanfranz merged commit 20a9c50 into main May 17, 2024
2 checks passed

alanfranz deleted the kmichel-threaded-steps branch May 17, 2024 09:21
