Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use thread pool for DeleteDanglingObjectStorageFilesStep #216

Merged
merged 5 commits into from May 17, 2024

Conversation

kmichel-aiven
Copy link
Contributor

CPU intensive steps running in the main thread block all async operations.

This include other concurrent requests, leaving astacus completely unresponsive.

Start moving some steps entirely in a thread pool to avoid that.

We were already using the thread pool a lot anyway, since our object storage library
is sync, but now instead of entering/leaving the pool for every small operation, we
run an entire step as a single task.

[DDB-951]

The sync variant will be directly used in some steps.
[DDB-951]
This implements `run_step` by running `run_sync_step` in a thread pool,
this can be used for CPU intensive steps (which are probably all
steps, given a large-enough backup).

[DDB-951]
The main goal is to migrate `DeleteDanglingObjectStorageFilesStep` to
run in a separate thread instead of keeping astacus stuck while the
step is iterating through very long lists of object storage files.

`RestoreObjectStorageFilesStep` was also migrated because it used
the same APIs as `DeleteDanglingObjectStorageFilesStep`, to avoid
keeping two implementations of everything.
There isn't anything async in this module.

[DDB-951]
@kmichel-aiven kmichel-aiven marked this pull request as ready for review April 21, 2024 13:09
@kmichel-aiven kmichel-aiven requested a review from a team April 30, 2024 09:48
@@ -17,6 +17,7 @@
from astacus.coordinator.manifest import download_backup_manifest, download_backup_min_manifest
from collections import Counter
from collections.abc import Sequence, Set
from starlette.concurrency import run_in_threadpool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no starlette dependency neither in astacus nor in aiven-core. I can imagine there's a transitive dependency, but if we start using it directly we should declare a direct dependency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already imported in two other places:

from starlette.concurrency import run_in_threadpool

from starlette.concurrency import run_in_threadpool

But yes, I'll add it to our direct dependencies 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right it was already there, it wasn't so urgent.

Copy link
Contributor

@alanfranz alanfranz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to take a further look, there're quite a few changes - I should be able to do that in a couple of hours. BTW:

since our object storage library is sync

This seems quite a shortcoming and I wonder whether this should be addressed somehow.

@kmichel-aiven
Copy link
Contributor Author

kmichel-aiven commented May 1, 2024

since our object storage library is sync

This seems quite a shortcoming and I wonder whether this should be addressed somehow.

It's also true of the zookeeper client, cassandra client, etc. Approximately everything except astacus client/server HTTP is sync code wrapped in threads and async awaited.

Given than "doing CPU-intensive work" happens in many place and is hard to diagnose (problems only happen with large backups), I'm leaning towards addressing it by progressively moving everything to threads and getting rid of async in all of astacus, not by trying to find async libraries for everything. We don't really need the theoretical efficiency of an astacus able to answer to thousand of concurrent queries.

(By "moving to threads", I mean "one query per thread" + "a few pools for things that are worth parallelizing and currently handled with asyncio.gather", not total chaos and tons of locks.)

It was already an indirect dependency and already imported directly in
 a few places, but we should make that explicit.
@alanfranz
Copy link
Contributor

I'm leaning towards addressing it by progressively moving everything to threads and getting rid of async in all of astacus,

+1 on this. Async applications are usually efficient when everything is async, with only the occasional reliance on threads. If it's "oh my God it's full of threads" it's probably just easier to do everything in sync, rather than wasting time on making the two "colors" interoperate.

@kmichel-aiven kmichel-aiven requested review from alanfranz and a team May 14, 2024 14:20
Copy link
Contributor

@alanfranz alanfranz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alanfranz alanfranz merged commit 20a9c50 into main May 17, 2024
2 checks passed
@alanfranz alanfranz deleted the kmichel-threaded-steps branch May 17, 2024 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants