S3Client instantiation extremely slow #2880

Open
pitrou opened this issue Feb 29, 2024 · 8 comments
Labels: feature-request, p3

Comments


pitrou commented Feb 29, 2024

Describe the bug

One of our users has reported that creating an S3Client instance can take on the order of a millisecond. According to profiler statistics, most of that time is spent in Aws::Crt::Endpoints::RuleEngine (see the linked issue).

Expected Behavior

I would expect an S3Client to be reasonably fast to instantiate (less than a microsecond).

Current Behavior

See the description and profile graph excerpt in the linked issue.

Reproduction Steps

Using PyArrow:

```python
>>> from pyarrow.fs import S3FileSystem
>>> %timeit s = S3FileSystem()
1.29 ms ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True)
1.29 ms ± 5.52 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit s = S3FileSystem(anonymous=True, region='eu-west-1')
1.29 ms ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

1.11.267

Compiler and Version used

gcc 12.3.0

Operating System and version

Ubuntu 22.04

pitrou added the bug and needs-triage labels on Feb 29, 2024

pitrou commented Feb 29, 2024

cc @fjetter


pitrou commented Feb 29, 2024

Ok, so it seems the JSON parsing step that takes most of the time in the profile graphs may be spent parsing this JSON string, hardcoded (!!) in the SDK's generated C++ source code:

```cpp
static constexpr RulesBlobT RulesBlob = {{
```

It decodes into this heavily-nested object (here in Python representation):
https://gist.github.com/pitrou/c4978f29be9d2d3cb9574cd9d262490a


jmklix (Member) commented Feb 29, 2024

Thanks for pointing this out. Currently it's behaving within expected boundaries, but I'm interested to hear more of your thoughts on this. How fast are you wanting/expecting the S3Client to instantiate? You shouldn't need to instantiate the client that often, as it can be reused.

I noticed in the issue linked above that you improved the performance of your tests to less than 53 µs with a "caching PoC". Were there any changes you were wanting on the SDK side?

jmklix added the response-requested and p2 labels and removed the needs-triage label on Feb 29, 2024

pitrou commented Mar 1, 2024

> How fast are you wanting/expecting the S3Client to instantiate? You shouldn't need to instantiate the client that often, as it can be reused.

I'll let @fjetter elaborate on their situation, but when distributing individual tasks over a cluster of workers, there's a need to deserialize everything required to run those tasks. If a task entails loading data over S3 (with potentially different configurations, since tasks from multiple users or workloads might be in flight), that implies recreating an S3Client each time. Depending on task granularity, I suspect 1 ms to instantiate an S3Client might appear as a significant contributor in performance profiles.


pitrou commented Mar 1, 2024

For the record, here's the current prototype, which seems to work on our CI. I ended up caching endpoint providers based on the S3 client configuration's relevant options (the ones that influence provider initialization). There's an additional complication (InitOnceEndpointProvider) because S3Client::S3Client always reconfigures the endpoint provider, even when it is explicitly passed by the caller, and that reconfiguration is not thread-safe.
https://github.com/apache/arrow/blob/2be51947448aac17da5eb4e7b284483da72f7f41/cpp/src/arrow/filesystem/s3fs.cc#L916-L1022

It would have been simpler if I could have explicitly created a shared RuleEngine with the S3 default rules and instantiated each S3EndpointProvider from that same RuleEngine.
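
For readers who don't want to follow the link, here is a condensed, illustrative sketch of the caching idea. Names such as `EndpointProviderCache` and the exact cache key are hypothetical; the real code in s3fs.cc keys on more options and adds the InitOnceEndpointProvider wrapper discussed below:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <utility>

#include <aws/s3/S3EndpointProvider.h>

// Illustrative only: cache one endpoint provider per distinct endpoint
// configuration. The expensive part (parsing the endpoint rules JSON into
// a RuleEngine) happens when the provider is constructed, so it now runs
// once per configuration instead of once per client.
class EndpointProviderCache {
 public:
  std::shared_ptr<Aws::S3::Endpoint::S3EndpointProvider> Lookup(
      const Aws::S3::S3ClientConfiguration& config) {
    // Hypothetical cache key: only options that influence provider
    // initialization (the real code considers more of them).
    auto key = std::make_pair(config.region, config.useFIPS);
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      auto provider = Aws::MakeShared<Aws::S3::Endpoint::S3EndpointProvider>(
          "EndpointProviderCache");
      provider->InitBuiltInParameters(config);
      it = cache_.emplace(std::move(key), std::move(provider)).first;
    }
    return it->second;
  }

  // Must run before Aws::ShutdownAPI, otherwise the cached providers
  // outlive the SDK's global state.
  void Clear() {
    std::lock_guard<std::mutex> lock(mutex_);
    cache_.clear();
  }

 private:
  using Key = std::pair<Aws::String, bool>;
  std::mutex mutex_;
  std::map<Key, std::shared_ptr<Aws::S3::Endpoint::S3EndpointProvider>> cache_;
};
```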


fjetter commented Mar 1, 2024

> How fast are you wanting/expecting the S3Client to instantiate? You shouldn't need to instantiate the client that often, as it can be reused.

> I'll let @fjetter elaborate on their situation, but when distributing individual tasks over a cluster of workers, there's a need to deserialize everything required to run those tasks. If a task entails loading data over S3 (with potentially different configurations, since tasks from multiple users or workloads might be in flight), that implies recreating an S3Client each time. Depending on task granularity, I suspect 1 ms to instantiate an S3Client might appear as a significant contributor in performance profiles.

That sums it up nicely. Due to how Arrow and Dask are built, we end up instantiating possibly thousands of clients, adding up to a couple of seconds of latency whenever we're trying to read a dataset. We essentially end up creating one S3Client per file, and reusing it is a little difficult at this point.

github-actions bot removed the response-requested label on Mar 2, 2024

jmklix (Member) commented Mar 8, 2024

This is ultimately a feature request, so I will be relabeling this issue as a feature-request. It's something we would like to improve the speed of, but I don't have a timeline for when that might happen.

In the short term, I would recommend using a single endpoint resolver for all of your S3Clients. You can do this by passing one in when you initialize the client.
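
For example (a minimal sketch, assuming the SDK 1.11.x S3Client constructor overload that accepts an endpoint provider; variable names are mine), sharing one provider across clients avoids re-parsing the endpoint rules for each client:

```cpp
#include <aws/core/Aws.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/S3EndpointProvider.h>

int main() {
  Aws::SDKOptions options;
  Aws::InitAPI(options);
  {
    Aws::S3::S3ClientConfiguration config;
    config.region = "eu-west-1";

    // Build the endpoint provider (and its underlying rule engine) once...
    auto sharedProvider =
        Aws::MakeShared<Aws::S3::Endpoint::S3EndpointProvider>("shared-endpoint-provider");

    // ...then hand the same instance to every client, instead of letting
    // each S3Client construct its own and re-parse the rules JSON.
    Aws::S3::S3Client client1(config, sharedProvider);
    Aws::S3::S3Client client2(config, sharedProvider);
  }
  Aws::ShutdownAPI(options);
  return 0;
}
```

Note the caveat raised in the following comments: the S3Client constructor reconfigures whatever provider it is given, so naively sharing one across threads is not safe.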

jmklix added the feature-request and p3 labels and removed the p2 and bug labels on Mar 8, 2024

pitrou commented Mar 14, 2024

> In the short term, I would recommend using a single endpoint resolver for all of your S3Clients. You can do this by passing one in when you initialize the client.

We have a proposed workaround now, but it's slightly more complicated than that:

  • we are a library, so we cannot assume a single endpoint resolver, since different S3Clients may be requested with different endpoint configurations; we need to maintain a cache (which must also be appropriately flushed before S3 shutdown)
  • S3Client unfortunately performs non-thread-safe reconfiguration of the given endpoint resolver in its constructor. That's harmless if the endpoint resolver is specific to a given client, but not if it is shared between multiple clients. We therefore had to implement an immutable endpoint resolver wrapper (sketched below).
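
For illustration, a condensed sketch of that wrapper. The real InitOnceEndpointProvider is in the s3fs.cc link above; the override signatures below follow my reading of the SDK 1.11.x headers and may differ in detail:

```cpp
#include <memory>

#include <aws/s3/S3EndpointProvider.h>

// Wraps an already configured endpoint provider and turns the mutating
// calls that S3Client's constructor makes into no-ops, so the wrapped
// provider can be shared between multiple clients.
class InitOnceEndpointProvider : public Aws::S3::S3EndpointProviderBase {
 public:
  explicit InitOnceEndpointProvider(
      std::shared_ptr<Aws::S3::S3EndpointProviderBase> wrapped)
      : wrapped_(std::move(wrapped)) {}

  // S3Client calls this from its constructor; ignore it, since the
  // wrapped provider was already initialized with its configuration.
  void InitBuiltInParameters(const Aws::S3::S3ClientConfiguration&) override {}

  // Ditto for endpoint overrides after construction (the real wrapper
  // logs an error here).
  void OverrideEndpoint(const Aws::String&) override {}

  Aws::S3::Endpoint::S3ClientContextParameters& AccessClientContextParameters() override {
    // Handing out a mutable reference would defeat immutability; the
    // real wrapper logs an error before forwarding.
    return wrapped_->AccessClientContextParameters();
  }

  const Aws::S3::Endpoint::S3ClientContextParameters& GetClientContextParameters() const override {
    return wrapped_->GetClientContextParameters();
  }

  // The only call that must actually do work: forward resolution to the
  // shared, fully initialized provider.
  Aws::Endpoint::ResolveEndpointOutcome ResolveEndpoint(
      const Aws::Endpoint::EndpointParameters& params) const override {
    return wrapped_->ResolveEndpoint(params);
  }

 private:
  std::shared_ptr<Aws::S3::S3EndpointProviderBase> wrapped_;
};
```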
