
Huge RAM usage on big file uploads #2856

Open
thierryba opened this issue Feb 13, 2024 · 7 comments
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue

Comments

@thierryba
Contributor

Describe the bug

I want to upload a big file, so I increased the partSize in my S3CrtClient configuration.
However, the RAM consumption of my process appears to be a direct multiple (around 20x) of that value: when I set partSize to 50MB, my process used about 1GB of RAM.

Expected Behavior

Uploading a file, whatever its size, should consume far less RAM.

Current Behavior

It uses roughly 20x the part size in RAM. For a huge upload that is too much, and it means I cannot run more than one upload in parallel.

Reproduction Steps

See the description above.

Possible Solution

No response

Additional Information/Context

No response

AWS CPP SDK version used

1.11.258

Compiler and Version used

Apple clang version 15.0.0 (clang-1500.1.0.2.5)

Operating System and version

macOS Sonoma 14.3

@thierryba thierryba added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 13, 2024
@jmklix
Member

jmklix commented Feb 13, 2024

Can you provide a minimal code sample that reproduces this? What partSize are you using?

@jmklix jmklix added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Feb 13, 2024
@jmklix jmklix self-assigned this Feb 13, 2024
@thierryba
Contributor Author

thierryba commented Feb 14, 2024

my minimal example

#include <iostream>
#include <fstream>
#include <aws/core/Aws.h>
#include <aws/s3-crt/S3CrtClient.h>
#include <aws/s3-crt/model/PutObjectRequest.h>


int main()
{
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    Aws::S3Crt::ClientConfiguration conf;
    conf.partSize = 50 * 1024 * 1024; // that means using 1GB of RAM...
    Aws::Auth::AWSCredentials creds;
    creds.SetAWSAccessKeyId(Aws::String("your_Access_key"));
    creds.SetAWSSecretKey(Aws::String("your_secret_key"));
    const std::string fileName = "big file so that it takes a bit of time to upload";

    Aws::S3Crt::S3CrtClient client(creds, conf);

    Aws::S3Crt::Model::PutObjectRequest request;

    request.SetBucket("bucket name");
    request.SetKey("my_big_file_on_s3");

    std::shared_ptr<Aws::IOStream> inputData =
        Aws::MakeShared<Aws::FStream>("SampleAllocationTag",
                                      fileName.c_str(),
                                      std::ios_base::in | std::ios_base::binary);

    if (!*inputData) {
        std::cerr << "Error unable to read file " << fileName << std::endl;
        return 1;
    }

    request.SetBody(inputData);

    request.SetDataSentEventHandler([](const Aws::Http::HttpRequest*, long long) {
        std::cout << "callback" << std::endl;
    });

    Aws::S3Crt::Model::PutObjectOutcome outcome = client.PutObject(request);
    if (!outcome.IsSuccess()) {
        std::cerr << "Error: PutObject: " <<
            outcome.GetError().GetMessage() << std::endl;
    } else {
        std::cout << "DONE" << std::endl;
    }


    Aws::ShutdownAPI(options);
}

Note that the callback is also never called, but that is filed as a separate issue...

@DmitriyMusatkin
Contributor

The CRT S3 client automatically splits big uploads into multiple parts and uploads them in parallel. So during an upload, the CRT holds several part-sized buffers in memory, depending on the overall parallelism settings. Depending on how big the file is and how many parts are being uploaded at the same time, 1GB can be a reasonable number.

On top of that, the CRT pools buffers to avoid reallocating them over and over again, so you may see the CRT holding on to a larger chunk of memory than you would expect. Buffer pools are cleared after a period of inactivity.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label Feb 15, 2024
@thierryba
Contributor Author

Well, to be frank, 1GB to upload a file, whatever its size, is a huge price to pay. In restricted cloud environments this is an unreasonable amount of RAM, not to mention that we may run multiple uploads simultaneously.
On top of this, the fact that it is not controllable makes S3CrtClient completely useless for us. I have no idea what the size of the upload will be, and I am not sure how the total memory usage is computed. It seems to be something like 20x the part size... but how do I know for sure? What does it depend on?

@DmitriyMusatkin
Contributor

S3 has fairly low per-connection throughput, so to reach a decent overall throughput the CRT needs to run several connections in parallel and buffer a considerable portion of the data being uploaded. The amount of parallelism used by the CRT can be controlled via the target throughput setting (https://github.com/aws/aws-sdk-cpp/blob/main/generated/src/aws-cpp-sdk-s3-crt/include/aws/s3-crt/ClientConfiguration.h#L58). Unfortunately, that setting already defaults to the lowest possible value in the C++ SDK, and setting it lower will not reduce memory usage.

Note that the overall maximum memory usage of the client has an upper bound derived from the part size and the number of connections (which in turn is derived from the target throughput). So memory usage does not scale directly with the number of S3 requests queued on the client, and once that upper bound is reached, memory usage stays there.
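For reference, the two knobs discussed here both live on the CRT client configuration. A minimal sketch, assuming SDK 1.11.x field names (`partSize` and `throughputTargetGbps` as shown in the linked header); this is a configuration fragment, not a complete program:

```cpp
#include <aws/s3-crt/S3CrtClient.h>

// Sketch: the settings that bound the CRT client's buffer memory.
Aws::S3Crt::ClientConfiguration conf;
conf.partSize = 8 * 1024 * 1024;   // smaller parts -> smaller per-part buffers
conf.throughputTargetGbps = 1.0;   // lower target -> fewer parallel connections
```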

We've made several improvements to the underlying C CRT libraries with regard to memory usage in the past couple of months that haven't made their way into the C++ SDK yet, so I would be interested in learning about your use cases. What kind of instances are you running this code on? What is the overall RAM on the system and the NIC bandwidth? What are the typical file sizes you are trying to upload?

@thierryba
Contributor Author

Hi @DmitriyMusatkin, and thank you for the reply. I was actually wondering whether setting the target throughput to a lower value would help... too bad for me that it won't. I suppose that if the changes to memory usage do not directly affect those buffers, they will not help me much. In essence, we are a SaaS provider and there are times when we need to push data, most likely in files of a few GB, though it can go to tens of GB (there is no actual limit), hence my questions.
The environments this runs in are quite diverse: it could be a SaaS instance on EC2 or on premises.
In any case, we try to be careful with resources, and 1GB just to upload a file is unreasonably high.

That being said, for now we have switched to using TransferManager, which gives better control over memory usage.
TransferManager also lets you track the current upload progress, which S3CrtClient fails to do (its callbacks are never called...).
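The TransferManager approach mentioned above can be sketched as follows. This is a hedged illustration, not the reporter's actual code: the bucket, key, and file names are placeholders, and the buffer settings (`bufferSize`, `transferBufferMaxHeapSize`) are the TransferManagerConfiguration fields that cap memory, set here to illustrative values.

```cpp
#include <iostream>
#include <aws/core/Aws.h>
#include <aws/core/utils/threading/Executor.h>
#include <aws/s3/S3Client.h>
#include <aws/transfer/TransferManager.h>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        auto executor = Aws::MakeShared<Aws::Utils::Threading::PooledThreadExecutor>(
            "executor", 4);
        Aws::Transfer::TransferManagerConfiguration config(executor.get());
        config.s3Client = Aws::MakeShared<Aws::S3::S3Client>("s3client");
        // The memory knobs: per-part buffer size and a hard cap on the
        // total heap used for transfer buffers.
        config.bufferSize = 8 * 1024 * 1024;
        config.transferBufferMaxHeapSize = 64 * 1024 * 1024;
        // Progress reporting, which the reporter found missing from S3CrtClient.
        config.uploadProgressCallback =
            [](const Aws::Transfer::TransferManager*,
               const std::shared_ptr<const Aws::Transfer::TransferHandle>& handle) {
                std::cout << handle->GetBytesTransferred() << " bytes sent\n";
            };

        auto manager = Aws::Transfer::TransferManager::Create(config);
        auto handle = manager->UploadFile("local_file_path", "bucket_name", "key_name",
                                          "application/octet-stream", {});
        handle->WaitUntilFinished();
    }
    Aws::ShutdownAPI(options);
    return 0;
}
```

With this design the total buffer heap is bounded by `transferBufferMaxHeapSize` regardless of file size, which is the controllability the thread asks for.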

@jmklix
Member

jmklix commented Feb 21, 2024

Thanks for bringing your use case to our attention. I'm sorry that S3CrtClient doesn't currently fit your needs. I'm changing this issue to a feature request: adding more options for configuring S3CrtClient. If you have ideas about which settings you would like to configure, please let us know, though I can't guarantee we will be able to implement them.

@jmklix jmklix added feature-request A feature should be added or improved. and removed bug This issue is a bug. labels Feb 21, 2024
@jmklix jmklix removed their assignment Feb 21, 2024