Reading entire file into memory vs using lock to share file among threads #52

Open
millerdev opened this issue Sep 13, 2016 · 1 comment


@millerdev

I recently upgraded to boto3 v1.4.0 from v1.2.3. The new version now uses s3transfer to support uploads, and boto3 Bucket objects now provide a method upload_fileobj for uploading files, which is very nice. The upload_fileobj method delegates to s3transfer. However, it appears that s3transfer reads the entire file into memory when chunking very large files, which has the potential to exhaust RAM. Would it make sense to use a lock to allow multiple threads to access the fileobj instead?
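For reference, a minimal example of the new API (the bucket name and file path below are placeholders):

```python
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")  # placeholder bucket name

# upload_fileobj streams the file through s3transfer, which handles
# multipart uploads and threading for large files.
with open("/path/to/large-file.bin", "rb") as f:
    bucket.upload_fileobj(f, "large-file.bin")
```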

Feel free to reuse the code from that final link. I wrote it and would be happy to have it used here.
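For illustration, here is a minimal sketch of the locking idea with hypothetical names (it is not the exact code from that link): a single lock serializes seek + read so that multiple upload threads can share one file object without buffering every part up front.

```python
import threading


class LockedFileReader:
    """Share one file object among threads by locking around seek + read.

    Each worker reads only the byte range for its part, so no part has
    to be fully buffered before an upload thread asks for it.
    """

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self._lock = threading.Lock()

    def read_part(self, offset, size):
        # Only the seek + read is serialized; the (slow) network upload
        # of the returned bytes can still happen concurrently.
        with self._lock:
            self._fileobj.seek(offset)
            return self._fileobj.read(size)
```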

@kyleknap
Member

Hi @millerdev

So the one reason we did not go with the locking strategy is that we were concerned it would slow down transfers significantly. In most environments, the bulk of the time is spent reading/uploading the body of the file, so putting a lock around the reads would make the slowest part of the transfer sequential.

One thing to note about our current approach: while each part of the file will be in memory at some point, the entire file is never in memory at the same time (assuming the file is very large). To ensure we do not exhaust RAM when reading the data into memory, we have logic that limits how much of the file can be in memory at any point during the transfer.

It is essentially a producer-consumer design: producers place chunks of the file onto a bounded queue, consumers pull chunks off the queue and upload them, and producers cannot enqueue more data until the workers make room by completing uploads. While there is some memory overhead with this approach, it gives us multithreaded uploads with a bounded maximum memory usage. This is not ideal (ideal would be minimal memory overhead), but it is the best we have come up with so far. The biggest problem is figuring out how to share a file-like object without locking; if we could do that, it would most likely give the best performance in terms of both speed and memory. So we are open to suggestions there.
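For illustration, here is a rough sketch of that producer-consumer pattern with hypothetical names (upload_part stands in for whatever actually sends a part); it is not the actual s3transfer implementation. The bounded queue is what caps memory: put() blocks once the queue is full, so the producer cannot read ahead until a consumer finishes an upload and frees a slot.

```python
import queue
import threading

CHUNK_SIZE = 8 * 1024 * 1024   # 8 MB parts (example value)
MAX_QUEUED_CHUNKS = 10         # bounds how many chunks wait in memory


def producer(fileobj, chunk_queue):
    # Read chunks sequentially; put() blocks when the queue is full,
    # so the producer cannot race ahead of the upload workers.
    part_number = 1
    while True:
        data = fileobj.read(CHUNK_SIZE)
        if not data:
            break
        chunk_queue.put((part_number, data))
        part_number += 1
    chunk_queue.put(None)  # sentinel: no more parts


def consumer(chunk_queue, upload_part):
    # Pull chunks off the queue and upload them; finishing an upload
    # frees a slot, which lets the producer read the next chunk.
    while True:
        item = chunk_queue.get()
        if item is None:
            chunk_queue.put(None)  # let the other consumers exit too
            return
        part_number, data = item
        upload_part(part_number, data)


def upload(fileobj, upload_part, num_threads=4):
    chunk_queue = queue.Queue(maxsize=MAX_QUEUED_CHUNKS)
    workers = [
        threading.Thread(target=consumer, args=(chunk_queue, upload_part))
        for _ in range(num_threads)
    ]
    for w in workers:
        w.start()
    producer(fileobj, chunk_queue)
    for w in workers:
        w.join()
```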

To be fair, we have not actually tried the locking implementation you are proposing. Fortunately, we wrote performance scripts to profile the differences between approaches and between changes we make to this library. I would be quite interested to see how your implementation compares to the one currently in s3transfer, so feel free to run the benchmark and benchmark-upload scripts if you are interested in the results as well. Otherwise, we will get around to benchmarking your implementation and see if it can improve the library.
