Reading entire file into memory vs using lock to share file among threads #52

Open
millerdev opened this issue Sep 13, 2016 · 1 comment


@millerdev

I recently upgraded to boto3 v1.4.0 from v1.2.3. The new version now uses s3transfer to support uploads, and boto3 Bucket objects now provide a method upload_fileobj for uploading files, which is very nice. The upload_fileobj method delegates to s3transfer. However, it appears that s3transfer reads the entire file into memory when chunking very large files, which has the potential to exhaust RAM. Would it make sense to use a lock to allow multiple threads to access the fileobj instead?
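For reference, a minimal example of the new API (the bucket name and file path below are placeholders):

```python
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-bucket")  # placeholder bucket name

# upload_fileobj streams the file through s3transfer, which handles
# multipart uploads and threading for large files.
with open("/path/to/large-file.bin", "rb") as f:
    bucket.upload_fileobj(f, "large-file.bin")
```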

Feel free to reuse the code from that final link. I wrote it and would be happy to have it used here.
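For illustration, here is a minimal sketch of the locking idea with hypothetical names (it is not the exact code from that link): a single lock serializes seek + read so that multiple upload threads can share one file object without buffering every part up front.

```python
import threading


class LockedFileReader:
    """Share one file object among threads by locking around seek + read.

    Each worker reads only the byte range for its part, so no part has
    to be fully buffered before an upload thread asks for it.
    """

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self._lock = threading.Lock()

    def read_part(self, offset, size):
        # Only the seek + read is serialized; the (slow) network upload
        # of the returned bytes can still happen concurrently.
        with self._lock:
            self._fileobj.seek(offset)
            return self._fileobj.read(size)
```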

@kyleknap
Member

Hi @millerdev

So the one reason we did not go with the locking strategy is that we were concerned it would slow down transfers significantly. In most environments, the bulk of the time is spent reading/uploading the body of the file, so putting a lock around the reads would make the slowest part of the transfer sequential.

One thing to note about our current approach: while each part of the file will be in memory at some point, the entire file is never in memory at the same time (assuming the file is very large). To ensure we do not exhaust RAM when reading the data into memory, we have logic that limits how much of the file can be in memory at any point during the transfer.

It is essentially a producer-consumer design: producers place chunks of the file onto a bounded queue, consumers pull chunks off the queue and upload them, and producers cannot enqueue more data until the workers make room by completing uploads. While there is some memory overhead with this approach, it gives us multithreaded uploads with a bounded maximum memory usage. This is not ideal (ideal would be minimal memory overhead), but it is the best we have come up with so far. The biggest problem is figuring out how to share a file-like object without locking; if we could do that, it would most likely give the best performance in terms of both speed and memory. So we are open to suggestions there.
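For illustration, here is a rough sketch of that producer-consumer pattern with hypothetical names (upload_part stands in for whatever actually sends a part); it is not the actual s3transfer implementation. The bounded queue is what caps memory: put() blocks once the queue is full, so the producer cannot read ahead until a consumer finishes an upload and frees a slot.

```python
import queue
import threading

CHUNK_SIZE = 8 * 1024 * 1024   # 8 MB parts (example value)
MAX_QUEUED_CHUNKS = 10         # bounds how many chunks wait in memory


def producer(fileobj, chunk_queue):
    # Read chunks sequentially; put() blocks when the queue is full,
    # so the producer cannot race ahead of the upload workers.
    part_number = 1
    while True:
        data = fileobj.read(CHUNK_SIZE)
        if not data:
            break
        chunk_queue.put((part_number, data))
        part_number += 1
    chunk_queue.put(None)  # sentinel: no more parts


def consumer(chunk_queue, upload_part):
    # Pull chunks off the queue and upload them; finishing an upload
    # frees a slot, which lets the producer read the next chunk.
    while True:
        item = chunk_queue.get()
        if item is None:
            chunk_queue.put(None)  # let the other consumers exit too
            return
        part_number, data = item
        upload_part(part_number, data)


def upload(fileobj, upload_part, num_threads=4):
    chunk_queue = queue.Queue(maxsize=MAX_QUEUED_CHUNKS)
    workers = [
        threading.Thread(target=consumer, args=(chunk_queue, upload_part))
        for _ in range(num_threads)
    ]
    for w in workers:
        w.start()
    producer(fileobj, chunk_queue)
    for w in workers:
        w.join()
```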

To be fair, we have not actually tried the locking implementation you are proposing. Fortunately, we wrote performance scripts to profile the differences between approaches and between changes we make to this library. I would be quite interested to see how your implementation compares to the one currently in s3transfer, so feel free to run the benchmark and benchmark-upload scripts if you are interested in the results as well. Otherwise, we will get around to benchmarking your implementation and see if it can improve the library.
