Proactively upload file parts in PrestoS3FileSystem #22424
base: master
Conversation
Force-pushed from 47754cf to c184cff
presto-hive-common/src/main/java/com/facebook/presto/hive/HiveCommonSessionProperties.java (outdated review thread, resolved)
presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java (outdated review thread, resolved)
presto-hive/src/test/java/com/facebook/presto/hive/s3/TestPrestoS3FileSystem.java (review thread, resolved)
Force-pushed from 84926b8 to 5f50c21
lgtm
Could you measure what the impact would be if we fixed this problem only, without replacing the TransferManager? I'm also curious if you played around with …

private String uploadSingleFile()
{
    TransferManager transfer = TransferManagerBuilder.standard()
Why are we using TransferManager here? If we are at this point, doesn't it imply that the file we want to upload is smaller than minPartSize and therefore doesn't really need a multipart upload, i.e. a simple PutObjectRequest will suffice?
It could also be that the min part size was configured to a very large size. I didn't dig into what optimizations the transfer manager might make for larger files; I just deferred to that implementation since it seems robust enough to handle large file uploads.
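(For context, the simpler alternative the question alludes to would look roughly like this; a minimal sketch against the AWS SDK for Java v1, with bucket, key, and file names that are illustrative and not taken from this PR:)

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.PutObjectRequest;

import java.io.File;

public class SimplePutSketch
{
    public static void main(String[] args)
    {
        // Illustrative names only, not taken from the PR
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // A single PutObject call uploads the whole file in one request,
        // which is enough when the file is below the multipart part-size threshold
        s3.putObject(new PutObjectRequest("my-bucket", "warehouse/orders/part-00000", new File("/tmp/part-00000")));
    }
}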
presto-hive/src/main/java/com/facebook/presto/hive/s3/PrestoS3FileSystem.java (review thread, resolved)
presto-hive/src/test/java/com/facebook/presto/hive/s3/TestPrestoS3FileSystem.java (review thread, resolved)
@aaneja I re-ran the benchmarks for creating the … The main takeaways I see are that …
Force-pushed from 5f50c21 to ec82b5e
LGTM % some minor changes
presto-hive/src/test/java/com/facebook/presto/hive/s3/TestPrestoS3FileSystem.java (outdated review thread, resolved)
presto-hive/src/test/java/com/facebook/presto/hive/s3/TestPrestoS3FileSystem.java (outdated review thread, resolved)
Aside: Your experiment showed that the default …
Force-pushed from ec82b5e to 8905ffa
I did some digging into …

In these few experiments I ran without compression and with NVMe disks to try to achieve the highest data rate possible and stress the new implementation. We found that performance increased, but for other workloads I can't say that we would see gains as large. I could also see a situation in which, on a deployment using traditional spinning-platter disks, concurrent writes would lead to abysmal performance.

I don't think there's a one-size-fits-all setting where we could guarantee better performance, but I think most "standard" use cases ("standard" referring to our standard cluster types used for benchmarking) would probably benefit from a writer count at least equal to the number of threads on the machine. Most deployments nowadays are probably using SSDs and have multiple cores, and NVMe SSDs are capable of a significant number of IOPS, so I think it's probably safe to increase the parallelism a bit since it seems to give a measurable benefit. This does affect table layouts, though (more files), so that's another thing to consider.
I think we need the task writer to be adaptive, scaling out/in based on utilization levels. I'm not sure if Prestissimo does this either; I will check. We can suggest this improvement if it does not.
@tdcmeehan do you want to review this? Or know anyone else who might want to?
Nice! But can this increase S3 operation cost if the buffer size is too small?
@jaystarshot You're correct that it can increase the costs by a fair margin. The minimum part size for S3 is 5MB. I put together a spreadsheet with the cost calculations for uploading 1PB to S3 with a fixed file size of 256MB and varying min part sizes. When using a small part size, the cost for uploading to S3 can be significantly higher than with standard PUT requests.

However, you also need to consider that previously we deferred to the TransferManager class, which decides whether an upload should be a PutObject or a multipart upload, so I'm not 100% sure how often we already do multipart uploads with the existing implementation. Additionally, if you're uploading data at this scale, the monthly storage cost is going to far outstrip the cost of uploading the data in the first place. If you were to store 1PB in S3 and use the minimum part size of 5MB, the one-time upload cost is just 5.13% of the monthly storage cost, decreasing linearly as the part size increases.

The data from the chart is in the following spreadsheet: https://docs.google.com/spreadsheets/d/1DNNpfbahHAiwcrBnhxaGlgexAhraVV9_yu9VUpxez3E/edit?usp=sharing
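(For anyone who wants to sanity-check the shape of that math, here is a rough back-of-the-envelope sketch. The request and storage prices below are illustrative list prices, not the inputs used in the spreadsheet, and it ignores per-upload initiate/complete requests, so the percentage it prints will not exactly match the figures above:)

public class S3UploadCostSketch
{
    public static void main(String[] args)
    {
        // Illustrative prices (roughly S3 Standard list prices); the linked spreadsheet uses its own inputs
        double putCostPerThousandRequests = 0.005; // USD per 1,000 PUT requests
        double storagePerGibMonth = 0.023;         // USD per GiB-month

        double totalGib = 1024 * 1024;             // roughly 1PB of data
        double partSizeMib = 5;                    // S3 minimum multipart part size

        // Each part is one PUT request; initiate/complete requests per file are ignored here
        double parts = Math.ceil(totalGib * 1024 / partSizeMib);
        double uploadCost = parts / 1000 * putCostPerThousandRequests;
        double monthlyStorageCost = totalGib * storagePerGibMonth;

        System.out.printf("parts: %.0f, one-time upload: $%.0f, storage per month: $%.0f (%.1f%%)%n",
                parts, uploadCost, monthlyStorageCost, 100 * uploadCost / monthlyStorageCost);
    }
}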
Sounds good, but for temporary tables it might make more sense to have a bigger buffer size, which is configurable anyway, I guess.
Force-pushed from 8905ffa to 6ca7ffa
Description
This change forgoes the original method of using the AWS SDK TransferManager and instead uses the low-level AWS client API in order to support file uploads to S3 that are larger than the machine's disk size.

Motivation and Context
Originally, in order to write a file to S3, the FS implementation would write the entire file to temporary storage on the local disk, then upload the file via the SDK TransferManager API. This meant that the machine was required to have disk space available up to the size of the file being written.
For creating large tables through CTAS, you can easily have a file that exceeds the disk size, or you may not provision large disks. This change uploads the file in chunks, similar to a BufferedOutputStream, but uploads each part to S3 through a multipart upload and deletes the data on the local disk once each part is uploaded. This bounds the disk space required (see the sketch below).
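(A rough sketch of that chunked multipart flow against the AWS SDK for Java v1. This is not the PR's actual code: the class, bucket, key, and temp-file names are illustrative, error handling such as aborting the upload on failure is omitted, and the PR wires this logic into its output stream rather than a standalone loop.)

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
import com.amazonaws.services.s3.model.PartETag;
import com.amazonaws.services.s3.model.UploadPartRequest;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class MultipartUploadSketch
{
    public static void main(String[] args)
    {
        String bucket = "my-bucket";
        String key = "warehouse/orders/part-00000";
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // 1. Start the multipart upload and remember its id
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

        List<PartETag> etags = new ArrayList<>();
        int partNumber = 1;

        // 2. Each time the local buffer reaches the part size, upload it as one part
        //    and delete the temporary file, so disk usage stays bounded by the part size
        for (File partFile : new File[] {new File("/tmp/buffer-0"), new File("/tmp/buffer-1")}) {
            UploadPartRequest request = new UploadPartRequest()
                    .withBucketName(bucket)
                    .withKey(key)
                    .withUploadId(uploadId)
                    .withPartNumber(partNumber++)
                    .withFile(partFile)
                    .withPartSize(partFile.length());
            etags.add(s3.uploadPart(request).getPartETag());
            partFile.delete();
        }

        // 3. Stitch the parts together once the stream is closed
        s3.completeMultipartUpload(
                new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
    }
}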
Impact

No user-facing API impact. Users might see better performance when writing through CTAS.
Test Plan
PrestoS3FileSystem
I also ran a few benchmarks to determine how the minimum part upload size affects performance. I used TPC-H SF 100 data and wrote it to a table using CREATE TABLE benchmark AS SELECT * FROM orders on a 2-node cluster of r5.xlarge instances (4 vCPUs, 32G memory). The data was read from a pre-existing table in S3. The resulting files were about 6.8GiB in size with a task.writer-count of 1.

The additional performance gain is due to the fact that the original implementation relied on extending FilterOutputStream. In the default implementation of FilterOutputStream, all writes end up occurring through the write(byte b) interface (see the implementation here), meaning we were writing a single byte at a time in a tight loop. This change allows multi-byte copies by properly overriding each variant of the write() method, as sketched below.
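(To make that concrete, here is a minimal sketch, not the PR's actual code, of the kind of override that avoids FilterOutputStream's byte-at-a-time default; the class name is illustrative and the buffering/upload logic is elided:)

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative only: the PR applies this idea in PrestoS3OutputStream
class ForwardingS3OutputStream
        extends FilterOutputStream
{
    ForwardingS3OutputStream(OutputStream out)
    {
        super(out);
    }

    @Override
    public void write(int b)
            throws IOException
    {
        out.write(b);
    }

    @Override
    public void write(byte[] buffer, int offset, int length)
            throws IOException
    {
        // Without this override, FilterOutputStream.write(byte[], int, int)
        // falls back to a loop of single-byte write(int) calls
        out.write(buffer, offset, length);
    }
}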
The s3-write-override branch's only change was overriding the write methods of PrestoS3OutputStream. This shows the performance difference between this implementation and one where we had only fixed that.

And for clarification on the master and s3-write-override minimum part size of 5MB: this part size is not necessarily used in the upload. The TransferManager API is supposed to calculate an optimal part size based on the file size, so the part size actually used is likely to be different from 5MB.

Contributor checklist
Release Notes