feat(storage:s3): multi-part upload: upload parts concurrently #272
base: main
Conversation
Force-pushed from a78f09a to d8d9f7d.
storage/s3/src/main/java/io/aiven/kafka/tieredstorage/storage/s3/S3MultiPartOutputStream.java: outdated review thread (resolved).
Force-pushed from 6cb6301 to 3db7869.
Some more thoughts
storage/s3/src/main/java/io/aiven/kafka/tieredstorage/storage/s3/S3MultiPartOutputStream.java: outdated review thread (resolved).
Force-pushed from 3db7869 to 0c67b5c.
storage/s3/src/main/java/io/aiven/kafka/tieredstorage/storage/s3/S3MultiPartOutputStream.java: outdated review thread (resolved).
Force-pushed from 9088b6f to fb3f9df.
Force-pushed from fb3f9df to 2254ec1.
...ge/s3/src/test/java/io/aiven/kafka/tieredstorage/storage/s3/S3MultiPartOutputStreamTest.java: outdated review thread (resolved).
Force-pushed from 2254ec1 to a0bcd0e.
Force-pushed from 708cc14 to 004908c.
Force-pushed from 004908c to 8152054.
This approach has a drawback: we can't realistically use big part sizes. We need to target part sizes of around 100 MB. With uncontrolled parallelism plus in-memory byte arrays, this is not going to end well. Mainly, we need to get rid of the in-memory arrays and stream data directly from disk.
Ideally we should have something like this:
- Split the file into parts virtually, i.e. just byte ranges.
- Upload parts in parallel from file-based InputStreams (applying transformations, of course).
- Control parallelism with an explicit, configurable number.
- Recombine the chunk index at the end; this should be simple arithmetic.
- Some mechanism for cancelling all parallel uploads if one of them fails (maybe we don't need this if S3 starts rejecting part uploads promptly enough after the multipart upload has been aborted).
This will lead to changing some interfaces and making it a requirement that the part size is a multiple of the chunk size.
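The proposal above could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `uploadPart` is a hypothetical stand-in for streaming one file range (with transformations applied) into an S3 UploadPart request, and the bounded thread pool models the explicit parallelism setting. Failure of one part fails the whole upload, and `shutdownNow()` gives best-effort cancellation of the remaining parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RangedUploadSketch {
    // A virtual part: bytes [offset, offset + length) within the file.
    record PartRange(int partNumber, long offset, long length) {}

    // Split a file of the given size into part-sized ranges; the last part may be shorter.
    static List<PartRange> splitIntoRanges(long fileSize, long partSize) {
        List<PartRange> ranges = new ArrayList<>();
        int partNumber = 1;
        for (long offset = 0; offset < fileSize; offset += partSize) {
            ranges.add(new PartRange(partNumber++, offset, Math.min(partSize, fileSize - offset)));
        }
        return ranges;
    }

    // Upload all ranges with bounded parallelism; abort the rest if one fails.
    static List<String> uploadAll(List<PartRange> ranges, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (PartRange r : ranges) {
                futures.add(pool.submit(() -> uploadPart(r))); // hypothetical per-part upload
            }
            List<String> etags = new ArrayList<>();
            for (Future<String> f : futures) {
                etags.add(f.get()); // propagates the first failure
            }
            return etags;
        } finally {
            pool.shutdownNow(); // best-effort cancellation of outstanding parts
        }
    }

    // Placeholder: a real implementation would open a file-based InputStream for the
    // range and send it as an S3 UploadPart request, returning the part's ETag.
    static String uploadPart(PartRange r) {
        return "etag-" + r.partNumber();
    }
}
```

Recombining the chunk index afterwards is then just the arithmetic of mapping each part's offset back to chunk positions, which is why the part size being a multiple of the chunk size matters.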
Uses `CompletableFuture` to run upload-part requests concurrently and improve upload performance.

Resolves: #125
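The `CompletableFuture` pattern the description refers to can be sketched as below. This is an assumption-laden illustration, not the PR's code: `uploadPart` is a hypothetical helper, and the fixed-size executor is a stand-in for whatever pool the stream uses. Each buffered part is uploaded in its own future, and `allOf` gates the final CompleteMultipartUpload on every part having finished.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class ConcurrentPartsSketch {
    // Hypothetical per-part upload returning the part's ETag.
    static String uploadPart(int partNumber, byte[] data) {
        return "etag-" + partNumber;
    }

    // Launch one CompletableFuture per buffered part, then wait for all of them
    // before the multipart upload can be completed.
    static List<String> uploadConcurrently(List<byte[]> parts) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<String>> futures = IntStream.range(0, parts.size())
                    .mapToObj(i -> CompletableFuture.supplyAsync(
                            () -> uploadPart(i + 1, parts.get(i)), pool))
                    .toList();
            // allOf completes when every part upload finishes (or one of them fails).
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            pool.shutdown();
        }
    }
}
```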