Lift the max object size for non-R2 stores #2247

Open
wjones127 opened this issue Apr 23, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@wjones127
Contributor

Right now, we use `ObjectStore::put_multipart()`, which writes to object stores in fixed 10MB chunks. Object stores limit a multipart upload to 10,000 parts, so this gives us a 100GB object size limit. For some use cases this is too small.
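
To make the arithmetic concrete, here is a small sketch of the limit. It assumes decimal units (1MB = 10^6 bytes); the function names are made up for illustration and are not part of `object_store`.

```rust
const MB: u64 = 1_000_000;
const GB: u64 = 1_000 * MB;

/// Largest object we can write when every part has the same size.
fn max_object_size(part_size: u64, max_parts: u64) -> u64 {
    part_size * max_parts
}

/// Smallest fixed part size that fits `object_size` into `max_parts` parts
/// (rounded up, so the last part may be partially filled).
fn required_part_size(object_size: u64, max_parts: u64) -> u64 {
    object_size.div_ceil(max_parts)
}
```

With 10MB parts and 10,000 parts this gives 100GB. Going the other way, a 1TB object would need fixed parts of at least 100MB.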

There are two ways we could solve this:

  1. Make the part size configurable. This would work, but it would require the user to configure it manually, which is less than ideal.
  2. Make the part size variable, increasing linearly as the upload proceeds. This is the algorithm PyArrow used to use, and it would allow much larger objects with no special configuration. The only downside is that R2 would not work with this, as it requires exactly equal part sizes.
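
As a rough sketch of option 2, the schedule below grows the part size by a fixed step every so many parts. The concrete numbers (5MiB initial size, 5MiB step, a bump every 100 parts) are assumptions picked for illustration, not necessarily what PyArrow used or what we would ship.

```rust
const MIB: u64 = 1024 * 1024;
const MAX_PARTS: u64 = 10_000;

// Hypothetical schedule parameters.
const INITIAL_PART_SIZE: u64 = 5 * MIB;
const PART_SIZE_STEP: u64 = 5 * MIB;
const PARTS_PER_STEP: u64 = 100;

/// Size of the part at `part_index` (0-based). The size stays constant
/// within each block of `PARTS_PER_STEP` parts, then grows by one step.
fn part_size(part_index: u64) -> u64 {
    INITIAL_PART_SIZE + PART_SIZE_STEP * (part_index / PARTS_PER_STEP)
}

/// Total bytes that fit in `MAX_PARTS` parts under this schedule.
fn max_object_size() -> u64 {
    (0..MAX_PARTS).map(part_size).sum()
}
```

Under these assumed parameters the last part is 500MiB and the total comes to 2,525,000MiB (about 2.4TiB), while small uploads still start with 5MiB parts.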

IMO the best solution would be 2, but with a carve-out for R2 (which will use fixed-size uploads). We already want to have carve-outs for R2 and Minio for the commit mechanism (see #2246).
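
One way the carve-out could look, as a sketch only: pick a sizing strategy per store and keep R2 on equal-sized parts. `StoreKind` and `PartSizing` are hypothetical types, and the sizes are placeholder values; how we would actually detect R2 is left open here.

```rust
const MIB: u64 = 1024 * 1024;

/// Hypothetical tag for the backing store.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StoreKind {
    S3,
    Gcs,
    R2,
}

/// Hypothetical part sizing strategy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartSizing {
    /// Every part has the same size (needed for R2).
    Fixed { part_size: u64 },
    /// Part size grows by `step` every `parts_per_step` parts.
    Linear { initial: u64, step: u64, parts_per_step: u64 },
}

impl PartSizing {
    fn for_store(kind: StoreKind) -> Self {
        match kind {
            StoreKind::R2 => PartSizing::Fixed { part_size: 10 * MIB },
            _ => PartSizing::Linear { initial: 5 * MIB, step: 5 * MIB, parts_per_step: 100 },
        }
    }

    /// Size of the part at `part_index` (0-based).
    fn part_size(&self, part_index: u64) -> u64 {
        match *self {
            PartSizing::Fixed { part_size } => part_size,
            PartSizing::Linear { initial, step, parts_per_step } => {
                initial + step * (part_index / parts_per_step)
            }
        }
    }
}
```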

To implement this, we'll need to use the low-level `MultiPartStore` API. We'll need to move to a similar API anyway in `object_store` 0.10.0, since the return value of `put_multipart()` has changed. [1]

Footnotes

  1. https://docs.rs/object_store/latest/object_store/trait.MultipartUpload.html

wjones127 added the enhancement label Apr 23, 2024
@wjones127
Contributor Author

Possible follow-up: when we switch APIs, we'll have more control over retries for uploading a part. Maybe we should retry the `Connection reset by peer` error with increased backoff?
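
A minimal sketch of what that retry could look like. It is synchronous and uses `std::io::Error` so it stands alone; the real upload path is async and has its own error type, so treat `retry_on_connection_reset` and its parameters as hypothetical.

```rust
use std::io::{Error, ErrorKind};
use std::thread::sleep;
use std::time::Duration;

/// Runs `op`, retrying only on `ConnectionReset` and doubling the delay
/// after each failed attempt. Any other error, or running out of attempts,
/// returns the last result as-is.
fn retry_on_connection_reset<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, Error>,
) -> Result<T, Error> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Err(e) if e.kind() == ErrorKind::ConnectionReset && attempt < max_attempts => {
                sleep(delay);
                delay = delay.saturating_mul(2); // exponential backoff
                attempt += 1;
            }
            other => return other,
        }
    }
}
```

Only the part that failed would need to be re-sent, which is where per-part control over the upload helps.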

wjones127 added a commit that referenced this issue Apr 26, 2024
This is an interim fix for just GCS to solve
#2247

Because of the challenges of casting between `&dyn ObjectStore` and
`&dyn MultiPartStore`, it's not easy in the current version of
`object_store` to implement this generically over all stores. However,
in the next version of `object_store` (0.10.0), there is a new API for
`put_multipart()` that will make it easy to extend this implementation
to all stores.