Right now, we use `ObjectStore::put_multipart()`, which writes to object stores in fixed 10MB chunks. Object stores limit a multipart upload to 10,000 parts, so this means we have a 100GB object size limit. For some use cases this is too small.
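For reference, the limit above is just part size times part count. A minimal sketch, assuming decimal megabytes (1 MB = 1,000,000 bytes); `MAX_PARTS` and `max_object_size` are hypothetical names used only for illustration, not part of `object_store`:

```rust
/// Maximum number of parts in one multipart upload (the 10,000 figure above).
const MAX_PARTS: u64 = 10_000;

/// Largest object, in bytes, that can be written with a fixed part size.
/// Hypothetical helper for illustration only.
fn max_object_size(part_size_bytes: u64) -> u64 {
    part_size_bytes * MAX_PARTS
}
```

With a 10MB part size this gives 100,000,000,000 bytes, i.e. the 100GB limit.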
There are two ways we could solve this:
1. Make the part size configurable. This would work, but it would require manual configuration by the user, which is less than ideal.
2. Make the part size variable, increasing linearly. This is the algorithm PyArrow used to use, and it would allow much larger objects with no special configuration. The only downside is that R2 would not work with this, as it requires exactly equal part sizes.
IMO the best solution would be 2, but with a carve-out for R2 (which will use fixed-size uploads). We already want to have carve-outs for R2 and Minio for the commit mechanism (see #2246).
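A rough sketch of what option 2 with a fixed-size carve-out could look like. Everything here is an assumption for illustration: the `PartSizing` type, its method names, and the `initial`/`step` values are hypothetical, and this is not necessarily the exact schedule PyArrow used.

```rust
/// How part sizes are chosen for a multipart upload (hypothetical type).
#[derive(Debug, Clone, Copy)]
enum PartSizing {
    /// Every part has the same size (the carve-out for stores like R2).
    Fixed { part_size: u64 },
    /// Part `i` (0-based) has size `initial + i * step`.
    Linear { initial: u64, step: u64 },
}

impl PartSizing {
    /// Size in bytes of the part at `part_index` (0-based).
    fn part_size(&self, part_index: u64) -> u64 {
        match *self {
            PartSizing::Fixed { part_size } => part_size,
            PartSizing::Linear { initial, step } => initial + part_index * step,
        }
    }

    /// Total bytes that fit in `max_parts` parts under this scheme.
    fn max_object_size(&self, max_parts: u64) -> u64 {
        (0..max_parts).map(|i| self.part_size(i)).sum()
    }
}
```

For example, starting at 10MB and growing by an assumed 100KB per part, the last of 10,000 parts is about 1GB and the total capacity is roughly 5.1TB, compared with 100GB for fixed 10MB parts. A real implementation would also need to respect each store's per-part size limits, which this sketch does not check.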
In order to implement this, we'll need to use the low-level `MultiPartStore` API. We'll need to upgrade to a similar API anyway in `object_store` 0.10.0, since the return value for `put_multipart` has changed.[^1]
Possible follow-up: when we switch APIs, we'll have more control over retries when uploading a part. Maybe we should retry `Connection reset by peer` errors with increased backoff?
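One possible shape for that retry logic, as a sketch only. It is synchronous and uses `std::io::Error` with `ErrorKind::ConnectionReset` as a stand-in for however the real client surfaces "Connection reset by peer"; `backoff_delay` and `upload_part_with_retry` are hypothetical names, not `object_store` APIs.

```rust
use std::io::{Error, ErrorKind};
use std::thread::sleep;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): doubles each time, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Calls `upload` until it succeeds, retrying only connection resets,
/// at most `max_retries` times. Any other error is returned immediately.
fn upload_part_with_retry<F>(
    mut upload: F,
    max_retries: u32,
    base: Duration,
    max: Duration,
) -> Result<(), Error>
where
    F: FnMut() -> Result<(), Error>,
{
    let mut attempt = 0;
    loop {
        match upload() {
            Ok(()) => return Ok(()),
            Err(e) if e.kind() == ErrorKind::ConnectionReset && attempt < max_retries => {
                // Wait longer after each consecutive failure.
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

In the real code path the upload would be async and the error type would differ, so the matching on the error and the sleep would need to be adapted accordingly.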
This is an interim fix for just GCS to solve #2247.
Because of the challenges of casting between `&dyn ObjectStore` and
`&dyn MultiPartStore`, it's not easy in the current version of
`object_store` to implement this generically over all stores. However,
in the next version of `object_store` (0.10.0), there is a new API for
`put_multipart()` that will make it easy to extend this implementation
to all stores.
[^1]: https://docs.rs/object_store/latest/object_store/trait.MultipartUpload.html