GetObject doesn't respect the partNumber Request URI parameter #56

Open
amwolff opened this issue Feb 11, 2022 · 6 comments
Labels
blocked bug Something isn't working edge

Comments

@amwolff
Member

amwolff commented Feb 11, 2022

GetObjectNInfo needs adjustments to handle the partNumber parameter. With this, we will fully support the GetObject action.

Reference: https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html#API_GetObject_RequestSyntax

@amwolff amwolff added the bug Something isn't working label Feb 11, 2022
@wthorp
Contributor

wthorp commented Feb 11, 2022

At the core we'd have to extend ObjectDownloadRequest / Endpoint.DownloadObject() to be capable of returning from only a specific part. This would guarantee that order limits are set correctly, etc.

It does seem that part numbers are currently being persisted. The SegmentPosition struct is serialized as uint64(pos.Part)<<32 | uint64(pos.Index) and saved in metabase DB's [segments].[position].
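The packing scheme described above can be sketched as follows (a self-contained illustration of that serialization, not the actual metabase code):

```go
package main

import "fmt"

// SegmentPosition mirrors how part and index are packed into the single
// uint64 stored in the metabase's [segments].[position] column, per the
// expression quoted above.
type SegmentPosition struct {
	Part  uint32
	Index uint32
}

// Encode packs the part number into the high 32 bits and the segment
// index into the low 32 bits.
func (pos SegmentPosition) Encode() uint64 {
	return uint64(pos.Part)<<32 | uint64(pos.Index)
}

// DecodeSegmentPosition reverses Encode.
func DecodeSegmentPosition(v uint64) SegmentPosition {
	return SegmentPosition{
		Part:  uint32(v >> 32),
		Index: uint32(v), // low 32 bits
	}
}

func main() {
	pos := SegmentPosition{Part: 2, Index: 7}
	encoded := pos.Encode()
	fmt.Println(encoded)                               // (2<<32)|7 = 8589934599
	fmt.Println(DecodeSegmentPosition(encoded) == pos) // true
}
```

Because the part number occupies the high bits, segments sort by part first and index second, which is what makes per-part range lookups feasible.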

@amwolff
Member Author

amwolff commented Feb 12, 2022

Here are my findings from a quick test with the "real" S3:

  1. The partNumber parameter can be specified even for non-multipart-uploaded objects, but then x-amz-mp-parts-count is not present in the response. This scheme aligns well with the limits imposed by AWS S3, i.e., the maximum part size is 5 GB, and the largest object that can be uploaded in a single PUT is 5 GB (so such objects will only have one part);
  2. If partNumber is specified for multipart-uploaded objects, then x-amz-mp-parts-count is present. I think this means that the client can effectively download an object aligned with how it was uploaded, part by part.

In short, for each GetObject request with this mighty parameter, we will need to gather the appropriate range for the specified part and the number of available parts, if any.
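Gathering that range boils down to simple arithmetic once the per-part sizes are known. A sketch, where `partSizes` standing in for sizes recovered from segment metadata is an assumption for illustration:

```go
package main

import "fmt"

// partRange returns the byte range of the requested part and the total
// parts count: the two pieces of information a GetObject handler needs
// to honor partNumber and emit x-amz-mp-parts-count.
func partRange(partSizes []int64, partNumber int) (offset, length int64, partsCount int, ok bool) {
	if partNumber < 1 || partNumber > len(partSizes) {
		return 0, 0, 0, false
	}
	// The part's offset is the sum of all preceding part sizes.
	for _, size := range partSizes[:partNumber-1] {
		offset += size
	}
	return offset, partSizes[partNumber-1], len(partSizes), true
}

func main() {
	sizes := []int64{64 << 20, 64 << 20, 10 << 20} // e.g. 64 MiB, 64 MiB, 10 MiB parts
	off, length, count, ok := partRange(sizes, 2)
	fmt.Println(off, length, count, ok) // 67108864 67108864 3 true
}
```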

@mniewrzal
Contributor

Just an initial question: is this our own finding, or is it a real use case we've hit with one of the external test suites? I'm asking to understand the priority.

Regarding the solution, @wthorp is right: we would need to extend metainfo.DownloadObject to return the exact range for the specified part. That is the relatively easy part; usually, we have more trouble defining how to expose this in the uplink API. There, I think we have 3 options:

  • expose it only for gateway-mt as private code
  • extend DownloadObjectOptions to include partNumber
  • add new method like project.DownloadPart(ctx, bucket, key, partNumber, options)

If we need this fix sooner rather than later, we can also initially fix it only on the uplink side. The DownloadObject response at the moment should contain all the necessary information to figure out the part's range within the object, so we can use this range to download the part without changing the server side. The main issue is that we would allocate bandwidth for the full object, but that would be only a temporary solution.
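Option 3 above could look roughly like the following sketch. The `project` type, its `partSizes` field, and the `DownloadPart` signature are all hypothetical stand-ins (the real uplink.Project would issue a ranged download rather than just return the range):

```go
package main

import "fmt"

// project is a stand-in for uplink.Project. partSizes stands in for the
// part-boundary information that, per the discussion above, should be
// recoverable from the DownloadObject response metadata.
type project struct {
	partSizes []int64
}

// DownloadPart sketches the proposed new method: it translates partNumber
// into the byte range that a ranged object download would then use. This
// is essentially what the temporary client-side fix would do as well.
func (p *project) DownloadPart(partNumber int) (offset, length int64, err error) {
	if partNumber < 1 || partNumber > len(p.partSizes) {
		return 0, 0, fmt.Errorf("part %d out of range", partNumber)
	}
	for _, s := range p.partSizes[:partNumber-1] {
		offset += s
	}
	return offset, p.partSizes[partNumber-1], nil
}

func main() {
	p := &project{partSizes: []int64{5 << 20, 5 << 20, 1 << 20}}
	off, length, _ := p.DownloadPart(3)
	fmt.Println(off, length) // 10485760 1048576
}
```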

If we agree that this is a high priority, we can think about scheduling time for it. As a task fully for the Metainfo team, I don't think we will have the capacity to do it next sprint, but if it were a cross-team effort, I'm sure we can figure something out.

@amwolff
Member Author

amwolff commented Feb 14, 2022

Thanks so much for the thorough response! Our test suite and the external test suites we test against don't cover this case; this is our own finding at the moment. I don't think it's very high priority (certainly good to have, though); this probably needs confirmation, but I believe it could even be scheduled for Q2 (for our compatibility-improvement milestone).

@shaupt131

Moved to backlog until the S3 compatibility improvement milestone work is underway.

@wthorp
Contributor

wthorp commented Jun 3, 2022

I had thought that this feature (beyond being part of the standard) would result in more optimized parallel downloads.

Most parallel download clients will ask for some arbitrarily sized chunk. If these chunk sizes roughly but not exactly align with segment sizes, we'd effectively require two requests to download each segment. The story is worse if downloaders grab small chunks: it could take arbitrarily many metainfo requests to download a segment.
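The misalignment cost can be made concrete with a little arithmetic. A sketch (the 64 MiB segment size is an assumed example, not a statement about any particular configuration):

```go
package main

import "fmt"

// segmentRequests counts how many segments a client-chosen chunk touches.
// Chunks that don't align with segment boundaries straddle two (or more)
// segments, each requiring its own fetch.
func segmentRequests(chunkOffset, chunkSize, segmentSize int64) int64 {
	first := chunkOffset / segmentSize
	last := (chunkOffset + chunkSize - 1) / segmentSize
	return last - first + 1
}

func main() {
	const segment = 64 << 20 // assumed 64 MiB segments, for illustration
	// A 64 MiB chunk starting 1 MiB into the object straddles two segments.
	fmt.Println(segmentRequests(1<<20, 64<<20, segment)) // 2
	// A perfectly aligned chunk touches exactly one segment.
	fmt.Println(segmentRequests(64<<20, 64<<20, segment)) // 1
}
```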

Seeing as how gaining efficiency here would require changing how clients work (which frankly won't happen), this now seems low priority. I can see a cache of some sort being the more effective solution to this problem.

@amwolff amwolff added the edge label Dec 15, 2022