Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ETag headers #451

Closed
daniellehanks opened this issue May 25, 2021 · 25 comments · Fixed by #489
Closed

Support ETag headers #451

daniellehanks opened this issue May 25, 2021 · 25 comments · Fixed by #489
Assignees
Labels
api: storage Issues related to the googleapis/python-storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@daniellehanks
Copy link
Contributor

Split from this issue.

Requesting support for If-Match/If-None-Match headers for GCS via the Python client library.

My use case:
My API is a thin layer over GCS (for storing profile images), and I only want to actually read the image data and serve it to the client if the actual content is different from their cache. They will specify an If-None-Match header to my API (as they won't know anything about GCS or generation numbers, kind of the point), so I just need to essentially pass this along to GCS. I don't want to either do the extra round tripping of reading the current file to check the ETag and then use the generation number from that in a subsequent request, or not use the client library.

@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/python-storage API. label May 25, 2021
@tseaver tseaver added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label May 25, 2021
@daniellehanks
Copy link
Contributor Author

My team is about at the point where we are needing this. We would love to consider volunteering a contribution if we could get a design/owner partner on the Google side. Would that be something you all would consider? TIA @tseaver

@tseaver
Copy link
Contributor

tseaver commented Jun 7, 2021

@andrewsg, @frankyn, @cojenco PTAL.

@tritone
Copy link
Contributor

tritone commented Jun 7, 2021

Thanks for the proposal @daniellehanks (and sorry for the slow response). We don't support these via most of the other GCS client libraries and I am following up with the service team to determine if there's a good reason for that. Will update you soon when I have more information.

@tseaver
Copy link
Contributor

tseaver commented Jun 25, 2021

@daniellehanks Can you clarify the usecase, here? Do you want to pass an etag (a list, for If-None-Match, technically) to Blob.reload? Or Blob.exists? Or do you want to pass it to e.g. Blob.download_to_file?

What would you expect to happen in the case that there was no different ETag? Blob.exists could return False, while we'd be more likely to raise an error for the 304 Not Modified response for Blob.reload or Blob.download_*.

@daniellehanks
Copy link
Contributor Author

Can it just behave exactly like the meta/generation parameters? Those raise google.api_core.exceptions.NotModified and google.api_core.exceptions.PreconditionFailed, which are exactly what I would expect.

As for my specific use case, it would likely look something like:

from google.api_core import exceptions
# Boilerplate instantiation of client/bucket omitted.
# Constructing filename and etag omitted, but they come from my client's request to my backend.
...
blob = _bucket.blob(filename)
try:
    # We may use one of the other download methods.
    data = blob.download_as_bytes(if_etag_not_match=etag)
except exceptions.NotFound:
    # Return 404 response to my client.
except exceptions.NotModified:
    # Return 304 response to my client.
except exceptions.GoogleAPIError:
    # Return 500 response to my client.
    
# Return 200 with data to my client with blob.etag as etag. Though it looks like I have to get the etag as a separate call to `reload`? Trying to do this in a single call.

Without the etag functionality:

blob = _bucket.blob(filename)
try:
    blob.reload()
    if blob.etag == etag:
        # Return 304 response to my client.
    # Knowing the etag matches, we can use the generation from the same metadata.
    # We may use one of the other download methods.
    data = blob.download_as_bytes(if_generation_not_match=blob.generation)
except exceptions.NotFound:
    # Return 404 response to my client.
except exceptions.NotModified:
    # This shouldn't happen since we checked the etag earlier.
    # Return 304 response to my client.
except exceptions.GoogleAPIError:
    # Return 500 response to my client.
    
# Return 200 with data to my client with blob.etag as etag.

As I went through the exercise of writing this prototype, it looks like calling reload() is the only way to get the metadata of the current object (e.g. etag)? If that is the case, then both solutions above have to call reload() at some point, though the first one with the proposed etag feature would only have to call it if we didn't get a 304. Though the second one doesn't have to call download if the etag matches. Sounds like a wash if I'm understanding everything correctly and assuming fetching metadata and fetching file contents, even if 304 or 412, are all class B operations.

@tritone
Copy link
Contributor

tritone commented Jun 29, 2021

Thanks for the example, and apologies for the slow response.

As you note from your example, this is only going to save you network calls if you are able to store the ETag client-side and will have it on hand when you attempt to download the file. Will this be possible for your use case? Otherwise, I don't see how this will be useful at all -- obviously the ETag (and generation number) you get from blob.reload will still be current when you make your download call immediately after that.

Assuming this is the case, I discussed with the team and we are open to accepting a PR that adds an if_none_match kwarg to the blob.download_* methods only (we'd rather not add it more broadly for now). This would take the value and add it as a header to the outgoing request, and (as you say) raise a google.api_core.exceptions.NotModified if we get a 304 from the server. It'd also require adding a system test to validate that this works correctly for a matching and non-matching ETag. Does this sound reasonable to you?

Alternatively, you can also potentially store the generation number locally when the object is first created or accessed, and follow the same pattern with that using the existing if_generation_match.

@daniellehanks
Copy link
Contributor Author

I'm not storing the etags in my backend. The client would be caching them and using them to make conditional requests to my API. My backend is basically just a middle man abstracting the storage layer (and the GCS-specific concept of generations in favor of etags).

Is my understanding correct that there is no way to get the etag from GCS via a single call to one of the download_* functions? Both solutions above make 1 network call in the match case (the first to download_* conditional on etag, the second to reload) and 2 in the non-match case (the first to get the etag which actually isn't even guaranteed to match what we just downloaded as there may have been a concurrent modification, the second to actually do the download).

If my above understanding is correct (that the GCS APIs, not just the client libraries, don't support getting the file content and metadata in a single call), then seems like implementing the etag functionality doesn't actually buy me anything in terms of number of network calls. I guess I had just assumed (incorrectly) that the metadata on the blob would be populated on a call to download_* (and, indeed, some fields like hash and crc32 are, but those are based on the file content I presume).

@daniellehanks
Copy link
Contributor Author

One more question to clarify: Is a conditional call to download that results in a 304 or 412 still billed as a class B operation? If they are not billed, then adding the etag functionality could save us a lot of billed operations.

@tritone
Copy link
Contributor

tritone commented Jun 29, 2021

Ah, so, if I understand correctly:

  • For the first time downloading the file you will download without any precondition. You'll take the ETag or generation from the response and cache it.
  • For subsequent downloads you'll use the ETag or generation that you had from the initial download to set a precondition, and update the cached data for downloads that succeed.

Is that accurate?

I think I was also confused about the blob data getting updated from the download_* methods -- you definitely would need this for preconditions (generation or ETag) to provide any value here. I verified that the JSON API does return certain metadata including ETag and generation in the response headers for GET requests with alt=media. I work mostly in Go, and in that GCS library we expose this data to the user via ReaderObjectAttrs. @tseaver @andrewsg @cojenco any idea why we don't have a way of exposing this in Python? Would it be reasonable to add this for attributes which are available via the response headers?

As far as your billing question, I would have to do some research to follow up on this, but I would suspect that you are still charged for the class B operation (but not for the data egress that a successful request would entail obviously).

@daniellehanks
Copy link
Contributor Author

Yes, your summary of what I am doing is correct.

I found on the billing page that 307, 4XX, and 5XX are generally not billed, but that doesn't include 304 so I would assume same as you that the operation is billed but no network egress.

I wasn't able to quickly curl GCS in between meetings today because auth is hard :). Glad you were able to verify the metadata is in the headers on alt=media requests. Can I amend that to my feature request? :D.

If etag (and whatever else is available in the headers that already has an existing property on the blob class that can be easily mapped) could be populated on the download call, and conditional etag support were added to the download methods, that would reduce a call in the case of a not-match as well as probably simplify the code a decent amount. If it's not too much work (like ~<500 LOC or a weekend), I'd love to contribute.

@daniellehanks
Copy link
Contributor Author

I think there is merit to just updating the metadata on download all by itself. Getting metadata via a separate network request is prone to concurrent modification errors/race conditions. I.e. if I have to download the file and get the etag as two separate requests, I'm not guaranteed that the etag actually belongs to the file contents I got as the file could have been overwritten between calls.

@tritone
Copy link
Contributor

tritone commented Jun 30, 2021

Sounds good! Yeah I think it would make sense to start with a PR to add the metadata available from headers on download to the blob. Then we can do a second PR with ifNotMatch if you think it's needed (but I think using generation should work as well assuming we can expose it?).

When I do a GET in the JSON API with alt=media set, here are the fields that I see available from the headers:

  • generation (as X-goog-generation)
  • metageneration (as X-goog-metageneration)
  • etag (as Etag)
  • storage_class (as X-goog-storage-class)
  • X-goog-hash (not sure if this corresponds to crc32c or md5 though)
  • content_disposition, content_type, cache_control

Side note, I recommend https://developers.google.com/oauthplayground/ for trying out requests, it handles the auth complications for you which is very nice!

@tseaver
Copy link
Contributor

tseaver commented Jun 30, 2021

FWIW, etag is not one of the currently extracted properties during downloads: see Blob._extract_headers_from_download, but it would be easy to add there.

@tritone
Copy link
Contributor

tritone commented Jun 30, 2021

Ahh, thanks for the pointer @tseaver. So actually this is pretty straightforward-- we should extract all of generation, metageneration, and etag there as well.

@tseaver
Copy link
Contributor

tseaver commented Jun 30, 2021

@tritone Yup.

@tseaver
Copy link
Contributor

tseaver commented Jun 30, 2021

@tritone, @daniellehanks Note that an if_none_match parameter should probably take a sequence of ETags, and convert them into a comma-separated string. Depending on expected usage, it might allow passing a string as well.

@daniellehanks
Copy link
Contributor Author

@tritone I don't want to use generation as that is GCS specific. Like I said, my backend is just a middle man and my mobile client is the one caching the result and I want that to be implementation detail (GCS) agnostic, i.e. use standard etag for caching.

Starting with the PR to pipe through the headers on download calls sounds good. I'll try to find some time to work on that this week or next.

@daniellehanks
Copy link
Contributor Author

@tritone, @daniellehanks Note that an if_none_match parameter should probably take a sequence of ETags, and convert them into a comma-separated string. Depending on expected usage, it might allow passing a string as well.

@tseaver the generation match parameters only take a single value, not a sequence. Seems like the two should behave consistently?

@tseaver
Copy link
Contributor

tseaver commented Jun 30, 2021

@daniellehanks RFC 7232 specifies the semantics of the If-None-Match HTTP header:

   The "If-None-Match" header field makes the request method conditional
   on a recipient cache or origin server either not having any current
   representation of the target resource, when the field-value is "*",
   or having a selected representation with an entity-tag that does not
   match any of those listed in the field-value.

@daniellehanks
Copy link
Contributor Author

It looks like the if-match/if-not-match for generations in the JSON API are implemented as query parameters and only seem to allow a single value (documentation). I will align with the etag spec and use a sequence for those. Pointers as to where to dive in in the code for adding the etag conditions? Or should I just search for generation and parallel with etags?

@tseaver
Copy link
Contributor

tseaver commented Jun 30, 2021

  • As @tritone noted, we'll need to update Blob._extract_headers_from_download to extract the ETag, x-goog-generation, and x-goog-metageneration values into the appropriate properties.
  • Blob._do_download already takes headers, so you don't need to touch it.
  • Client.download_blob_to_file currently computes the headers based on the blob's encryption key. Adding the If-None-Match header to that set (after the call to _get_encryption_headers) is the right spot, when if_none_match is not None. Tests will need to check the mock for blob._do_download.
  • All the Blob.download_* methods are now wrappers around Client.download_blob_to_file -- you'll just need to plumb through the if_none_match argument. Unit tests will just need to check the mock for Client.download_blob_to_file.
  • I would add the system test that @tritone requested in tests/system/test_client.py, exercising storage_client.download_blob_to_file, first uploading a blob, then passing an invalid ETag, then passing the current ETag from the blob.

@daniellehanks
Copy link
Contributor Author

@tseaver and @tritone #488 for part 1. I will start work on part 2 in a separate branch of my fork.

@daniellehanks
Copy link
Contributor Author

Please correct me if I'm wrong, but it appears the upload APIs don't support if-match/if-none-match headers. Since the scope of this request was for download functionality, this doesn't inhibit my use case. Just something to be aware of that etags will not fully parallel generations.

@daniellehanks
Copy link
Contributor Author

Draft PR #489 for conditional reads on etag. I just need to add the system test(s) and then I will remove draft status.

@tseaver
Copy link
Contributor

tseaver commented Jul 6, 2021

@daniellehanks, @tritone I split out the missing header->property population as #490.

gcf-merge-on-green bot pushed a commit that referenced this issue Jul 8, 2021
Support conditional requests based on ETag for read operations (`reload`, `exists`, `download_*`). My own testing seems to indicate that the JSON API does not support ETag If-Match/If-None-Match headers on modify requests (`patch`, `delete`, etc.), please correct me if I am mistaken.

This part two of #451. Part one in #488.
Fixes #451 🦕
cojenco pushed a commit to cojenco/python-storage that referenced this issue Oct 13, 2021
Support conditional requests based on ETag for read operations (`reload`, `exists`, `download_*`). My own testing seems to indicate that the JSON API does not support ETag If-Match/If-None-Match headers on modify requests (`patch`, `delete`, etc.), please correct me if I am mistaken.

This part two of googleapis#451. Part one in googleapis#488.
Fixes googleapis#451 🦕
cojenco pushed a commit to cojenco/python-storage that referenced this issue Oct 13, 2021
Support conditional requests based on ETag for read operations (`reload`, `exists`, `download_*`). My own testing seems to indicate that the JSON API does not support ETag If-Match/If-None-Match headers on modify requests (`patch`, `delete`, etc.), please correct me if I am mistaken.

This part two of googleapis#451. Part one in googleapis#488.
Fixes googleapis#451 🦕
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the googleapis/python-storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants