Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETag on artifacts seems inconsistent #15387

Open
thatch opened this issue Feb 15, 2024 · 3 comments
Open

ETag on artifacts seems inconsistent #15387

thatch opened this issue Feb 15, 2024 · 3 comments
Labels
bug 🐛 CDN/network Issues related to our CDN, users having problems connecting to PyPI

Comments

@thatch
Copy link

thatch commented Feb 15, 2024

Describe the bug
I'm trying to issue Range requests for some bytes out of pypi wheels. The ETag response header is not constant, and I think it should be.

The immediate application is some code to extract metadata (yes, I know it's being backfilled), but would like to use this same code as a stopgap to enable targeted malware scanning of large wheels without downloading the whole thing.

I need to figure out if this is something that can be fixed, or if I need to work around changing etags by sniffing the Cache-Control: immutable or something.

Expected behavior
Requesting ranges from the beginning and end of wheels returns the same etag and content-type.

To Reproduce

$ curl -I https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl 
...
etag: "40d570a6b0f1aa3c1de02d4bf7b8e3ae-2"
content-type: binary/octet-stream

$ curl -I -H "Range: bytes=0-1" https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl
...
etag: "40d570a6b0f1aa3c1de02d4bf7b8e3ae-2"
content-type: binary/octet-stream

$ curl -I -H "Range: bytes=10000000-10000001" https://files.pythonhosted.org/packages/88/8a/76a32ba459b4c376cc3780dca0f600bbbb63b3610249a068f7eb20991ee3/pandas-2.2.0-cp310-cp310-macosx_10_9_x86_64.whl
etag: "a0ee78f286e2d5f37a25879378e0c316"
content-type: application/octet-stream

The last-modified dates also differ by a few seconds, and the server and x-amz-server-side-encryption response headers are only present when requesting the beginning of the file. The cutoff appears to be at 3145728 bytes (3 * 2**20).

I'm assuming this is either bad cached content (if the etag has definitively changed) or a misbehavior in s3 (the docs are unclear on whether it can change over time, and my read of the mozilla writeup on etags is that it shouldn't depend on the range). Happy to work around it if you have ideas for doing so safely. The caching layer doesn't appear to support If-Range requests so I went down the road of verifying them myself.

@thatch thatch added bug 🐛 requires triaging maintainers need to do initial inspection of issue labels Feb 15, 2024
thatch added a commit to hdeps/hdeps that referenced this issue Feb 15, 2024
Either the blob storage or caching layer for files.pythonhosted.org is serving
inconsistent etags for requests before and after the 3MB mark, so disable for
now.  More context is at pypi/warehouse#15387
@ewdurbin
Copy link
Member

I cannot find docs on the expected contents of our B2 backends ETag, but S3 is pretty explicit

ETag
The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
...

In this case, I did confirm that the stored objects are identical at least.

It is clear in your case that some responses are coming from S3 and some from B2 which leads to the ETag mismatch. Due to how our caches are populated there is a time where objects (or segments of objects) may be pulled from S3 before B2 has the object.

It doesn't appear that there is any header which carries a common value we could use to ensure consistency of ETag across resources.

Should we just stop serving the header to avoid confusion?

@dstufft
Copy link
Member

dstufft commented Feb 15, 2024

I think we can set our own ETag in vcl_fetch, probably right here https://github.com/pypi/infra/blob/main/terraform/file-hosting/vcl/files.vcl#L154

We would just need a value to set that ETag too. We have a hash in the URL we could extract and use, or even sillier:

set beresp.http.ETag = digest.hash_md5(req.url);

Or we could add a value here: https://github.com/pypi/warehouse/blob/main/warehouse/forklift/legacy.py#L1455-L1459 which would be available as beresp.http.x-pypi-whatever (I think? I don't see where the pypi- prefix is being added, but we're referencing it here: https://github.com/pypi/infra/blob/main/terraform/file-hosting/vcl/files.vcl#L589).

@dstufft
Copy link
Member

dstufft commented Feb 15, 2024

I also think it's fine to remove ETag, we have cache headers that say to cache these responses for effectively forever, so I don't think we should be getting revalidate checks very often.

@miketheman miketheman added CDN/network Issues related to our CDN, users having problems connecting to PyPI and removed requires triaging maintainers need to do initial inspection of issue labels Feb 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐛 CDN/network Issues related to our CDN, users having problems connecting to PyPI
Projects
None yet
Development

No branches or pull requests

4 participants