Issues serving via S3 static website #1611

Open
cs1jmc opened this issue Nov 21, 2023 · 5 comments

@cs1jmc

cs1jmc commented Nov 21, 2023

I've run into an issue when trying to pull packages from a bucket-backed static site, but I can't tell whether the cause is my config or a change in static site behaviour (and how pip deals with it):

WARNING: Skipping page http://<bucket name.region>.amazonaws.com/mirror/web/simple/pillow/ because the GET request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
ERROR: Could not find a version that satisfies the requirement pillow (from versions: none)
ERROR: No matching distribution found for pillow
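
A minimal sketch for reproducing the check outside pip, assuming only the Python standard library. The set of accepted types is copied from the warning above; the helper names `is_supported` and `fetch_content_type` are made up for illustration and are not part of pip or bandersnatch.

```python
from urllib.request import urlopen

# Content types listed as supported in the pip warning above.
SUPPORTED_TYPES = {
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",
}


def is_supported(content_type: str) -> bool:
    """Return True if the media type (ignoring parameters) is in the list above."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in SUPPORTED_TYPES


def fetch_content_type(url: str) -> str:
    """GET the URL and return its Content-Type header ('' if missing)."""
    with urlopen(url) as response:
        return response.headers.get("Content-Type", "")
```

Calling `is_supported(fetch_content_type(url))` on a simple index URL should show whether the page would be skipped for the reason given in the warning.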

I notice that curling /web/simple/<package>/ returns a 302 which leads me to think this is more of a static site / pip handling issue that would affect the bandersnatch implementation:

<html>
<head><title>302 Moved Temporarily</title></head>
<body>
<h1>302 Moved Temporarily</h1>
<ul>
<li>Code: Found</li>
<li>Message: Resource Found</li>
<li>RequestId:</li>
<li>HostId:</li>
</ul>
<hr/>
</body>
</html>

My current deployment of bandersnatch uses the template below as the base for its configuration:

[mirror]

directory = /{{ s3_bucket_name }}/{{ s3_file_prefix }}
storage-backend = s3
diff-file = /{{ s3_bucket_name }}/{{ s3_file_prefix }}/{{ s3_diff_file }}

json = false
master = https://pypi.org
timeout = 60
hash-index = false
workers = 6
stop-on-error = false
delete-packages = true

[s3]

region_name = {{ aws_region }}
aws_access_key_id = {{ s3_access_key }}
aws_secret_access_key = {{ s3_secret_key }}
endpoint_url = {{ s3_endpoint_url }}
signature_version = s3v4

[plugins]
enabled =
    exclude_platform
    allowlist_project

[blocklist]
platforms =
    macos
    freebsd

[allowlist]
packages =
    {%+ for package in package_allowlist -%}{{ package }}
    {% endfor %}

I'm wondering whether this is a misconfiguration on my part or a recent change on the AWS side that breaks this design.

@cooperlees
Contributor

This is definitely a serving configuration issue. You need S3 to send the Content-Type HTTP header as text/html if you're serving an index.html, or as application/vnd.pypi.simple.v1+json if you're serving the JSON file, to make pip happy ...

My quick search (linked above) says there is no default and you're somehow sending Content-Type: binary/octet-stream. So correcting that should help fix the issue.

I'm happy to take documentation updates to https://bandersnatch.readthedocs.io/en/latest/storage_options.html#amazon-s3 (source file) if you feel our docs are lacking. Sadly, I've never set up an S3-based mirror, so I can't help much more here.

@cs1jmc
Author

cs1jmc commented Nov 27, 2023

I've taken a second look at things with a fresh pair of eyes. I think you pointed me in the right direction with the Content-Type.

From what I can tell, the bandersnatch s3 plugin isn't specifying a MIME type when doing a PutObject to S3, which results in AWS giving the object the default of binary/octet-stream:

aws s3api head-object --bucket <bucketname> --key web/simple/index.html
{
    "AcceptRanges": "bytes",
    "LastModified": "2023-11-27T12:24:18+00:00",
    "ContentLength": 422,
    "ETag": omitted,
    "ContentType": "binary/octet-stream",
    "ServerSideEncryption": "AES256",
    "Metadata": {}
}
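
For illustration, a rough sketch of what setting the type at upload time could look like with a boto3-style client. This is not bandersnatch's code path (which goes through S3Path); `guess_content_type` and `upload` are hypothetical helpers, and the `.v1_json` suffix for the JSON simple index is an assumption about the mirror's file naming.

```python
import mimetypes

# Assumption: the JSON simple index files in the mirror end in ".v1_json";
# adjust the suffix if your mirror names them differently.
PYPI_JSON_SUFFIX = ".v1_json"
PYPI_JSON_TYPE = "application/vnd.pypi.simple.v1+json"


def guess_content_type(key: str) -> str:
    """Pick a Content-Type for an object key, falling back to S3's default."""
    if key.endswith(PYPI_JSON_SUFFIX):
        return PYPI_JSON_TYPE
    guessed, _encoding = mimetypes.guess_type(key)
    return guessed or "binary/octet-stream"


def upload(client, bucket: str, key: str, body: bytes) -> None:
    """Upload with an explicit ContentType; `client` is e.g. boto3.client("s3")."""
    client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType=guess_content_type(key),
    )
```

With an upload like this, a head-object on web/simple/index.html would be expected to report text/html rather than binary/octet-stream.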

From some surface-level digging, it looks like S3Path is being used to get the files to S3, and there's a conversation about passing the Content-Type as a parameter in an existing issue:

liormizr/s3path#83 (comment)

I sadly lack the talent and knowledge on bandersnatch to know how to go about fixing things. (If what I mention sounds right)

@cooperlees cooperlees reopened this Nov 27, 2023
@cooperlees
Contributor

Ahh, so if it is set at upload / write time, then this is indeed a bandersnatch bug. Nice find.

I'm asking on that issue whether there are plans for a friendlier API and how we can edit the ContentType of existing files ...

@LeoQuote
Contributor

LeoQuote commented Dec 3, 2023

You can use a CDN to serve the mirror, which could be cheaper and also lets you change the Content-Type.

If this is for internal use only, use https://github.com/pottava/aws-s3-proxy together with nginx to set the Content-Type.

@inthecloud247

I also encountered this bug in the s3 server... until it's fixed I had to do a recursive fix of the content-types of the index.html pages in my bucket:

aws s3 cp \
       s3://MY_BUCKET/data/web/simple/ \
       s3://MY_BUCKET/data/web/simple/ \
       --exclude '*' \
       --include '*.html' \
       --no-guess-mime-type \
       --content-type="text/html" \
       --metadata-directive="REPLACE" \
       --recursive
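
The same in-place copy can be sketched in Python, assuming a boto3 S3 client is passed in. `fix_html_content_types` is a hypothetical helper name; `list_objects_v2` and `copy_object` are the boto3 client calls it relies on. Note that `MetadataDirective="REPLACE"` replaces the object's metadata as a whole, so this is only a sketch for objects without custom metadata.

```python
def fix_html_content_types(client, bucket: str, prefix: str) -> int:
    """Copy every *.html object under `prefix` onto itself with text/html.

    `client` is expected to behave like boto3.client("s3").
    Returns the number of objects rewritten.
    """
    fixed = 0
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching keys have no "Contents" entry.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.endswith(".html"):
                continue
            client.copy_object(
                Bucket=bucket,
                Key=key,
                CopySource={"Bucket": bucket, "Key": key},
                ContentType="text/html",
                MetadataDirective="REPLACE",
            )
            fixed += 1
    return fixed
```

Usage would be along the lines of `fix_html_content_types(boto3.client("s3"), "MY_BUCKET", "data/web/simple/")`, with the bucket and prefix taken from the command above.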
