Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new srcset parsing breaks on sentry-docs #1190

Open
untitaker opened this issue Jul 31, 2023 · 2 comments
Open

new srcset parsing breaks on sentry-docs #1190

untitaker opened this issue Jul 31, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@untitaker
Copy link
Collaborator

untitaker commented Jul 31, 2023

#1160 broke running lychee on sentry-docs. Found by bisecting 434baa8..HEAD.

In order to repro, check out sentry-docs, run yarn && yarn build and run lychee --offline public/, or use the attached archive.

I tried to make a minimal example, but I failed to do so because I ran into a different, much older bug.

Apparently lychee ignores absolute paths under some circumstances:

$ cat index.html
<img src="/trace-view.png" srcSet="/trace-view.png 300w" />

$ lychee index.html  # or lychee .
  0/0 ━━━━━━━━━━━━━━━━━━━━ Finished extracting links                                                      🔍 0 Total (in 0s) ✅ 0 OK 🚫 0 Errors

this is mysterious to me because sentry-docs uses plenty of absolute URLs.

Is there some nuance to how lychee treats absolute paths?


About the archive.

I had to figure out how to attach 500MB of html to a github issue when github would only allow 25 MB uploads.

The solution I came up with was to make all non-html files empty (which does not affect link-checking), and compressing the file with zstd -9 instead of gzip to get from 50MB to 7MB.

since github does not allow any upload other than .tgz, the file extension is incorrect too. you can't extract this with a regular archival tool, you need to use:

zstdcat public.tgz | tar xz

this will create a public/ folder in the cwd.

Download archive here: public.tgz

@mre
Copy link
Member

mre commented Aug 1, 2023

Ouch, sorry to hear that.

Is there some nuance to how lychee treats absolute paths?

You need to set the base directory with --base:

> echo '<img src="/trace-view.png" srcSet="/trace-view.png 300w" />' > index.html
> lychee --dump index.html
> lychee --dump --base . index.html
file:///private/tmp/trace-view.png
file:///private/tmp/trace-view.png

I managed to reproduce the bug locally.

Before (latest released version):

> lychee --offline --base . --dump platforms/apple/dsym/index.html | grep static | sort                                                                                                                                                                        
file:///private/tmp/sentry/static/49827f66243ac136dced551c15e979aa/0a47e/integration-tokens.png
file:///private/tmp/sentry/static/49827f66243ac136dced551c15e979aa/5a46d/integration-tokens.png
file:///private/tmp/sentry/static/49827f66243ac136dced551c15e979aa/c1b63/integration-tokens.png
file:///private/tmp/sentry/static/49827f66243ac136dced551c15e979aa/c1b63/integration-tokens.png
file:///private/tmp/sentry/static/49827f66243ac136dced551c15e979aa/c830c/integration-tokens.png
file:///private/tmp/sentry/static/49827f66243ac136dced551c15e979aa/c830c/integration-tokens.png

After (current master):

> lychee --offline --base . --dump platforms/apple/dsym/index.html | grep static | sort
file:///private/tmp/sentry/platforms/apple/dsym/static/49827f66243ac136dced551c15e979aa/0a47e/integration-tokens.png
file:///private/tmp/sentry/platforms/apple/dsym/static/49827f66243ac136dced551c15e979aa/c1b63/integration-tokens.png
file:///private/tmp/sentry/platforms/apple/dsym/static/49827f66243ac136dced551c15e979aa/c830c/integration-tokens.png
file:///private/tmp/sentry/static/49827f66243ac136dced551c15e979aa/5a46d/integration-tokens.png
file:///private/tmp/sentry/static/49827f66243ac136dced551c15e979aa/c1b63/integration-tokens.png
file:///private/tmp/sentry/static/49827f66243ac136dced551c15e979aa/c830c/integration-tokens.png
file:///private/tmp/sentry/static/chatbot-61b09f737259fcb58e92a8c1ce9961cd.svg

The HTML snippet in question (simplified):

<img src="/static/49827f66243ac136dced551c15e979aa/c1b63/integration-tokens.png" srcSet="/static/49827f66243ac136dced551c15e979aa/5a46d/integration-tokens.png 300w,/static/49827f66243ac136dced55
    1c15e979aa/0a47e/integration-tokens.png 600w,/static/49827f66243ac136dced551c15e979aa/c1b63/integration-tokens.png 1200w,/static/49827f66243ac136dced551c15e979aa/c830c/integration-tokens.png 1754w" />

You can see that it uses the correct path in the first URL of the srcset, but not the remaining ones.

What's weird is that I only touched the extraction logic. The base path handling happens later in the pipeline.
Something's odd.

Will look into fixing that, when I find the time, unless someone else beats me to it.
In the meantime, I recommend using the latest version from crates.io.

Thanks for the report and the archive. It helped troubleshooting the issue.

@mre mre added the bug Something isn't working label Aug 2, 2023
@mre
Copy link
Member

mre commented Jan 5, 2024

@untitaker, I was not able to fix this before 0.14.0. This is unfortunate, but I didn't want to delay the release any longer. Here's hoping that we can fix the regression in the next cycle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants