Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add flag to force html mode #673

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

nacnudus
Copy link

@nacnudus nacnudus commented Jul 1, 2022

This is an attempt to address #671 by adding a flag --html to parse
the input as HTML. Otherwise, STDIN and local files without the .html
suffix are parsed as plain text.

This is an attempt to address lycheeverse#671 by adding a flag `--html` to parse
the input as HTML.  Otherwise, STDIN and local files without the `.html`
suffix are parsed as plain text.
@mre
Copy link
Member

mre commented Jul 7, 2022

Thanks for pushing this forward @nacnudus.

Some questions:

  • What if a user wants to check different file types in a single run? E.g. one Markdown file and one HTML file. Would the --html flag always override the file extension?
  • Are there any other tools that have a similar feature? I know that ripgrep has a --type option, which allows a user to filter input files by type (but not parse/interpret the input files differently). That's the closest I could find. (A precedent would help us make the best design decision for lychee.)

We could also change the option to --input-type <format>, which is a bit more flexible. E.g. --input-type markdown would enforce the Markdown parser instead. It still would not work for different input types, though, but that might be fine.

Alternatively, we could just add a section to the TROUBLESHOOTING.md file with a workaround on how to enforce the correct file type by storing the input to a temporary file with the html extension. Not pretty, but we avoid maintaining the option.

Looking forward to your thoughts.

@nacnudus
Copy link
Author

Hi, thanks for your reply, and apologies for my slowness responding.

  • What if a user wants to check different file types in a single run? E.g. one Markdown file and one HTML file. Would the --html flag always override the file extension?

Yes, --html would always override the file extension. I really have unix pipes in mind, which don't have file extensions.

  • Are there any other tools that have a similar feature?

Imagemagick requires an image format specifier to read an APNG image sequence, otherwise it assumes that it is PNG and reads only the first frame.

We could also change the option to --input-type <format>

I agree, that would be better.

Alternatively, we could just add a section to the TROUBLESHOOTING.md file with a workaround on how to enforce the correct file type by storing the input to a temporary file with the html extension.

That wouldn't work for STDIN, but perhaps lychee isn't the best tool for that any, because pipes can't benefit from lychee's multithreading, as far as I know.

@lebensterben
Copy link
Member

currently we relies on extension to determine the file format, and non supported format are treated as plaintext where we simply grab urls.

this process is VERY inefficient due to the underlying implementation of linkify.

instead, I think we can TRY to parse plaintext as html first, should that fail, we fallback to linkify.

this is helpful given the most likely use case of lychee is to validate links in html rather than any other formats.

@mre
Copy link
Member

mre commented Jan 17, 2023

this is helpful given the most likely use case of lychee is to validate links in html rather than any other formats.

I'm not sure if that's actually true. Most users use lychee-action and I wouldn't be surprised if the majority of links are checked in Markdown files.

this process is VERY inefficient due to the underlying implementation of linkify.

Not sure if parsing HTML is faster, either. We'd have to run a benchmark to see if trying HTML first would give us any significant performance gains.

@lebensterben
Copy link
Member

plain text file are text/plain and markdown files are text/markdown so

I think we can TRY to parse plaintext as html first

doesn't apply to mark down files and as such

I wouldn't be surprised if the majority of links are checked in Markdown files

this is irrelevant.

@lebensterben
Copy link
Member

Not sure if parsing HTML is faster, either. We'd have to run a benchmark to see if trying HTML first would give us any significant performance gains.

Parsing HTML is definitely faster than using linkify on normal sized html files.
This is because a parsed html file is a structured data and links will only appear in certain tags.

@HU90m
Copy link
Contributor

HU90m commented Sep 3, 2023

My preference would be a ripgrep/fd style type option, specifically a --default-type option which allows the user to override the behaviour for a file without an extension, e.g. --default-type html.

@mre
Copy link
Member

mre commented Sep 3, 2023

I like that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants