Make the types of HTML tag checked selectable and document that all are checked by default #749

Open
mmuehlfeldRH opened this issue Jul 20, 2023 · 3 comments

@mmuehlfeldRH

Summary

If a web site embeds an SVG image by using <embed type="image/svg+xml" src="file.svg">, linkchecker incorrectly interprets the file in "src" as a link.

Steps to reproduce

  1. linkchecker --recursion-level=1 https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/getting-started-with-tipc_configuring-and-managing-networking

Actual result

URL        `images/TIPC-architectural-overview.svg'
Parent URL https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/getting-started-with-tipc_configuring-and-managing-networking, line 4989, col 299
Real URL   https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/images/TIPC-architectural-overview.svg
Check time 2.564 seconds
Size       9KB
Result     Error: 404 Not Found

Expected result

Do not interpret <embed type="image/svg+xml" src="images/TIPC-architectural-overview.svg"> as a link.

Source code snippet of the example page mentioned above

<div class="informalfigure">
  <div class="mediaobject">
    <object type="image/svg+xml" class="svg-img" data="https://access.redhat.com/webassets/avalon/d/Red_Hat_Enterprise_Linux-8-Configuring_and_managing_networking-en-US/images/ddad5eeb9b394bb21e78c9d605c4842b/TIPC-architectural-overview.svg">
      <embed type="image/svg+xml" src="images/TIPC-architectural-overview.svg"><!--Empty--></embed>
    </object>
  </div>
</div>
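
For reference, the "Real URL" reported above is what standard relative-URL resolution produces for the `src` value of the `<embed>` tag. The following is a minimal sketch using Python's `urllib.parse.urljoin`; it only illustrates the resolution rule and is not claimed to be how linkchecker resolves URLs internally.

```python
from urllib.parse import urljoin

# Parent URL and embed src exactly as reported in the issue.
parent_url = (
    "https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/"
    "configuring_and_managing_networking/"
    "getting-started-with-tipc_configuring-and-managing-networking"
)
embed_src = "images/TIPC-architectural-overview.svg"

# A relative reference replaces the last path segment of the base URL.
real_url = urljoin(parent_url, embed_src)
print(real_url)
```

The result ends in `.../configuring_and_managing_networking/images/TIPC-architectural-overview.svg`, which is the URL that returned 404, while the working copy of the image is referenced by the `data` attribute of the enclosing `<object>` tag.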

Environment

  • Operating system: Fedora 38 x86_64
  • Linkchecker version: 10.2.1
  • Python version: 3.11.4
  • Install method: RPM from Fedora repository
@cjmayo
Contributor

cjmayo commented Jul 20, 2023

In what way is this incorrect?

All elements that can contain links are checked:

LinkTags = {
    'a': ['href'],
    'applet': ['archive', 'src'],
    'area': ['href'],
    'audio': ['src'],  # HTML5
    'bgsound': ['src'],
    'blockquote': ['cite'],
    'body': ['background'],
    'button': ['formaction'],  # HTML5
    'del': ['cite'],
    'embed': ['pluginspage', 'src'],
    # ... (the excerpt ends here; the remaining entries are not shown)
}
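
To illustrate how a tag-to-attribute table like the excerpt above can drive URL extraction, here is a minimal sketch built on Python's standard `html.parser`. It is not linkchecker's actual implementation: the class `UrlAttributeCollector`, the name `LINK_TAGS_EXCERPT`, and the sample HTML are hypothetical, and the table contains only the entries visible in the excerpt.

```python
from html.parser import HTMLParser

# Only the entries visible in the excerpt; the real table is longer.
LINK_TAGS_EXCERPT = {
    "a": ["href"],
    "applet": ["archive", "src"],
    "area": ["href"],
    "audio": ["src"],
    "bgsound": ["src"],
    "blockquote": ["cite"],
    "body": ["background"],
    "button": ["formaction"],
    "del": ["cite"],
    "embed": ["pluginspage", "src"],
}


class UrlAttributeCollector(HTMLParser):
    """Hypothetical helper: collect (tag, attribute, value) for listed attributes."""

    def __init__(self, link_tags):
        super().__init__()
        self.link_tags = link_tags
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Look up which attributes of this tag may carry a URL.
        wanted = self.link_tags.get(tag, [])
        for name, value in attrs:
            if name in wanted and value:
                self.found.append((tag, name, value))


# Sample markup modelled on the page snippet from the issue (shortened URL is made up).
html = (
    '<object type="image/svg+xml" data="https://example.com/figure.svg">'
    '<embed type="image/svg+xml" src="images/TIPC-architectural-overview.svg">'
    "</object>"
    '<a href="page.html">next</a>'
)
collector = UrlAttributeCollector(LINK_TAGS_EXCERPT)
collector.feed(html)
for item in collector.found:
    print(item)
```

With this table, the `src` of `<embed>` is collected alongside the `href` of `<a>`, which matches the behaviour described in the issue. The `<object>` tag is skipped here only because it does not appear in the truncated excerpt.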

@mmuehlfeldRH
Author

For me, as a normal web site user, a link is what I can click on an HTML page and that brings me to a different page or an anchor on the same page (mostly <a href...>). I didn't expect that the tool interprets "link" as "every absolute and relative path in each tag that supports such paths", and this meaning is also not defined in the docs.

How about adding the list with the tags to the man page to clarify what is tested:

CHECKED TAGS

Linkchecker verifies links in a wide range of HTML tags:

Link tags and attributes:
   a: href
   applet: archive, src
   area: href
   audio: src
   ...

Anchor tags and attributes
...

WML tags and attributes
...

Since this list very likely doesn't change frequently, it might be OK to simply copy/paste it to the docs.

What would also be cool: if the user could pass linkchecker an option with the list of tags to check. For example: -l a,button,del. Of course, this option would only accept tags that are in the list in the source code, and it would use the attributes defined there. Then users could decide on their own, and it would significantly speed up testing if a user only cares about links in a few tags.
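
The proposed selection could be sketched roughly as follows. This is only an illustration of the idea: linkchecker has no such option today, the `-l` flag is the suggestion from this comment, and the function `select_tags` and the shortened table `LINK_TAGS` are hypothetical names introduced for the example.

```python
# Shortened, hypothetical stand-in for the tag table in the source code.
LINK_TAGS = {
    "a": ["href"],
    "area": ["href"],
    "button": ["formaction"],
    "del": ["cite"],
    "embed": ["pluginspage", "src"],
}


def select_tags(spec, link_tags):
    """Return the subset of link_tags named in a comma-separated spec.

    Raises ValueError if the spec names a tag that is not in the table,
    so only tags with known URL attributes are accepted.
    """
    names = [name.strip().lower() for name in spec.split(",") if name.strip()]
    unknown = [name for name in names if name not in link_tags]
    if unknown:
        raise ValueError("unsupported tag(s): " + ", ".join(unknown))
    return {name: link_tags[name] for name in names}


# Equivalent of the suggested "-l a,button,del".
selected = select_tags("a,button,del", LINK_TAGS)
print(selected)
```

Rejecting unknown tags up front keeps the option limited to tags whose URL attributes are already defined in the table, as suggested above.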

@cjmayo
Contributor

cjmayo commented Jul 21, 2023

"Link" does actually mean the same to me. There are just 20+ years of history to look after! And I don't think URLchecker would have been such a great name.

I guess the intent was that every attribute that can take a URL is checked.

Let's keep this ticket as a feature request.

cjmayo changed the title from "SVG files embedded by using the <embed> tag cause false positive 404 errors" to "Make the types of HTML tag checked selectable and document that all are checked by default" on Jul 21, 2023