Optimize AnchorCheck plugin so it does not download the same file multiple times when checking anchor references #781

Open
mmuehlfeldRH opened this issue Dec 18, 2023 · 2 comments

@mmuehlfeldRH

Summary

If an HTML file contains many links that refer to anchors in the same (or a different) file, linkchecker wastes a lot of time and bandwidth because it downloads the file once for every anchor link it checks. This isn't necessary and slows down testing significantly.

For example, if an HTML file contains 1000 anchors and, additionally, one reference to each of these anchors, then linkchecker downloads the file 1000 times to check all anchors. This takes more than 2 minutes.
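
To illustrate: stripping the URL fragment maps all of these links to a single resource, so one download would suffice. A minimal Python sketch (the address is the example server used in the steps below):

    from urllib.parse import urldefrag

    # The 1000 anchor links from the example above:
    links = [f"http://192.0.2.2/test.html#anchor{i}" for i in range(1, 1001)]

    # All of them point at one and the same document:
    print({urldefrag(link).url for link in links})
    # prints: {'http://192.0.2.2/test.html'}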

Steps to reproduce

  1. Create an HTML file that contains 1000 sections with anchors, and a link to each of the anchors. You can use the following script to generate such an HTML file:

    echo "<html><head><title>Demo</title></head><body>" > test.html
    
    for i in `seq 1 1000` ; do
            echo "<h1 id=\"anchor$i\">Example $i</h1>" >> test.html
            echo "<a href=\"#anchor$i\">Link to section Example $i.</a>" >> test.html
    done
    
    echo "</body></html>" >> test.html
    
  2. Store the generated HTML file on a web server.

  3. Ensure that the web server sends a LinkChecker header to prevent linkchecker from throttling the connection.

    $ curl -s --head http://192.0.2.2 | grep LinkChecker
    LinkChecker: allow-concurrent-checks
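
For example, with Apache httpd and mod_headers enabled, this header can be added with the following directive (Apache is only one option; any mechanism that sets the response header works):

    Header set LinkChecker "allow-concurrent-checks"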
    
  4. Create /tmp/linkcheckerrc with the following content (the empty [AnchorCheck] section enables the plugin):

    [checking]
    maxrequestspersecond=1000
    
    [AnchorCheck]
    
  5. Run linkchecker:

    $ time linkchecker -f /tmp/linkcheckerrc --threads=40 --recursion-level=1 --quiet --file-output=csv/utf-8//tmp/output.csv --check-extern http://192.168.0.2/test.html
    

Actual result

Linkchecker downloads test.html again for each link to an anchor within that file (1000x), which is unnecessary.

40 threads active,   951 links queued,   10 links in 1001 URLs checked, runtime 1 seconds
...
40 threads active,    16 links queued,  945 links in 1001 URLs checked, runtime 2 minutes, 7 seconds

On the web server, you can also see that the file was downloaded 1000 times:

# grep "test.html" /var/log/httpd/access_log | wc -l
1000

Expected result

If multiple referenced anchors are within the same file, it would be much more efficient to download that file only once and perform all anchor checks at once.

For example, suppose linkchecker tests main.html, and this file contains 100 links to anchors in the same file plus 100 links to anchors in external.html. Then the following procedure would be efficient (a sketch follows after the list):

  1. Aggregate all links with anchors by file name (main.html, external.html).
  2. Check all anchors that refer to the file itself, <a href="#...">, in main.html (at this point the content is already loaded, because that is the file under test).
  3. Download external.html once.
  4. Check all anchors that are linked in main.html and refer to an anchor in external.html.
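
A minimal Python sketch of this aggregation idea (a standalone illustration with hypothetical helper names, not LinkChecker's actual internal API):

    from collections import defaultdict
    from html.parser import HTMLParser
    from urllib.parse import urldefrag
    from urllib.request import urlopen


    class AnchorCollector(HTMLParser):
        """Collect all anchor targets (id attributes, <a name> values) in a page."""

        def __init__(self):
            super().__init__()
            self.anchors = set()

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if "id" in attrs:
                self.anchors.add(attrs["id"])
            if tag == "a" and "name" in attrs:
                self.anchors.add(attrs["name"])


    def check_anchor_links(links):
        """Check the fragments of the given absolute URLs, one download per file."""
        # Step 1: aggregate all fragments by the file they point at.
        fragments_by_file = defaultdict(set)
        for link in links:
            url, fragment = urldefrag(link)
            if fragment:
                fragments_by_file[url].add(fragment)

        # Steps 2/3: fetch each file exactly once (a real implementation would
        # reuse the already-downloaded content of the page under test).
        broken = []
        for url, fragments in fragments_by_file.items():
            collector = AnchorCollector()
            with urlopen(url) as response:
                collector.feed(response.read().decode("utf-8", errors="replace"))
            # Step 4: verify all fragments against the collected anchors at once.
            broken.extend(f"{url}#{f}" for f in sorted(fragments - collector.anchors))
        return broken

With this approach, the reproduction above would issue a single GET request for test.html, no matter how many anchor links it contains.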

Environment

  • Operating system: Linux demo 6.6.6-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Dec 11 17:29:08 UTC 2023 x86_64 GNU/Linux
  • Linkchecker version: 10.4.0
  • Python version: 3.12.0
  • Install method: Cloned from git repository
@Kristinita

Type: Additional information

I ran linkchecker without AnchorCheck on my Windows 11 machine for the internal links of my real project:

Content types: 43 image, 302 text, 0 video, 0 audio, 144 application, 1 mail and 2067 other.
URL lengths: min=16, max=634, avg=71.

That's it. 2557 links in 2557 URLs checked. 0 warnings found. 0 errors found.
Stopped checking at 2024-01-24 08:20:55+003 (2 minutes, 57 seconds)

Then I enabled [AnchorCheck] in my linkcheckerrc and ran linkchecker for internal links of my real project again:

Content types: 43 image, 14195 text, 0 video, 0 audio, 144 application, 1 mail and 2123 other.
URL lengths: min=16, max=996, avg=226.

That's it. 16506 links in 16506 URLs checked. 0 warnings found. 0 errors found.
Stopped checking at 2024-01-24 10:36:25+003 (1 hour, 31 minutes)

91 minutes instead of 3 minutes is incredibly slow.

In September 2022, pull request #661 “Greatly improve AnchorCheck performance” was opened. I hope it will be reviewed.

Thanks.

@shepherdjerred

I have also noticed that my documentation takes a very long time to check despite all of the files being local. They have a lot of anchor links, so I suspect this is the root cause.

With all of that being said, thank you so much to the authors of the anchor check plugin. So few link checkers support checking anchors.
