
Add CI for checking for broken links manually, weekly and in PRs #1633

Open

wants to merge 49 commits into main

Conversation

markcmiller86
Member

@markcmiller86 markcmiller86 commented Apr 18, 2023

Resolves: #1431

Links will get checked automatically weekly at 5:17 AM on Sundays. They can also be checked manually by just running the workflow.

We need to adjust the filters used in the checker, and a lot of broken links have been found (see #1632).

  • Set list of Reviewers (at least one).
  • Add to Project BSSw Internal.
  • [ ] View the modified *.md files as rendered in GitHub.
  • [ ] If changes are to the GitHub pages site under the docs/ directory, consider viewing locally with Jekyll.
  • Watch for PR check failures.
  • Make any final changes to the PR based on feedback and review GitHub (and Jekyll) rendered files.
  • Ensure at least one reviewer signs off on the changes.
  • Once a reviewer has approved and the PR checks pass, merge the PR.

@markcmiller86 markcmiller86 changed the title from Create check-published-links-weekly.yml to Add CI for checking for broken links weekly (and manually) Apr 18, 2023
@bartlettroscoe
Member

@markcmiller86, please @mention me and let me know when this is ready to review

@markcmiller86 markcmiller86 marked this pull request as draft April 18, 2023 21:50
@markcmiller86
Member Author

Apologies...I converted this to DRAFT for the time being.

@markcmiller86
Member Author

Have adjusted it to ignore...

  • docs -- those links are often not really public...the EB should fix them, but we can do that as they are encountered.
  • Events -- these are short lived, and their links are likely to go stale over time.

@markcmiller86
Member Author

Ok, this addresses all the cases in #1632.

I am pulling it out of DRAFT mode. It's ready for review.

@markcmiller86 markcmiller86 marked this pull request as ready for review April 19, 2023 00:00
@markcmiller86
Member Author

@rinkug, @bartlettroscoe and @bernhold this is now ready for review.

The weekly CI check should result in a small list of either bona fide broken links or false positives (i.e., the checker says a link is broken when it is not). Someone can then check each one and either edit files to fix it, add it to the ignore patterns, or just ignore it.

@markcmiller86
Member Author

markcmiller86 commented Feb 2, 2024

And as an alternative to fancy automation, could we post the failed links as a new issue? If the number of bad links is not too large for any one run, we can keep a couple of the issues around and someone can manually compare them for repeats.
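Roughly, posting the failures could look something like the sketch below (just an illustration, not part of this PR; the repo string, token variable, and helper name are placeholders), assuming the workflow has already collected the failed URLs into a list:

# Hypothetical sketch: open a GitHub issue listing the failed URLs via the REST API.
# Assumes GITHUB_TOKEN is set (e.g., by the Actions runner) and failed_urls was
# collected from the checker output; the repo and function names are placeholders.
import os
import requests

def post_failed_links_issue(failed_urls, repo="OWNER/REPO"):
    body = "The scheduled link check found the following failing URLs:\n\n"
    body += "\n".join(f"- {url}" for url in failed_urls)
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": "Link check failures", "body": body},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]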

Alternatively, we don't do this as CI at all (at least not scheduled, or maybe not even manual...PR checks are still good though). Instead, we use Python tools: someone on the EB periodically runs checks with locally available tooling, updates the list of failure cases, and then attempts to fix truly broken links as they are encountered.

@vsoch
Contributor

vsoch commented Feb 2, 2024

I'm happy to fix the bugs, but as I mentioned I can't work on this during work time. You just pinged me 1-2 days ago so you haven't really given me a chance to work on it. I would suggest adding the URLs to the skip list for now.

@markcmiller86
Member Author

I'm happy to fix the bugs, but as I mentioned I can't work on this during work time. You just pinged me 1-2 days ago so you haven't really given me a chance to work on it. I would suggest adding the URLs to the skip list for now.

@vsoch I mentioned you here in this conversation for a different reason: your thoughts and experience, if any, in dealing with and developing a strategy for the larger issue that, with high probability, some of the (over 5,500 -- and still growing) links will always be failing for reasons other than the link actually needing to be replaced.

@vsoch
Contributor

vsoch commented Feb 2, 2024

Ah gotcha! If you want to make new additions or changes speedier, then I'd recommend changed-files: https://github.com/marketplace/actions/changed-files. I use that for container matrices so I only build containers with an updated Dockerfile. If you are concerned about existing links breaking (across many files), a dumb thing I do is to segment a list of things (e.g., paths) into equal lists by matching hashes to calendar month days, then run the check each night for just that day's (small) subset: https://github.com/vsoch/split-list-action. You probably don't want to be checking everything on every PR, every time!
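The gist of the split-list idea, as a hand-wavy sketch (not the action's actual code):

# Hand-wavy sketch of the split-list idea: bucket paths by a stable hash of the
# path modulo the number of days in the current month, then check only the
# bucket that matches today's day of the month.
import calendar
import datetime
import hashlib

def todays_subset(paths):
    today = datetime.date.today()
    days_in_month = calendar.monthrange(today.year, today.month)[1]

    def bucket(path):
        digest = hashlib.md5(path.encode()).hexdigest()
        return int(digest, 16) % days_in_month + 1

    return [p for p in paths if bucket(p) == today.day]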

@bartlettroscoe
Member

I submit that the perfect is the enemy of the good enough. I hope that there are knobs we can tweak to get the failure rate low enough that it is humanly manageable.

Thinking about this some more, I hope that will be the case if we set things up well and have well-defined responsibilities (see below).

The urlchecker action has a number of parameters for retries and timeouts and such.

Retries in the same run of urlchecker will not help if a link is broken because of an expired cert or a machine upgrade, etc. Only running urlchecker days later will address that type of issue.

And as an alternative to fancy automation, could we post the failed links as a new issue? If the number of bad links is not too large for any one run, we can keep a couple of the issues around and someone can manually compare them for repeats. I think the results of the URL check could be posted as an issue using actions.

That will just create a bunch of GH issues that people will ignore because most of these will be false failures. This will just clutter up the GitHub issue DB. (I have a lot of experience observing human behavior about reacting to GH issues.)

You need a process where there are very few false failures. My experience over 20+ years observing human behavior is that if the number of false failures is above say 30%, most people will just assume it could be a false failure and ignore it.

And maybe weekly is too often to run it. How about biweekly or monthly instead, just to reduce the frequency at which someone has to look at the results?

...

Instead, we use Python tools and someone on the EB periodically runs checks and updates the list of failure cases and then attempts to fix truly broken links as they are encountered.

How do people determine what is "truly broken" without manually clicking on the links in a browser? But I guess if the number of random failures stays low (say less than 10 total on any run of urlchecker on every *.md file) and you only run this check every month or so, then it would not take too much time to manually check these links. And my experience is that most external links stay around for quite some time. It is only older articles with links put in 1+ years ago where you have much of a problem with truly broken links.

So I would suggest something like:

  • Set up to run urlchecker on all of the *.md files once per month and have it send out emails to the editorial team when it fails (with a link to the GHA output to see what the failures are, and with some comments in the GHA output about expecting some random link-check failures).

  • Someone on the editorial team will look over the failures and manually check all of the broken links (e.g., assuming there are fewer than 10 or so).

But who will be responsible for looking at the link check results? (If it is everyone's responsibility, then it is no one's responsibility.)

The above seems like a reasonable process, as long as someone is willing to spend 10 minutes every month looking at potential broken links.

Alternatively, we don't do this as CI at all (at least not scheduled, or maybe not even manual...PR checks are still good though).

It seems that the GHA job will only check links in changed *.md files, so I would argue that it will not create a lot of false failures (because most *.md files have only a few links). So I would argue we should run this GHA check on all PRs (and add some comments to the GHA output about expecting some random link-check failures).

@bartlettroscoe
Member

@vsoch, from the discussion above, I don't think there is anything wrong with urlchecker itself, or anything that can be done inside of urlchecker to address the fundamental problem discussed above. The problem is that checking external links and downloading data from those links is inherently going to lead to some number of random false failures because of issues with the external sites themselves (like expired certs). What is needed is a process that runs on top of sampled data produced by a checker like urlchecker: since you expect some number of random failures, you run it several times over a period of days or weeks to generate sampled datasets. Then, with this set of data collected over a period of time, you have a higher-level analysis tool look over that data for link checks that are consistently failing and focus only on those. No one expects a checking tool like urlchecker to be able to do that.
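To make the idea concrete, the higher-level analysis could be as simple as the sketch below (the file layout is hypothetical; assume each run of the checker dumps its set of failed URLs to a JSON file):

# Rough sketch (hypothetical file layout): each checker run writes its failed
# URLs to a JSON list; only URLs that failed in every saved run are reported.
import json
from pathlib import Path

def consistently_failing(results_dir="link-check-results"):
    runs = [set(json.loads(p.read_text()))
            for p in sorted(Path(results_dir).glob("*.json"))]
    if not runs:
        return set()
    return set.intersection(*runs)

if __name__ == "__main__":
    for url in sorted(consistently_failing()):
        print(url)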

@markcmiller86
Member Author

No one expects a checking tool like urlchecker to be able to do that.

I agree...That said, @vsoch has added a --no-check-certs option to the checker to address the expired-certs failures. Thanks @vsoch 💪🏻. She is also looking into making urlchecker do just HEAD requests to make it faster (and not download the linked content as part of the checking process).

I just enlisted @vsoch for her thoughts on this broader question to see if she might be able to suggest solutions.

@bartlettroscoe
Member

@vsoch has added a --no-check-certs option to the checker to address the expired certs failures.

@markcmiller86, when is this going to be released in a new version of urlstechie/urlchecker-action, which is being used by check-urls.yml?

It looks like there has not been a commit to the master branch of that repo in 2 years going back to urlstechie/urlchecker-action@b643b43.

@vsoch
Contributor

vsoch commented Feb 3, 2024

Ping @davidbeckingsale - I have many ideas for how to address the issues here, but I don't have the bandwidth to work on it soon, even in free time - there are just other things I need/want to do first. I think an RSE could help here.

@vsoch
Contributor

vsoch commented Feb 3, 2024

@bartlettroscoe that is my library - the action is just a wrapper for urlchecker-python. And yes, it's been running smoothly for 2 years and hasn't had any feature requests or issues. I'm lead maintainer for these projects, they have about 330 users. Feel free to use something else.

@vsoch
Contributor

vsoch commented Feb 3, 2024

Hi @bartlettroscoe @markcmiller86 I wanted to follow up here - realizing that probably I'm best oriented to work on this, I put aside the work I wanted to do for this afternoon and tackled this issue. I have a new release of urlchecker-python (0.0.35) and a branch with the action that you can test. Importantly:

  • the test branch now has support for --no-check-certs exposed as no_check_certs under the action
  • we use a HEAD request, fall back to GET, and close the response (thanks for this insight @markcmiller86 -- I do believe it will be faster for large pages, which most tend to be these days); see the sketch just after this list
  • I fixed (what I consider to be) a bug where we could not target a single file (you now can, if you install / run it locally)
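Roughly what the HEAD-then-GET logic does (a simplified sketch, not the exact urlchecker-python code):

# Simplified sketch of the HEAD-with-GET-fallback idea (not the exact
# urlchecker-python implementation): try a cheap HEAD first; if it does not
# succeed, retry with GET and close the response immediately instead of
# downloading the body.
import requests

def url_is_ok(url, timeout=10, verify=True):
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout, verify=verify)
        if resp.status_code >= 400:
            resp = requests.get(url, stream=True, allow_redirects=True,
                                timeout=timeout, verify=verify)
            resp.close()
        return resp.status_code < 400
    except requests.RequestException:
        return False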

The main issues had nothing to do with the above, and actually came down to changes in the selenium interface. When I debugged I found this change, which meant that we were not using a web driver at all and were relying solely on requests. The fact that most were failing is a reflection of a huge change in the web - it used to be the case that most would work for requests, and there was only an odd case or two where you needed the driver. Now more sites are able to detect (what comes down to) scrapers, and they don't allow this. So since our webdriver was failing, this was resulting in bad results.

@markcmiller86 could you please test out the branch, with the added option to not check certificates, and let me know your feedback? If/when it is good I can merge that PR and do another release of urlchecker-action. Thanks!

@markcmiller86
Member Author

@markcmiller86 could you please test out the branch, with the added option to not check certificates, and let me know your feedback? If/when it is good I can merge that PR and do another release of urlchecker-action. Thanks!

@vsoch this is great news! 🎉 💪. Thanks so much for tackling this ❤️.

I will try to test before end of weekend.

@vsoch
Contributor

vsoch commented Feb 4, 2024

Looks like we can try to suppress the warnings to clean up the output a bit:

import requests
requests.packages.urllib3.disable_warnings() 

@markcmiller86
Member Author

@vsoch I tested it as an action, using the @update/0.0.35 moniker to specify the action version. I tried with and without no_check_certs. It seems to be working as expected. I can't speak quantitatively to whether it went any faster, but it seems to go a bit faster.

@vsoch
Contributor

vsoch commented Feb 4, 2024

I am preparing a branch with tweaks, such as silencing those warnings. Let me know if you want something else tested before I push a test image and we can try again.

@markcmiller86
Member Author

Looks like we can try to suppress the warnings to clean up the output a bit:

import requests
requests.packages.urllib3.disable_warnings() 

Hmm...I kinda like those. I mean, one issue I have with most actions is that when things go wrong, there is very little information as to why. Maybe tie this output to your verbose: true variable as well? I think it could be important to actually see the cause if desired.
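Something like this is what I mean (just a sketch; verbose here stands in for however the action exposes that setting):

# Sketch: only silence the urllib3 InsecureRequestWarning noise when the user
# has not asked for verbose output; "verbose" is a stand-in for the action's setting.
import urllib3

def configure_warnings(verbose: bool):
    if not verbose:
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)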

@vsoch
Contributor

vsoch commented Feb 4, 2024

okay, happy not to do any more work then! 😆

Development

Successfully merging this pull request may close these issues.

Periodic CI to find links that have gone bad