
Add CI for checking for broken links manually, weekly and in PRs #1633

Open

wants to merge 49 commits into main

Conversation

markcmiller86
Member

@markcmiller86 markcmiller86 commented Apr 18, 2023

Resolves: #1431

Links will get checked automatically weekly at 5:17 AM on Sundays. They can also be checked manually by just running the workflow.

We need to adjust the filters used in the checker, and a lot of broken links have been found (see #1632).

  • Set list of Reviewers (at least one).
  • Add to Project BSSw Internal.
  • [ ] View the modified *.md files as rendered in GitHub.
  • [ ] If changes are to the GitHub pages site under the docs/ directory, consider viewing locally with Jekyll.
  • Watch for PR check failures.
  • Make any final changes to the PR based on feedback and review GitHub (and Jekyll) rendered files.
  • Ensure at least one reviewer signs off on the changes.
  • Once a reviewer has approved and the PR checks pass, merge the PR.

@markcmiller86 markcmiller86 changed the title from Create check-published-links-weekly.yml to Add CI for checking for broken links weekly (and manually) Apr 18, 2023
@bartlettroscoe
Member

@markcmiller86, please @mention me and let me know when this is ready to review

@markcmiller86 markcmiller86 marked this pull request as draft April 18, 2023 21:50
@markcmiller86
Member Author

Apologies...I converted this to DRAFT for the time being.

@markcmiller86
Member Author

Have adjusted it to ignore...

  • docs -- those links are often not really public...the EB should fix them, but we can do that as they are encountered.
  • Events -- these are short lived, and their links are likely to go stale over time.

@markcmiller86
Member Author

Ok, this addresses all the cases in #1632.

I am pulling it out of DRAFT mode. It's ready for review.

@markcmiller86 markcmiller86 marked this pull request as ready for review April 19, 2023 00:00
@markcmiller86
Member Author

@rinkug, @bartlettroscoe and @bernhold this is now ready for review.

The weekly CI check should result in a small list of either bona fide broken links or false positives (i.e., the checker says a link is broken when it is not). Someone can then check each one and either edit files to fix it, add it to the ignore patterns, or just ignore it.

@markcmiller86
Member Author

markcmiller86 commented Feb 2, 2024

And as an alternative to fancy automation, could we post the failed links as a new issue? If the number of bad links is not too large for any one run, we can keep a couple of the issues around and someone can manually compare them for repeats.
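Roughly, posting the failures could look something like the sketch below (just an illustration, not part of this PR; the repo string, token variable, and helper name are placeholders), assuming the workflow has already collected the failed URLs into a list:

# Hypothetical sketch: open a GitHub issue listing the failed URLs via the REST API.
# Assumes GITHUB_TOKEN is set (e.g., by the Actions runner) and failed_urls was
# collected from the checker output; the repo and function names are placeholders.
import os
import requests

def post_failed_links_issue(failed_urls, repo="OWNER/REPO"):
    body = "The scheduled link check found the following failing URLs:\n\n"
    body += "\n".join(f"- {url}" for url in failed_urls)
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": "Link check failures", "body": body},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]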

Alternatively, we don't do this as CI at all (at least not scheduled, or maybe not even manual...PR checks are still good though). Instead, we use Python tools: someone on the EB periodically runs checks with locally available tooling, updates the list of failure cases, and then attempts to fix truly broken links as they are encountered.

@vsoch
Contributor

vsoch commented Feb 2, 2024

I'm happy to fix the bugs, but as I mentioned I can't work on this during work time. You just pinged me 1-2 days ago so you haven't really given me a chance to work on it. I would suggest adding the URLs to the skip list for now.

@markcmiller86
Member Author

I'm happy to fix the bugs, but as I mentioned I can't work on this during work time. You just pinged me 1-2 days ago so you haven't really given me a chance to work on it. I would suggest adding the URLs to the skip list for now.

@vsoch I mentioned you here in this conversation for a different reason: your thoughts and experience, if any, in dealing with and developing a strategy for the larger issue that, with high probability, some of the (over 5,500 -- and still growing) links will always be failing for reasons other than the link actually needing to be replaced.

@vsoch
Contributor

vsoch commented Feb 2, 2024

Ah gotcha! If you want to make new additions or changes speedier, then I'd recommend changed-files: https://github.com/marketplace/actions/changed-files. I use that for container matrices so I only build containers with an updated Dockerfile. If you are concerned about existing links breaking (across many files), a dumb thing I do is to segment a list of things (e.g., paths) into equal lists by matching hashes to calendar month days, then run the check each night for just that day's (small) subset: https://github.com/vsoch/split-list-action. You probably don't want to be checking everything on every PR, every time!
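The gist of the split-list idea, as a hand-wavy sketch (not the action's actual code):

# Hand-wavy sketch of the split-list idea: bucket paths by a stable hash of the
# path modulo the number of days in the current month, then check only the
# bucket that matches today's day of the month.
import calendar
import datetime
import hashlib

def todays_subset(paths):
    today = datetime.date.today()
    days_in_month = calendar.monthrange(today.year, today.month)[1]

    def bucket(path):
        digest = hashlib.md5(path.encode()).hexdigest()
        return int(digest, 16) % days_in_month + 1

    return [p for p in paths if bucket(p) == today.day]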

@bartlettroscoe
Member

I submit that the perfect is the enemy of the good enough. I hope that there are knobs we can tweak to get the failure rate low enough that it is humanly manageable.

Thinking about this some more, I hope that will be the case if we set things up well and have well-defined responsibilities (see below).

The urlchecker action has a number of parameters for retries and timeouts and such.

Retries in the same run of urlchecker will not help if a link is broken because of an expired cert or a machine upgrade, etc. Only running urlchecker days later will address that type of issue.

And as an alternative to fancy automation, could we post the failed links as a new issue? If the number of bad links is not too large for any one run, we can keep a couple of the issues around and someone can manually compare them for repeats. I think the results of the URL check could be posted as an issue using actions.

That will just create a bunch of GH issues that people will ignore because most of these will be false failures. This will just clutter up the GitHub issue DB. (I have a lot of experience observing human behavior about reacting to GH issues.)

You need a process where there are very few false failures. My experience over 20+ years observing human behavior is that if the number of false failures is above say 30%, most people will just assume it could be a false failure and ignore it.

And maybe weekly is too often to run it. How about biweekly or monthly instead, just to reduce the frequency at which someone has to look at the results?

...

Instead, we use Python tools and someone on the EB periodically runs checks and updates the list of failure cases and then attempts to fix truly broken links as they are encountered.

How do people determine what is "truly broken" without manually clicking on the links in a browser? But I guess if the number of random failures stays low (say less than 10 total on any run of urlchecker on every *.md file) and you only run this check every month or so, then it would not take too much time to manually check these links. And my experience is that most external links stay around for quite some time. It is only older articles with links put in 1+ years ago where you have much of a problem with truly broken links.

So I would suggest something like:

  • Set up to run urlchecker on all of the *.md files once per month and have it send out emails to the editorial team when it fails (with a link to the GHA output to see what the failures are, and with some comments in the GHA output about expecting some random link-check failures).

  • Someone on the editorial team will look over the failures and manually check all of the broken links (e.g., assuming there are fewer than 10 or so).

But who will be responsible for looking at the link check results? (If it is everyone's responsibility, then it is no one's responsibility.)

The above seems like a reasonable process, as long as someone is willing to spend 10 minutes every month looking at potential broken links.

Alternatively, we don't do this as CI at all (at least not scheduled, or maybe not even manual...PR checks are still good though).

It seems that the GHA job will only check links in changed *.md files, so I would argue that it will not create a lot of false failures (because most *.md files have only a few links). So I would argue we should run this GHA check on all PRs (and add some comments to the GHA output about expecting some random link-check failures).

@bartlettroscoe
Member

@vsoch, from the discussion above, I don't think there is anything wrong with urlchecker itself, or anything that can be done inside of urlchecker to address the fundamental problem discussed above. The problem is that checking external links and downloading data from those links is inherently going to lead to some number of random false failures because of issues with the external sites themselves (like expired certs). What is needed is a process that runs on top of sampled data produced by a checker like urlchecker: since you expect some number of random failures, you run it several times over a period of days or weeks to generate sampled datasets. Then, with this set of data collected over a period of time, you have a higher-level analysis tool look over that data for link checks that are consistently failing and focus only on those. No one expects a checking tool like urlchecker to be able to do that.
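To make the idea concrete, the higher-level analysis could be as simple as the sketch below (the file layout is hypothetical; assume each run of the checker dumps its set of failed URLs to a JSON file):

# Rough sketch (hypothetical file layout): each checker run writes its failed
# URLs to a JSON list; only URLs that failed in every saved run are reported.
import json
from pathlib import Path

def consistently_failing(results_dir="link-check-results"):
    runs = [set(json.loads(p.read_text()))
            for p in sorted(Path(results_dir).glob("*.json"))]
    if not runs:
        return set()
    return set.intersection(*runs)

if __name__ == "__main__":
    for url in sorted(consistently_failing()):
        print(url)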

@markcmiller86
Member Author

No one expects a checking tool like urlchecker to be able to do that.

I agree...That said, @vsoch has added a --no-check-certs option to the checker to address the expired-certs failures. Thanks @vsoch 💪🏻. She is also looking into making urlchecker do just HEAD requests to make it faster (and not download the linked content as part of the checking process).

I just enlisted @vsoch for her thoughts on this broader question to see if she might be able to suggest solutions.

@bartlettroscoe
Member

@vsoch has added a --no-check-certs option to the checker to address the expired certs failures.

@markcmiller86, when is this going to be released in a new version of urlstechie/urlchecker-action, which is being used by check-urls.yml?

It looks like there has not been a commit to the master branch of that repo in 2 years going back to urlstechie/urlchecker-action@b643b43.

@vsoch
Contributor

vsoch commented Feb 3, 2024

Ping @davidbeckingsale - I have many ideas for how to address the issues here, but I don't have the bandwidth to work on it soon, even in free time - there are just other things I need/want to do first. I think an RSE could help here.

@vsoch
Contributor

vsoch commented Feb 3, 2024

@bartlettroscoe that is my library - the action is just a wrapper for urlchecker-python. And yes, it's been running smoothly for 2 years and hasn't had any feature requests or issues. I'm lead maintainer for these projects, they have about 330 users. Feel free to use something else.

@vsoch
Contributor

vsoch commented Feb 3, 2024

Hi @bartlettroscoe @markcmiller86 I wanted to follow up here - realizing that probably I'm best oriented to work on this, I put aside the work I wanted to do for this afternoon and tackled this issue. I have a new release of urlchecker-python (0.0.35) and a branch with the action that you can test. Importantly:

  • the test branch now has support for --no-check-certs exposed as no_check_certs under the action
  • we use a HEAD request, fall back to GET, and close the response (thanks for this insight @markcmiller86 -- I do believe it will be faster for large pages, which most tend to be these days); see the sketch just after this list
  • I fixed (what I consider to be) a bug where we could not target a single file (you now can, if you install / run it locally)
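Roughly what the HEAD-then-GET logic does (a simplified sketch, not the exact urlchecker-python code):

# Simplified sketch of the HEAD-with-GET-fallback idea (not the exact
# urlchecker-python implementation): try a cheap HEAD first; if it does not
# succeed, retry with GET and close the response immediately instead of
# downloading the body.
import requests

def url_is_ok(url, timeout=10, verify=True):
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout, verify=verify)
        if resp.status_code >= 400:
            resp = requests.get(url, stream=True, allow_redirects=True,
                                timeout=timeout, verify=verify)
            resp.close()
        return resp.status_code < 400
    except requests.RequestException:
        return False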

The main issues had nothing to do with the above, and actually came down to changes in the selenium interface. When I debugged I found this change, which meant that we were not using a web driver at all and were relying solely on requests. The fact that most were failing is a reflection of a huge change in the web - it used to be the case that most would work for requests, and there was only an odd case or two where you needed the driver. Now more sites are able to detect (what comes down to) scrapers, and they don't allow this. So since our webdriver was failing, this was resulting in bad results.

@markcmiller86 could you please test out the branch, with the added option to not check certificates, and let me know your feedback? If/when it is good I can merge that PR and do another release of urlchecker-action. Thanks!

@markcmiller86
Member Author

@markcmiller86 could you please test out the branch, with the added option to not check certificates, and let me know your feedback? If/when it is good I can merge that PR and do another release of urlchecker-action. Thanks!

@vsoch this is great news! 🎉 💪. Thanks so much for tackling this ❤️.

I will try to test before end of weekend.

@vsoch
Contributor

vsoch commented Feb 4, 2024

Looks like we can try to suppress the warnings to clean up the output a bit:

import requests
requests.packages.urllib3.disable_warnings() 

@markcmiller86
Member Author

@vsoch I tested it as an action, using the @update/0.0.35 moniker to specify the action version. I tried with and without no_check_certs. It seems to be working as expected. I can't speak quantitatively to whether it went any faster, but it seems to go a bit faster.

@vsoch
Contributor

vsoch commented Feb 4, 2024

I am preparing a branch with tweaks, such as silencing those warnings. Let me know if you want something else tested before I push a test image and we can try again.

@markcmiller86
Member Author

Looks like we can try to suppress the warnings to clean up the output a bit:

import requests
requests.packages.urllib3.disable_warnings() 

Hmm...I kinda like those. I mean, one issue I have with most actions is that when things go wrong, there is very little information as to why. Maybe tie this output to your verbose: true variable as well? I think it could be important to actually see the cause if desired.
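Something like this is what I mean (just a sketch; verbose here stands in for however the action exposes that setting):

# Sketch: only silence the urllib3 InsecureRequestWarning noise when the user
# has not asked for verbose output; "verbose" is a stand-in for the action's setting.
import urllib3

def configure_warnings(verbose: bool):
    if not verbose:
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)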

@vsoch
Contributor

vsoch commented Feb 4, 2024

okay, happy not to do any more work then! 😆

Development

Successfully merging this pull request may close these issues.

Periodic CI to find links that have gone bad