How to defeat CloudFlare challenges? #738

Open
wom-bat opened this issue May 10, 2023 · 3 comments

@wom-bat

wom-bat commented May 10, 2023

Summary

Our website has lists of publications with links to the original publishers. Many of these are protected by Cloudflare challenges, and LinkChecker reports them as 403 Forbidden.

Steps to reproduce

Check a webpage that contains a link to a Cloudflare-protected page.
For instance,
The HACMS program: using formal methods to eliminate exploitable bugs

Actual result

LinkChecker reports a 403 Forbidden response for the link.

Expected result

The link check passes, as the link is valid.

Environment

  • Operating system: Debian GNU/Linux 11.7
  • Linkchecker version: 10.2
  • Python version: 3.9
  • Install method: Built from distribution package

Configuration

DEBUG linkcheck.cmdline 2023-05-10 11:06:58,728 MainThread configuration: [('aborttimeout', 300),
('allowedschemes', []),
('authentication', []),
('checkextern', False),
('cookiefile', None),
('csv', {}),
('debugmemory', False),
('dot', {}),
('enabledplugins', []),
('externlinks', []),
('failures', {}),
('fileoutput', []),
('gml', {}),
('gxml', {}),
('html', {}),
('ignoreerrors', []),
('ignorewarnings', []),
('internlinks', []),
('localwebroot', None),
('logger', 'NoneLogger'),
('loginextrafields', {}),
('loginpasswordfield', 'password'),
('loginurl', None),
('loginuserfield', 'login'),
('maxfilesizedownload', 5242880),
('maxfilesizeparse', 1048576),
('maxhttpredirects', 10),
('maxnumurls', None),
('maxrequestspersecond', 10),
('maxrunseconds', None),
('nntpserver', None),
('none', {}),
('output', 'text'),
('pluginfolders', []),
('quiet', False),
('recursionlevel', -1),
('resultcachesize', 100000),
('robotstxt', True),
('sitemap', {}),
('sql', {}),
('sslverify', '/etc/ssl/certs/ca-certificates.crt'),
('status', True),
('status_wait_seconds', 5),
('text', {}),
('threads', 10),
('timeout', 60),
('trace', False),
('useragent',
'Mozilla/5.0 (compatible; LinkChecker/10.2.1; '
'+https://linkchecker.github.io/linkchecker/)'),
('verbose', False),
('warnings', True),
('xml', {})]
WARNING linkcheck.cmdline 2023-05-10 11:06:58,729 MainThread no files or URLs given

Logs

/usr/bin/linkchecker -F 'html/var/www/html/linkcheck/index.html' --no-status --ignore-url='^https?://twitter.com/' --ignore-url=print$ --ignore-url='^mailto:' --ignore-url='https?://scholar.google.com(.au)?/.*' --user-agent='Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101 Firefox/38.0 SeaMonkey/2.35' --check-extern https://trustworthy.systems/publications
Read the documentation at https://linkchecker.github.io/linkchecker/
Write comments and bugs to https://github.com/linkchecker/linkchecker/issues

Start checking at 2023-05-10 11:11:43+011

URL `https://thesis.cse.unsw.edu.au/search?search_query=heiser&search_by=Supervisor'
Name `official list'
Parent URL https://trustworthy.systems/students/theses, line 221, col 4
Real URL https://thesis.cse.unsw.edu.au/search?search_query=heiser&search_by=Supervisor
Check time 1.038 seconds
Size 145B
Result Error: 404 Not Found

URL `http://dx.doi.org/10.1002/%28SICI%291099-159X%28199611/12%294:6%3C399::AID-PIP148%3E3.0.CO;2-4'
Parent URL https://trustworthy.systems/people/?cn=Gernot%20Heiser, line 2246, col 1
Real URL https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199611/12)4:6%3C399::AID-PIP148%3E3.0.CO;2-4
Check time 1.304 seconds
Info Redirected to `https://dx.doi.org/10.1002/(SICI)1099-159X(199611/12)4%3A6%3C399%3A%3AAID-PIP148%3E3.0.CO;2-4'.
Redirected to `https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199611/12)4:6%3C399::AID-PIP148%3E3.0.CO;2-4'.
Result Error: 403 Forbidden

URL `http://dx.doi.org/10.1002/%28SICI%291099-159X%28199609/10%294:5%3C355::AID-PIP145%3E3.0.CO;2-X'
Parent URL https://trustworthy.systems/people/?cn=Gernot%20Heiser, line 2270, col 1
Real URL https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199609/10)4:5%3C355::AID-PIP145%3E3.0.CO;2-X
Check time 1.131 seconds
Info Redirected to `https://dx.doi.org/10.1002/(SICI)1099-159X(199609/10)4%3A5%3C355%3A%3AAID-PIP145%3E3.0.CO;2-X'.
Redirected to `https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199609/10)4:5%3C355::AID-PIP145%3E3.0.CO;2-X'.
Result Error: 403 Forbidden

URL `http://dx.doi.org/10.1002/pip.4670020103'
Parent URL https://trustworthy.systems/people/?cn=Gernot%20Heiser, line 2426, col 1
Real URL https://onlinelibrary.wiley.com/doi/10.1002/pip.4670020103
Check time 1.287 seconds
Info Redirected to `https://dx.doi.org/10.1002/pip.4670020103'.
Redirected to `https://onlinelibrary.wiley.com/doi/10.1002/pip.4670020103'.
Result Error: 403 Forbidden

URL `http://onlinelibrary.wiley.com/doi/10.1002/cpe.597/abstract'
Parent URL https://trustworthy.systems/people/?cn=Gerwin%20Klein, line 1323, col 1
Real URL https://onlinelibrary.wiley.com/doi/10.1002/cpe.597/abstract
Check time 2.509 seconds
Info Redirected to `https://onlinelibrary.wiley.com/doi/10.1002/cpe.597/abstract'.
Result Error: 403 Forbidden

Other notes

There are quite a few Python libraries out there that purport to bypass Cloudflare's protection.

@mind-bending-forks

I came here to report what I believe to be fundamentally the same problem, which is not specific to Cloudflare. (Happy to split this out as a separate issue if it is considered different.)

The documents I am checking may contain links to pages on websites that are ordinarily only accessible when logged in. This is expected. When such links are visited without being logged in (as is the case when accessed by linkchecker), the website returns 403 Forbidden and presents information about how to log in or redirects to a login page. If such links are visited by a human and the human logs in with credentials known to them, they are likely to be taken to the original URL as a logged in user. In that sense, the links are valid. I don't know what the URLs will be in advance, nor which ones will be behind a login, so I can't define a rule to exclude the affected URLs. I just expect that there will be URLs of that type.

I wish to configure LinkChecker to accept URLs that reach a valid server that subsequently responds with 403 Forbidden, rather than treating them as an error. Returning a warning, rather than an error, would be fine.

How can I instruct LinkChecker to not report an error when the server responds with 403 Forbidden?

Thanks.

@cjmayo
Contributor

cjmayo commented Sep 19, 2023

How can I instruct LinkChecker to not report an error when the server responds with 403 Forbidden?

Same status code, different question, I suspect. Fortunately so, because there is a solution: the linkcheckerrc ignoreerrors setting, in the output section:

https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html#url-checking-results

Haven't tried it for this. A URL regular expression is required, but a match-anything .* will probably work (plus a message regular expression, of course, to match just a 403).
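
For illustration, a minimal and untested sketch of that suggestion, written in Python so the two patterns can be tried offline. The `[output]` section and the `ignoreerrors` key come from the linkcheckerrc documentation linked above. The pattern pair `.* ^403 Forbidden`, the helper names `load_rules` and `is_ignored`, and the matching semantics (a regex search on the URL and on the result message) are assumptions for this sketch, not LinkChecker's actual code.

```python
import configparser
import re

# Hypothetical linkcheckerrc fragment: one "URL_REGEX MESSAGE_REGEX" pair per line.
RC_TEXT = """
[output]
ignoreerrors=
  .* ^403 Forbidden
"""


def load_rules(rc_text):
    """Return (url_regex, message_regex or None) pairs from [output] ignoreerrors."""
    parser = configparser.RawConfigParser()
    parser.read_string(rc_text)
    rules = []
    for line in parser.get("output", "ignoreerrors").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Everything after the first space is treated as the message pattern.
        url_pat, _, msg_pat = line.partition(" ")
        rules.append((re.compile(url_pat), re.compile(msg_pat) if msg_pat else None))
    return rules


def is_ignored(rules, url, message):
    """Assumed semantics: the URL must match and, if a message pattern is given, so must the message."""
    return any(
        url_re.search(url) and (msg_re is None or msg_re.search(message))
        for url_re, msg_re in rules
    )


rules = load_rules(RC_TEXT)
# Result messages as they appear after "Error:" in the log above.
print(is_ignored(rules, "https://onlinelibrary.wiley.com/doi/10.1002/pip.4670020103", "403 Forbidden"))
print(is_ignored(rules, "https://thesis.cse.unsw.edu.au/search", "404 Not Found"))
```

With this pair, a 403 on any URL would be ignored while the 404 from the log would still be reported as an error. A narrower URL pattern in place of `.*` would limit the exemption to known publishers.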

@mind-bending-forks

Thank you for this @cjmayo.

I had to overcome the hurdle that the ignoreerrors config value is only supported from LinkChecker 10.2.0 onwards, but once I'd done that, this appears to work well.

On further investigation, I discovered that one of the URLs returning 403 Forbidden for me (an institution page on ResearchGate) is in fact affected by the Cloudflare blocking problem. The ignoreerrors approach simply masks that problem. It would be advantageous if LinkChecker could instead validate the response that a normal user would be presented with when visiting the URL, as has been requested here. I'm not sure how Cloudflare decides to block the request. Overriding the user agent with one from an up-to-date browser did not work.
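
One way to tell the two kinds of 403 in this thread apart might be to look at the response headers. A small sketch, assuming Cloudflare marks challenge pages with a `cf-mitigated: challenge` header (as its documentation describes) and identifies itself with `server: cloudflare`. The function name `classify_403` and the sample header values are hypothetical, and none of this is LinkChecker functionality.

```python
def classify_403(headers):
    """Heuristically label a 403 response from its headers (names compared case-insensitively)."""
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    if lowered.get("cf-mitigated") == "challenge":
        # Cloudflare itself issued a challenge page.
        return "cloudflare-challenge"
    if lowered.get("server") == "cloudflare":
        # Proxied through Cloudflare; the 403 may still come from the origin (e.g. a login wall).
        return "behind-cloudflare"
    return "other"


# Hypothetical header sets, for illustration only.
print(classify_403({"Server": "cloudflare", "CF-Mitigated": "challenge"}))
print(classify_403({"Server": "cloudflare"}))
print(classify_403({"Server": "nginx"}))
```

A result of `cloudflare-challenge` would suggest the link itself may be fine for a human visitor, whereas the other two labels say nothing either way.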
