How to defeat CloudFlare challenges? #738

wom-bat · 2023-05-10T01:22:17Z

Summary

Our website has lists of publications with links to the original publishers. Many of these links are protected by Cloudflare challenges, and LinkChecker reports them as 403 Forbidden.

Steps to reproduce

Check a webpage that contains a link to a Cloudflare-protected page.
For instance,
The HACMS program: using formal methods to eliminate exploitable bugs

Actual result

The link check gives a 403 Forbidden response.

Expected result

The link check should pass, as the link is valid.

Environment

Operating system: Debian GNU/Linux 11.7
Linkchecker version: 10.2
Python version: 3.9
Install method: Built from distribution package

Configuration

DEBUG linkcheck.cmdline 2023-05-10 11:06:58,728 MainThread configuration: [('aborttimeout', 300),
('allowedschemes', []),
('authentication', []),
('checkextern', False),
('cookiefile', None),
('csv', {}),
('debugmemory', False),
('dot', {}),
('enabledplugins', []),
('externlinks', []),
('failures', {}),
('fileoutput', []),
('gml', {}),
('gxml', {}),
('html', {}),
('ignoreerrors', []),
('ignorewarnings', []),
('internlinks', []),
('localwebroot', None),
('logger', 'NoneLogger'),
('loginextrafields', {}),
('loginpasswordfield', 'password'),
('loginurl', None),
('loginuserfield', 'login'),
('maxfilesizedownload', 5242880),
('maxfilesizeparse', 1048576),
('maxhttpredirects', 10),
('maxnumurls', None),
('maxrequestspersecond', 10),
('maxrunseconds', None),
('nntpserver', None),
('none', {}),
('output', 'text'),
('pluginfolders', []),
('quiet', False),
('recursionlevel', -1),
('resultcachesize', 100000),
('robotstxt', True),
('sitemap', {}),
('sql', {}),
('sslverify', '/etc/ssl/certs/ca-certificates.crt'),
('status', True),
('status_wait_seconds', 5),
('text', {}),
('threads', 10),
('timeout', 60),
('trace', False),
('useragent',
'Mozilla/5.0 (compatible; LinkChecker/10.2.1; '
'+https://linkchecker.github.io/linkchecker/)'),
('verbose', False),
('warnings', True),
('xml', {})]
WARNING linkcheck.cmdline 2023-05-10 11:06:58,729 MainThread no files or URLs given

Logs

/usr/bin/linkchecker -F 'html/var/www/html/linkcheck/index.html' --no-status --ignore-url='^https?://twitter.com/' --ignore-url=print$ --ignore-url='^mailto:' --ignore-url='https?://scholar.google.com(.au)?/.*' --user-agent='Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101 Firefox/38.0 SeaMonkey/2.35' --check-extern https://trustworthy.systems/publications
Read the documentation at https://linkchecker.github.io/linkchecker/
Write comments and bugs to https://github.com/linkchecker/linkchecker/issues

Start checking at 2023-05-10 11:11:43+011

URL `https://thesis.cse.unsw.edu.au/search?search_query=heiser&search_by=Supervisor'
Name `official list'
Parent URL https://trustworthy.systems/students/theses, line 221, col 4
Real URL https://thesis.cse.unsw.edu.au/search?search_query=heiser&search_by=Supervisor
Check time 1.038 seconds
Size 145B
Result Error: 404 Not Found

URL `http://dx.doi.org/10.1002/%28SICI%291099-159X%28199611/12%294:6%3C399::AID-PIP148%3E3.0.CO;2-4'
Parent URL https://trustworthy.systems/people/?cn=Gernot%20Heiser, line 2246, col 1
Real URL https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199611/12)4:6%3C399::AID-PIP148%3E3.0.CO;2-4
Check time 1.304 seconds
Info Redirected to `https://dx.doi.org/10.1002/(SICI)1099-159X(199611/12)4%3A6%3C399%3A%3AAID-PIP148%3E3.0.CO;2-4'.
Redirected to
`https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199611/12)4:6%3C399::AID-PIP148%3E3.0.CO;2-4'.
Result Error: 403 Forbidden

URL `http://dx.doi.org/10.1002/%28SICI%291099-159X%28199609/10%294:5%3C355::AID-PIP145%3E3.0.CO;2-X'
Parent URL https://trustworthy.systems/people/?cn=Gernot%20Heiser, line 2270, col 1
Real URL https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199609/10)4:5%3C355::AID-PIP145%3E3.0.CO;2-X
Check time 1.131 seconds
Info Redirected to `https://dx.doi.org/10.1002/(SICI)1099-159X(199609/10)4%3A5%3C355%3A%3AAID-PIP145%3E3.0.CO;2-X'.
Redirected to
`https://onlinelibrary.wiley.com/doi/10.1002/(SICI)1099-159X(199609/10)4:5%3C355::AID-PIP145%3E3.0.CO;2-X'.
Result Error: 403 Forbidden

URL `http://dx.doi.org/10.1002/pip.4670020103'
Parent URL https://trustworthy.systems/people/?cn=Gernot%20Heiser, line 2426, col 1
Real URL https://onlinelibrary.wiley.com/doi/10.1002/pip.4670020103
Check time 1.287 seconds
Info Redirected to `https://dx.doi.org/10.1002/pip.4670020103'.
Redirected to
`https://onlinelibrary.wiley.com/doi/10.1002/pip.4670020103'.
Result Error: 403 Forbidden

URL `http://onlinelibrary.wiley.com/doi/10.1002/cpe.597/abstract'
Parent URL https://trustworthy.systems/people/?cn=Gerwin%20Klein, line 1323, col 1
Real URL https://onlinelibrary.wiley.com/doi/10.1002/cpe.597/abstract
Check time 2.509 seconds
Info Redirected to `https://onlinelibrary.wiley.com/doi/10.1002/cpe.597/abstract'.
Result Error: 403 Forbidden

Other notes

There are quite a few Python libraries out there that purport to bypass Cloudflare's protection.

mind-bending-forks · 2023-09-19T15:20:17Z

I came here to report what I believe to be fundamentally the same problem, which is not specific to Cloudflare. (Happy to open a separate issue if it is considered different.)

The documents I am checking may contain links to pages on websites that are ordinarily only accessible when logged in. This is expected. When such links are visited without being logged in (as is the case when accessed by linkchecker), the website returns 403 Forbidden and presents information about how to log in or redirects to a login page. If such links are visited by a human and the human logs in with credentials known to them, they are likely to be taken to the original URL as a logged in user. In that sense, the links are valid. I don't know what the URLs will be in advance, nor which ones will be behind a login, so I can't define a rule to exclude the affected URLs. I just expect that there will be URLs of that type.

I wish to configure LinkChecker to accept URLs that reach a valid server that subsequently responds with 403 Forbidden, rather than treating them as an error. Returning a warning, rather than an error, would be fine.

How can I instruct LinkChecker to not report an error when the server responds with 403 Forbidden?

Thanks.

cjmayo · 2023-09-19T18:53:44Z

How can I instruct LinkChecker to not report an error when the server responds with 403 Forbidden?

Same status code but a different question, I suspect. Fortunately so, because there is a solution: the linkcheckerrc ignoreerrors setting, in the output section:

https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html#url-checking-results

Haven't tried it for this. A URL regular expression is required, but a match-anything .* will probably work (plus a message regular expression, of course, to match just a 403).
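
A minimal sketch of such a rule, assuming the `URL_REGEX [MESSAGE_REGEX]` line format described in the linkcheckerrc documentation linked above. Like the suggestion itself, it is untested here; the message pattern is a guess based on the "403 Forbidden" text in the results reported earlier in this thread.

```ini
[output]
# One rule per line: a URL regex, then an optional message regex.
# ".*" matches every URL; "^403 Forbidden" restricts the rule to 403 results.
ignoreerrors=
  .* ^403 Forbidden
```

If the message regex is left off, the documentation indicates that all errors for matching URLs are ignored, which would be far too broad with a `.*` URL pattern.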

mind-bending-forks · 2023-09-20T21:53:37Z

Thank you for this @cjmayo.

I had to overcome the hurdle that the ignoreerrors config value is only supported from LinkChecker 10.2.0 onwards, but once I'd done that, this appears to work well.

On further investigation, I discovered that one of the URLs returning 403 Forbidden for me (an institution page on ResearchGate) is affected by the Cloudflare blocking problem. As has been requested here, it would be advantageous to validate the response that a normal user would be presented with when visiting the URL, rather than simply masking the problem as the ignoreerrors approach does. I'm not sure how Cloudflare decides to block the request. Overriding the user agent with one from an up-to-date browser did not work.
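
One way to at least tell the two kinds of 403 apart is to look at the response headers. The sketch below is not part of LinkChecker, and `classify_403` is a hypothetical helper. It assumes that Cloudflare marks challenge responses with a `cf-mitigated: challenge` header and that responses served through Cloudflare carry `Server: cloudflare` or a `CF-RAY` header. Both are heuristics that may not hold for every site.

```python
from typing import Mapping


def classify_403(headers: Mapping[str, str]) -> str:
    """Guess why a 403 response was returned, from its headers alone.

    Hypothetical helper, not a LinkChecker API. The header names are
    assumptions about Cloudflare's behaviour.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    h = {name.lower(): value.strip().lower() for name, value in headers.items()}

    # Assumed marker for a Cloudflare challenge page.
    if h.get("cf-mitigated") == "challenge":
        return "cloudflare-challenge"

    # Served via Cloudflare, but the 403 may come from the origin site
    # (for example, a page that requires login).
    if h.get("server") == "cloudflare" or "cf-ray" in h:
        return "behind-cloudflare"

    return "other"
```

A checker could then downgrade only the "cloudflare-challenge" case to a warning, while still reporting other 403 responses as errors.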

This was referenced Jan 17, 2024

feature request(option): return exit(1) if links from $n domains has errors #787

Open

feature request(plugin): requests from another location and using another tools #790

Open