Add Anti-Bot Detection middleware #7349
base: master
Are you sure you want to change the base?
Conversation
This new middleware is in two parts: 1. Middleware to detect use of anti-bot methods such as Cloudflare Captchas, and various other Web Application Firewalls (WAF) implementing the likes of TLS fingerprinting, browser JS challenges, etc. 2. Middleware to stop a spider from generating more requests, leading to quick (but still deferred) stopping of a crawl once anti-bot methods have been detected. Spiders can implement a new attribute "anti_bot_methods" which is a list of AntiBotMethod as specified in locations/middlewares/anti_bot_detection.py AntiBotDetectionMiddleware will automatically set the attribute if anti-bot method(s) are discovered. If a particular anti-bot method is unexpected or if Zyte API is not available to use for bypassing the anti-bot method, AntiBotStopCrawl will automatically shut down the crawl in a graceful way, generally after all requests already sent have been received and parsed.
Do we need to add to stats as well as raising an INFO log? For example:
This seems sensible. Thinking about your auto-detection branch, could this annotate spiders to automatically use a proxy? And I guess a second question: should it?
I think this anti-bot detection branch should be integrated with the automatic spider generation branch. I also think it is worthwhile to set the
@iandees or @Cj-Malone, any thoughts? It seems like there's a plan for the next phase of this, which I like a lot. The annotation of this pipeline changing as the Zyte API plays whack-a-mole is a bit of a maintenance papercut, where keeping it in sync with the code is tricky or requires humans...
AZURE_WAF = {"name": "Azure WAF", "zyte_bypassable": True}
CLOUDFLARE = {"name": "Cloudflare", "zyte_bypassable": True}
DATADOME = {"name": "DataDome", "zyte_bypassable": True}
HUMAN = {"name": "HUMAN", "zyte_bypassable": True}
IMPERVA = {"name": "Imperva", "zyte_bypassable": True}
QRATOR = {"name": "Qrator", "zyte_bypassable": True}
If it's easy, can you add comments with links to documentation about these systems?
Can do, it's easy :)
Friendly nudge on this.
I've added some links to documentation for each anti-bot product, or where this wasn't found, product brochures that describe the product. Let me know if more information is wanted and I can add it.
@staticmethod
def decode_http_header_value(raw_header_value: bytes) -> str:
    # It's not quite so simple to decode HTTP header values.
Why not? Have you seen non-unicode values in the headers that you're inspecting?
I haven't looked (I've just relied on clients automatically decoding header values), but I imagine UTF-8 is almost universal, and other encodings are esoteric enough to ignore. I had hoped there was some Scrapy method that handles decoding correctly, but I haven't found it yet.
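For illustration, a minimal sketch of defensive header decoding: try UTF-8 first (by far the most common case in practice), then fall back to Latin-1, which maps every possible byte and so never raises. The function name mirrors the decode_http_header_value method under discussion, but this is a standalone example, not the PR's implementation.

```python
# Sketch only: UTF-8 with a Latin-1 fallback. Historically, HTTP header
# field values were interpreted as ISO-8859-1 (Latin-1); modern specs
# treat them as opaque octets, so a fallback that cannot fail is handy.
def decode_http_header_value(raw_header_value: bytes) -> str:
    try:
        return raw_header_value.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte is valid Latin-1, so this branch never raises.
        return raw_header_value.decode("latin-1")


print(decode_http_header_value(b"cloudflare"))  # cloudflare
print(decode_http_header_value(b"caf\xe9"))     # Latin-1 fallback kicks in
```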
# 2. https://docs.scrapy.org/en/latest/topics/settings.html#downloader-middlewares-base
# It is probably not a good idea to reorder these
# default middleware orders.
httpcm = HttpCompressionMiddleware.from_crawler(spider.crawler)
This probably shouldn't be instantiated on every request.
This only triggers on an HTTP 403 error though, which would typically result in a spider terminating its crawl. I don't think we'll see a performance impact, because we wouldn't expect more than a few HTTP 403 errors (more than one because multiple requests could have been fired at once, and perhaps ten HTTP 403 responses are still being sent back to Scrapy for handling).

AFAIK, Azure WAF sometimes only triggers after X requests in Y seconds. So the first 20 requests could be fine, then request 21 triggers an HTTP 403 error and a CAPTCHA challenge from Azure WAF.
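Even so, a "build once, reuse" pattern would sidestep the concern entirely. A hypothetical sketch of caching the helper middleware instance instead of constructing it per 403 response; the class and method names are illustrative, and the construction step is stubbed so the sketch runs without a Scrapy install.

```python
# Sketch: cache the helper middleware instance on first use, so repeated
# 403 responses reuse it instead of rebuilding it each time.
class AntiBotDetectionMiddleware:
    def __init__(self):
        self._httpcm = None  # lazily built helper middleware

    def _get_httpcm(self, crawler):
        if self._httpcm is None:
            self._httpcm = self._build_httpcm(crawler)
        return self._httpcm

    def _build_httpcm(self, crawler):
        # Stand-in for HttpCompressionMiddleware.from_crawler(crawler),
        # stubbed here so the example is self-contained.
        return object()


mw = AntiBotDetectionMiddleware()
first = mw._get_httpcm(crawler=None)
second = mw._get_httpcm(crawler=None)
print(first is second)  # True: the same cached instance is reused
```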
# Cloudflare bot protection documentation:
# https://developers.cloudflare.com/bots/
CLOUDFLARE = {"name": "Cloudflare", "zyte_bypassable": True}
How true is this? We already hit a lot of CF sites with no issues, and we log the stats too, IIRC. I don't know the CF terminology, but there are different levels: we handle some fine, Playwright handles some, but does Zyte work for them all? Or are there limits there too? Does it matter?
Generally, I've only deemed something to be "anti-bot" if a CAPTCHA or JavaScript challenge is required to be completed by the client. Cloudflare and many other services have confusing marketing, but some will give the company a choice of implementing IP address geographic restrictions, data centre IP address blocking, CAPTCHA challenges, JavaScript challenges, rate limiting, etc.

I don't believe Zyte have a status page where they advertise their current/live status of where they are in the arms race with anti-bot services. It therefore seems like we'd have to take a trial-and-error approach (perhaps even before each crawl commences, if Scrapy allows this) to populate the zyte_bypassable values.
if response.status == 403 and server_utf8.upper() == "MICROSOFT-AZURE-APPLICATION-GATEWAY/V2":
    self.add_anti_bot_method(AntiBotMethods.AZURE_WAF, spider)
if cookies := response.headers.getlist("set-cookie"): |
I think Scrapy is having a discussion about making cookies more accessible; I'll try and find the link tomorrow. We may want to be part of that conversation.
Is it scrapy/scrapy#5431 ?
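To make the set-cookie inspection concrete, here's an illustrative sketch of spotting well-known anti-bot cookies in raw Set-Cookie header values, along the lines of the getlist("set-cookie") check above. The cookie-name-to-vendor mapping is an assumption for the example, not a list taken from the PR.

```python
# Sketch: map leading cookie names to the anti-bot vendor that sets them.
# __cf_bm / cf_clearance (Cloudflare) and datadome (DataDome) are
# commonly seen names, but treat this table as illustrative.
ANTI_BOT_COOKIE_PREFIXES = {
    "__cf_bm": "Cloudflare",
    "cf_clearance": "Cloudflare",
    "datadome": "DataDome",
}


def detect_anti_bot_cookies(set_cookie_values):
    """Return the set of vendors whose cookies appear in raw header bytes."""
    vendors = set()
    for raw in set_cookie_values:
        # A Set-Cookie value starts with "name=value"; take the name part.
        cookie_name = raw.split(b"=", 1)[0].strip().decode("latin-1").lower()
        for prefix, vendor in ANTI_BOT_COOKIE_PREFIXES.items():
            if cookie_name.startswith(prefix):
                vendors.add(vendor)
    return vendors


print(detect_anti_bot_cookies([b"__cf_bm=abc; Path=/", b"sessionid=xyz"]))
# {'Cloudflare'}
```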