Add Anti-Bot Detection middleware #7349
base: master
Are you sure you want to change the base?
Conversation
This new middleware is in two parts: 1. Middleware to detect use of anti-bot methods such as Cloudflare Captchas, and various other Web Application Firewalls (WAF) implementing the likes of TLS fingerprinting, browser JS challenges, etc. 2. Middleware to stop a spider from generating more requests, leading to quick (but still deferred) stopping of a crawl once anti-bot methods have been detected. Spiders can implement a new attribute "anti_bot_methods" which is a list of AntiBotMethod as specified in locations/middlewares/anti_bot_detection.py AntiBotDetectionMiddleware will automatically set the attribute if anti-bot method(s) are discovered. If a particular anti-bot method is unexpected or if Zyte API is not available to use for bypassing the anti-bot method, AntiBotStopCrawl will automatically shut down the crawl in a graceful way, generally after all requests already sent have been received and parsed.
Do we need to add to stats as well as raising an INFO log? For example:
This seems sensible. Thinking about your auto-detection branch, could this annotate spiders to automatically use a proxy? And I guess a second question: should it?
I think this anti-bot detection branch should be integrated with the automatic spider generation branch. I also think it is worthwhile to set the
@iandees or @Cj-Malone, any thoughts? It seems like there's a plan for the next phase of this, which I like a lot. The annotation of this pipeline changing as the Zyte API plays whack-a-mole is a bit of a maintenance papercut, where keeping it in sync with the code is tricky or requires humans...
AZURE_WAF = {"name": "Azure WAF", "zyte_bypassable": True}
CLOUDFLARE = {"name": "Cloudflare", "zyte_bypassable": True}
DATADOME = {"name": "DataDome", "zyte_bypassable": True}
HUMAN = {"name": "HUMAN", "zyte_bypassable": True}
IMPERVA = {"name": "Imperva", "zyte_bypassable": True}
QRATOR = {"name": "Qrator", "zyte_bypassable": True}
If it's easy, can you add comments with links to documentation about these systems?
Can do, it's easy :)
Friendly nudge on this.
I've added some links to documentation for each anti-bot product, or where this wasn't found, product brochures that describe the product. Let me know if more information is wanted and I can add it.
@staticmethod
def decode_http_header_value(raw_header_value: bytes) -> str:
    # It's not quite so simple to decode HTTP header values.
Why not? Have you seen non-unicode values in the headers that you're inspecting?
I haven't looked (I've just relied on clients automatically decoding header values), but I imagine UTF-8 is almost universal, and other encodings are esoteric enough to ignore. I had hoped there was some Scrapy method that handles decoding correctly, but I haven't found it yet.
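For illustration, a minimal sketch of defensive header decoding: try UTF-8 first (by far the most common case in practice), then fall back to Latin-1, which maps every possible byte and so never raises. The function name mirrors the decode_http_header_value method under discussion, but this is a standalone example, not the PR's implementation.

```python
# Sketch only: UTF-8 with a Latin-1 fallback. Historically, HTTP header
# field values were interpreted as ISO-8859-1 (Latin-1); modern specs
# treat them as opaque octets, so a fallback that cannot fail is handy.
def decode_http_header_value(raw_header_value: bytes) -> str:
    try:
        return raw_header_value.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte is valid Latin-1, so this branch never raises.
        return raw_header_value.decode("latin-1")


print(decode_http_header_value(b"cloudflare"))  # cloudflare
print(decode_http_header_value(b"caf\xe9"))     # Latin-1 fallback kicks in
```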
# 2. https://docs.scrapy.org/en/latest/topics/settings.html#downloader-middlewares-base
# It is probably not a good idea to reorder these
# default middleware orders.
httpcm = HttpCompressionMiddleware.from_crawler(spider.crawler)
This probably shouldn't be instantiated on every request.
This only triggers on an HTTP 403 error though, which would typically result in a spider terminating its crawl. I don't think we'll see a performance impact, because we wouldn't expect more than a few HTTP 403 errors (more than one because multiple requests could have been fired at once, and perhaps ten HTTP 403 responses are still being sent back to Scrapy for handling).

AFAIK, Azure WAF sometimes only triggers after X requests in Y seconds. So the first 20 requests could be fine, then request 21 triggers an HTTP 403 error and a CAPTCHA challenge from Azure WAF.
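Even so, a "build once, reuse" pattern would sidestep the concern entirely. A hypothetical sketch of caching the helper middleware instance instead of constructing it per 403 response; the class and method names are illustrative, and the construction step is stubbed so the sketch runs without a Scrapy install.

```python
# Sketch: cache the helper middleware instance on first use, so repeated
# 403 responses reuse it instead of rebuilding it each time.
class AntiBotDetectionMiddleware:
    def __init__(self):
        self._httpcm = None  # lazily built helper middleware

    def _get_httpcm(self, crawler):
        if self._httpcm is None:
            self._httpcm = self._build_httpcm(crawler)
        return self._httpcm

    def _build_httpcm(self, crawler):
        # Stand-in for HttpCompressionMiddleware.from_crawler(crawler),
        # stubbed here so the example is self-contained.
        return object()


mw = AntiBotDetectionMiddleware()
first = mw._get_httpcm(crawler=None)
second = mw._get_httpcm(crawler=None)
print(first is second)  # True: the same cached instance is reused
```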
# Cloudflare bot protection documentation:
# https://developers.cloudflare.com/bots/
CLOUDFLARE = {"name": "Cloudflare", "zyte_bypassable": True}
How true is this? We already hit a lot of CF sites with no issues, and we log the stats too, IIRC. I don't know the CF terminology, but there are different levels: we handle some fine, Playwright handles some, but does Zyte work for them all? Or are there limits there too? Does it matter?
Generally, I've only deemed something to be "anti-bot" if a CAPTCHA or JavaScript challenge is required to be completed by the client. Cloudflare and many other services have confusing marketing, but some will give the company a choice of implementing IP address geographic restrictions, data centre IP address blocking, CAPTCHA challenges, JavaScript challenges, rate limiting, etc.

I don't believe Zyte have a status page where they advertise their current/live status of where they are in the arms race with anti-bot services. It therefore seems like we'd have to take a trial-and-error approach (perhaps even before each crawl commences, if Scrapy allows this) to populate the zyte_bypassable values.
if response.status == 403 and server_utf8.upper() == "MICROSOFT-AZURE-APPLICATION-GATEWAY/V2":
    self.add_anti_bot_method(AntiBotMethods.AZURE_WAF, spider)
if cookies := response.headers.getlist("set-cookie"): |
I think Scrapy is having a discussion about making cookies more accessible; I'll try and find the link tomorrow. We may want to be part of that conversation.
Is it scrapy/scrapy#5431 ?
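To make the set-cookie inspection concrete, here's an illustrative sketch of spotting well-known anti-bot cookies in raw Set-Cookie header values, along the lines of the getlist("set-cookie") check above. The cookie-name-to-vendor mapping is an assumption for the example, not a list taken from the PR.

```python
# Sketch: map leading cookie names to the anti-bot vendor that sets them.
# __cf_bm / cf_clearance (Cloudflare) and datadome (DataDome) are
# commonly seen names, but treat this table as illustrative.
ANTI_BOT_COOKIE_PREFIXES = {
    "__cf_bm": "Cloudflare",
    "cf_clearance": "Cloudflare",
    "datadome": "DataDome",
}


def detect_anti_bot_cookies(set_cookie_values):
    """Return the set of vendors whose cookies appear in raw header bytes."""
    vendors = set()
    for raw in set_cookie_values:
        # A Set-Cookie value starts with "name=value"; take the name part.
        cookie_name = raw.split(b"=", 1)[0].strip().decode("latin-1").lower()
        for prefix, vendor in ANTI_BOT_COOKIE_PREFIXES.items():
            if cookie_name.startswith(prefix):
                vendors.add(vendor)
    return vendors


print(detect_anti_bot_cookies([b"__cf_bm=abc; Path=/", b"sessionid=xyz"]))
# {'Cloudflare'}
```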