
Add Anti-Bot Detection middleware #7349

Open · wants to merge 3 commits into master

Conversation

davidhicks (Member)

This new middleware is in two parts:

  1. Middleware to detect use of anti-bot methods such as Cloudflare Captchas, and various other Web Application Firewalls (WAF) implementing the likes of TLS fingerprinting, browser JS challenges, etc.

  2. Middleware to stop a spider from generating more requests, leading to quick (but still deferred) stopping of a crawl once anti-bot methods have been detected.

Spiders can implement a new attribute "anti_bot_methods" which is a list of AntiBotMethod as specified in
locations/middlewares/anti_bot_detection.py

AntiBotDetectionMiddleware will automatically set the attribute if anti-bot method(s) are discovered.

If a particular anti-bot method is unexpected or if Zyte API is not available to use for bypassing the anti-bot method, AntiBotStopCrawl will automatically shut down the crawl in a graceful way, generally after all requests already sent have been received and parsed.
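A rough sketch of how a spider might declare this attribute (hypothetical class and spider shapes, mirroring the description above; the real `AntiBotMethod` definitions live in `locations/middlewares/anti_bot_detection.py`):

```python
# Hypothetical sketch only; the real definitions live in
# locations/middlewares/anti_bot_detection.py.
class AntiBotMethods:
    CLOUDFLARE = {"name": "Cloudflare", "zyte_bypassable": True}
    AZURE_WAF = {"name": "Azure WAF", "zyte_bypassable": True}


class ExampleSpider:
    name = "example"
    # Declared statically when the target's protection is already known;
    # otherwise AntiBotDetectionMiddleware sets this attribute at runtime
    # when it discovers an anti-bot method.
    anti_bot_methods = [AntiBotMethods.CLOUDFLARE]
```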

davidhicks and others added 2 commits February 23, 2024 01:39
@davidhicks (Member Author)

Do we need to add to stats as well as emitting an INFO log?

For example:

...
atp/antibot/azure_waf: True
atp/antibot/human: True
...
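A minimal sketch of what recording such stats might look like (the helper name and the tiny stand-in stats collector are assumptions for illustration; in Scrapy the real collector is `crawler.stats`, which exposes `set_value`):

```python
def record_anti_bot_stat(stats, method_name: str) -> None:
    # Record a detected anti-bot method under an atp/antibot/* key, in
    # addition to the INFO log. `stats` is any object exposing
    # set_value(key, value), such as Scrapy's crawler.stats.
    key = "atp/antibot/" + method_name.lower().replace(" ", "_")
    stats.set_value(key, True)


class DictStats:
    """Tiny stand-in for a stats collector, for illustration only."""

    def __init__(self):
        self.values = {}

    def set_value(self, key, value):
        self.values[key] = value
```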

@CloCkWeRX (Contributor)

This seems sensible. Thinking about your auto detection branch, could this annotate spiders to automatically use a proxy? And I guess second question, should it?

@davidhicks (Member Author)

davidhicks commented Feb 22, 2024

> This seems sensible. Thinking about your auto detection branch, could this annotate spiders to automatically use a proxy? And I guess second question, should it?

I think this anti-bot detection branch should be integrated with the automatic spider generation branch's sf command, the existing sd command, etc., so that these commands can be seamlessly executed via Zyte API if anti-bot methods are detected and the user has a Zyte account configured.

I also think it is worthwhile to set the anti_bot_methods attribute wherever possible (not just rely on the middleware to set it) because if anti_bot_methods is specified, and one of those methods is not deemed bypassable or Zyte API is not available, the spider can just be closed without making any requests. There was mention in earlier issues/PRs of an intent to "archive" spiders rather than delete them if the brand/operator implements an anti-bot method that can't be bypassed with current means.

@CloCkWeRX (Contributor)

@iandees or @Cj-Malone any thoughts?

Seems like there's a plan for the next phase of this which I like a lot.

The annotation of this pipeline changing as the Zyte API plays whack-a-mole is a bit of a maintenance papercut, where keeping it in sync with the code is tricky or requires humans...
But on the other hand we get indicators for blocking methods without having a human involved. So the maintenance burden overall is reduced.

Comment on lines 15 to 20
AZURE_WAF = {"name": "Azure WAF", "zyte_bypassable": True}
CLOUDFLARE = {"name": "Cloudflare", "zyte_bypassable": True}
DATADOME = {"name": "DataDome", "zyte_bypassable": True}
HUMAN = {"name": "HUMAN", "zyte_bypassable": True}
IMPERVA = {"name": "Imperva", "zyte_bypassable": True}
QRATOR = {"name": "Qrator", "zyte_bypassable": True}
Member:

If it's easy, can you add comments with links to documentation about these systems?

Member Author:

Can do, it's easy :)

Member:

Friendly nudge on this.

Member Author:

I've added links to documentation for each anti-bot product or, where documentation wasn't found, to product brochures that describe the product. Let me know if more information is wanted and I can add it.


@staticmethod
def decode_http_header_value(raw_header_value: bytes) -> str:
# It's not quite so simple to decode HTTP header values.
Member:

Why not? Have you seen non-unicode values in the headers that you're inspecting?

Member Author:

I haven't looked (I've just relied on clients automatically decoding header values), but I imagine UTF-8 is almost universal, and other encodings are esoteric enough to ignore. I had hoped there was some Scrapy method that handles decoding correctly, but I haven't found it yet.
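One hedged sketch of such a helper along those lines: try UTF-8 first, then fall back to Latin-1, which is the encoding the HTTP specifications nominally assume for header field values and which can decode any byte sequence:

```python
def decode_http_header_value(raw_header_value: bytes) -> str:
    # Header field values are nominally Latin-1 per the HTTP RFCs, but
    # UTF-8 is far more common in practice, so try it first. Latin-1 can
    # decode any byte sequence, so the fallback never raises.
    try:
        return raw_header_value.decode("utf-8")
    except UnicodeDecodeError:
        return raw_header_value.decode("iso-8859-1")
```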

# 2. https://docs.scrapy.org/en/latest/topics/settings.html#downloader-middlewares-base
# It is probably not a good idea to reorder these
# default middleware orders.
httpcm = HttpCompressionMiddleware.from_crawler(spider.crawler)
Contributor:

This probably shouldn't be instantiated on every request.

Member Author:

This only triggers on an HTTP 403 error though, which would typically result in a spider terminating its crawl. I don't think we'll see a performance impact, because we wouldn't expect more than a few HTTP 403 errors (more than one is possible because multiple requests could have been in flight at once, so perhaps around 10 HTTP 403 responses are still being sent back to Scrapy for handling).

AFAIK Azure WAF sometimes only triggers after X requests in Y seconds, so the first 20 requests could be fine, then request 21 triggers an HTTP 403 error and a CAPTCHA challenge from Azure WAF.
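If per-request instantiation ever did matter, one option would be to cache the instance lazily, e.g. with `functools.cached_property`, so it is built on the first 403 and reused afterwards. A sketch only; `_FakeCompressionMiddleware` here is a stand-in for Scrapy's `HttpCompressionMiddleware`, and the class name is hypothetical:

```python
from functools import cached_property


class _FakeCompressionMiddleware:
    """Stand-in for scrapy's HttpCompressionMiddleware, for illustration."""

    @classmethod
    def from_crawler(cls, crawler):
        return cls()


class AntiBotDetectionMiddlewareSketch:
    def __init__(self, crawler):
        self.crawler = crawler

    @cached_property
    def httpcm(self):
        # Built on first access (i.e. the first HTTP 403 seen) and then
        # cached, rather than rebuilt inside the response handler each time.
        return _FakeCompressionMiddleware.from_crawler(self.crawler)
```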


# Cloudflare bot protection documentation:
# https://developers.cloudflare.com/bots/
CLOUDFLARE = {"name": "Cloudflare", "zyte_bypassable": True}
Contributor:

How true is this? We hit a lot of CF sites already with no issues, we log the stats too iirc. I don't know the CF terminology, but there are different levels, we handle some fine, playwright handles some, but does Zyte work for them all? Or are there limits there too? Does it matter?

Member Author:

Generally I've only deemed something to be "anti-bot" if there is a CAPTCHA or JavaScript challenge required to be completed by the client. Cloudflare and many other services have confusing marketing, but some will provide the company with a choice of implementing IP address geographic restrictions, data centre IP address blocking, CAPTCHA challenges, JavaScript challenges, rate limiting, etc.

I don't believe Zyte has a status page advertising the current/live state of where they are in the arms race with anti-bot services. It therefore seems like we'd have to take a trial-and-error approach (perhaps even before each crawl commences, if Scrapy allows this) to populate the zyte_bypassable values.

if response.status == 403 and server_utf8.upper() == "MICROSOFT-AZURE-APPLICATION-GATEWAY/V2":
    self.add_anti_bot_method(AntiBotMethods.AZURE_WAF, spider)

if cookies := response.headers.getlist("set-cookie"):
Contributor:

I think scrapy is having a discussion about making cookies more accessible, I'll try and find the link tomorrow. We may want to be part of that conversation


4 participants