When a robots.txt file that includes a BOM is encountered, not all of its rules are respected. This is because the BOM is included in the content passed to protego: the user-agent line that carries the BOM is not a valid user-agent, so its rule group is ignored.
One could argue that protego should handle this, but it seems more appropriate to pass only the content without the BOM to protego.
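The effect can be demonstrated with plain Python, independent of any library (a minimal sketch; the bytes are the UTF-8 BOM followed by a disallow-all robots.txt):

```python
# A UTF-8 robots.txt that begins with the byte-order mark EF BB BF.
raw = b"\xef\xbb\xbfUser-agent: *\nDisallow: /\n"

# A plain utf-8 decode keeps the BOM as U+FEFF, so the first line
# becomes "\ufeffUser-agent: *" -- not a valid robots.txt directive.
naive = raw.decode("utf-8")
assert naive.startswith("\ufeff")

# Decoding with "utf-8-sig" strips the BOM, so the directive parses.
clean = raw.decode("utf-8-sig")
assert clean.startswith("User-agent: *")
```

Stripping the BOM before handing the content to protego (e.g. via `utf-8-sig` decoding) would make the user-agent line valid again.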
Steps to Reproduce
Execute the server.py below
Execute the spider.py below
Expected behavior:
No pages should be crawled as they should be blocked by robots.txt.
Actual behavior:
A page is crawled because the robots.txt rules are ignored.
Also, while researching this bug, I came across this issue, which has since been resolved. In testing, however, I found that links weren't being extracted properly when using the example given there, which mine is based on.
That could have been user error; I didn't dig into it or create an MRE since it wasn't my concern at the moment. But it can be replicated by serving content with random links, using the server supplied in that issue, and changing the spider to parse links.
Versions
Scrapy : 2.9.0
lxml : 4.9.2.0
libxml2 : 2.9.14
cssselect : 1.2.0
parsel : 1.8.1
w3lib : 2.1.1
Twisted : 22.10.0
Python : 3.10.13 (main, Aug 24 2023, 22:48:59) [Clang 14.0.3 (clang-1403.0.22.14.1)]
pyOpenSSL : 23.2.0 (OpenSSL 3.1.1 30 May 2023)
cryptography : 41.0.1
Platform : macOS-14.1.2-x86_64-i386-64bit
Additional context
server.py
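The attached server.py is not shown here; the following is a hypothetical reconstruction of what it could look like: a minimal HTTP server whose robots.txt starts with a UTF-8 BOM and disallows all crawling.

```python
# Sketch of a server.py-style test server (an assumption, not the
# original attachment): serves a BOM-prefixed, disallow-all robots.txt.
from http.server import BaseHTTPRequestHandler, HTTPServer

ROBOTS = b"\xef\xbb\xbfUser-agent: *\nDisallow: /\n"
PAGE = b'<html><body><a href="/page">page</a></body></html>'

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = ROBOTS if self.path == "/robots.txt" else PAGE
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # silence per-request logging
        pass

# Bind an ephemeral port; call server.serve_forever() to run it.
server = HTTPServer(("127.0.0.1", 0), Handler)
print("serving on port", server.server_address[1])
```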
spider.py