
The first rule in a robots.txt with BOM will be ignored #6195

Open
Gidgidonihah opened this issue Jan 4, 2024 · 1 comment

@Gidgidonihah

Description

When a robots.txt that includes a BOM is encountered, not all of its rules are respected. This is because the BOM is included in the content passed to protego: the user-agent line that starts with the BOM is not a valid user-agent line, so it is ignored, and the rule that follows it is dropped.

One could argue that protego should handle that, but it seems more likely that only the content without the BOM should be passed to protego.
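
To illustrate the failure outside of Scrapy, here is a minimal sketch using protego's documented Protego.parse / can_fetch API. The expected outputs reflect the behaviour described above (the BOM-prefixed user-agent line is not recognised); decoding with utf-8-sig is just one possible way to strip the BOM before parsing:

from protego import Protego

# robots.txt body as served in this report, with a UTF-8 BOM prepended.
robots_body = b"\xef\xbb\xbfUser-agent: *\nDisallow: /"

# Parsed as-is, the BOM sticks to the "User-agent" key, so the line is not
# recognised and the Disallow rule that follows it is dropped.
rp = Protego.parse(robots_body.decode("utf-8"))
print(rp.can_fetch("http://0.0.0.0:8000/", "*"))   # expected: True (everything allowed)

# Decoding with utf-8-sig removes the BOM before the text reaches protego.
rp = Protego.parse(robots_body.decode("utf-8-sig"))
print(rp.can_fetch("http://0.0.0.0:8000/", "*"))   # expected: False (Disallow: / honoured)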

Steps to Reproduce

  1. Execute the server.py below
  2. Execute the spider.py below

Expected behavior:

No pages should be crawled as they should be blocked by robots.txt.

Actual behavior:

A page is crawled because the robots.txt rule is ignored.

Versions

Scrapy : 2.9.0
lxml : 4.9.2.0
libxml2 : 2.9.14
cssselect : 1.2.0
parsel : 1.8.1
w3lib : 2.1.1
Twisted : 22.10.0
Python : 3.10.13 (main, Aug 24 2023, 22:48:59) [Clang 14.0.3 (clang-1403.0.22.14.1)]
pyOpenSSL : 23.2.0 (OpenSSL 3.1.1 30 May 2023)
cryptography : 41.0.1
Platform : macOS-14.1.2-x86_64-i386-64bit

Additional context

server.py

import codecs
from http.server import BaseHTTPRequestHandler
from http.server import HTTPServer


class HttpGetHandler(BaseHTTPRequestHandler):
    """Basic handler for bug MRE."""

    def write_robots_txt(self):
        """Write robots.txt content, prefixed with a UTF-8 BOM (this is what triggers the bug)."""
        self.wfile.write(codecs.BOM_UTF8)
        self.wfile.write("User-agent: *\n".encode("utf8"))
        self.wfile.write("Disallow: /".encode("utf8"))

    def write_content(self):
        """Write page content."""
        self.wfile.write(codecs.BOM_UTF8)
        self.wfile.write("<!DOCTYPE html>".encode("utf8"))
        self.wfile.write("Olá,".encode("utf8"))

    def do_GET(self):
        """Handle all GET requests."""
        self.send_response(200)
        self.end_headers()

        if self.path == "/robots.txt":
            return self.write_robots_txt()
        return self.write_content()


if __name__ == "__main__":
    httpd = HTTPServer(("", 8000), HttpGetHandler)
    httpd.serve_forever()

spider.py

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "test"

    start_urls = ["http://0.0.0.0:8000"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "DEPTH_LIMIT": 5,
    }

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.link_extractor = LinkExtractor()

    def parse(self, response):
        """Parse the response and crawl links."""
        self.logger.info("Parsing complete for %s", response.url)
        self.logger.debug({"encoding": response.encoding, "text": response.text})


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()
@Gidgidonihah
Author

Also, while researching this bug, I came across this issue, which has since been resolved. However, in testing I found that links weren't being properly extracted when using the example given there, which mine was based on.

It could have been user error, and I didn't dig deeper or create an MRE because that wasn't my concern at the moment, but it can be replicated by serving content with random links using the server supplied in that issue and changing the spider's parse method to follow links:

for link in self.link_extractor.extract_links(response):
    yield scrapy.Request(link.url, callback=self.parse)
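
For reference, a version of MySpider.parse from the spider.py above that both logs the response and follows extracted links would look roughly like this (a sketch combining the two snippets, not tested here):

    def parse(self, response):
        """Log the response and follow every link the LinkExtractor finds."""
        self.logger.info("Parsing complete for %s", response.url)
        self.logger.debug({"encoding": response.encoding, "text": response.text})
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)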

I leave this (mostly unrelated) comment here in case anyone reads this and wants to pick up on that thread as well.
