
The first rule in a robots.txt with BOM will be ignored #6195

Open
Gidgidonihah opened this issue Jan 4, 2024 · 1 comment

@Gidgidonihah

Description

When a robots.txt that includes a BOM is encountered, not all of its rules are respected. This is because the BOM is included in the content passed to protego: the user-agent line that starts with the BOM is not a valid user-agent line, so it is ignored, and the rule that follows it is dropped.

One could argue that protego should handle that, but it seems more likely that only the content without the BOM should be passed to protego.
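
To illustrate the failure outside of Scrapy, here is a minimal sketch using protego's documented Protego.parse / can_fetch API. The expected outputs reflect the behaviour described above (the BOM-prefixed user-agent line is not recognised); decoding with utf-8-sig is just one possible way to strip the BOM before parsing:

from protego import Protego

# robots.txt body as served in this report, with a UTF-8 BOM prepended.
robots_body = b"\xef\xbb\xbfUser-agent: *\nDisallow: /"

# Parsed as-is, the BOM sticks to the "User-agent" key, so the line is not
# recognised and the Disallow rule that follows it is dropped.
rp = Protego.parse(robots_body.decode("utf-8"))
print(rp.can_fetch("http://0.0.0.0:8000/", "*"))   # expected: True (everything allowed)

# Decoding with utf-8-sig removes the BOM before the text reaches protego.
rp = Protego.parse(robots_body.decode("utf-8-sig"))
print(rp.can_fetch("http://0.0.0.0:8000/", "*"))   # expected: False (Disallow: / honoured)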

Steps to Reproduce

  1. Execute the server.py below
  2. Execute the spider.py below

Expected behavior:

No pages should be crawled as they should be blocked by robots.txt.

Actual behavior:

A page is crawled because the robots.txt rule is ignored.

Versions

Scrapy : 2.9.0
lxml : 4.9.2.0
libxml2 : 2.9.14
cssselect : 1.2.0
parsel : 1.8.1
w3lib : 2.1.1
Twisted : 22.10.0
Python : 3.10.13 (main, Aug 24 2023, 22:48:59) [Clang 14.0.3 (clang-1403.0.22.14.1)]
pyOpenSSL : 23.2.0 (OpenSSL 3.1.1 30 May 2023)
cryptography : 41.0.1
Platform : macOS-14.1.2-x86_64-i386-64bit

Additional context

server.py

import codecs
from http.server import BaseHTTPRequestHandler
from http.server import HTTPServer


class HttpGetHandler(BaseHTTPRequestHandler):
    """Basic handler for bug MRE."""

    def write_robots_txt(self):
        """Write robots.txt content, prefixed with a UTF-8 BOM (this is what triggers the bug)."""
        self.wfile.write(codecs.BOM_UTF8)
        self.wfile.write("User-agent: *\n".encode("utf8"))
        self.wfile.write("Disallow: /".encode("utf8"))

    def write_content(self):
        """Write page content."""
        self.wfile.write(codecs.BOM_UTF8)
        self.wfile.write("<!DOCTYPE html>".encode("utf8"))
        self.wfile.write("Olá,".encode("utf8"))

    def do_GET(self):
        """Handle all GET requests."""
        self.send_response(200)
        self.end_headers()

        if self.path == "/robots.txt":
            return self.write_robots_txt()
        return self.write_content()


if __name__ == "__main__":
    httpd = HTTPServer(("", 8000), HttpGetHandler)
    httpd.serve_forever()

spider.py

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "test"

    start_urls = ["http://0.0.0.0:8000"]
    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "DEPTH_LIMIT": 5,
    }

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.link_extractor = LinkExtractor()

    def parse(self, response):
        """Parse the response and crawl links."""
        self.logger.info("Parsing complete for %s", response.url)
        self.logger.debug({"encoding": response.encoding, "text": response.text})


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(MySpider)
    process.start()
@Gidgidonihah
Author

Also, while researching this bug, I came across this issue, which has since been resolved. However, in testing I found that links weren't being properly extracted when using the example given there, which mine was based on.

It could have been user error, and I didn't dig deeper or create an MRE because that wasn't my concern at the moment, but it can be replicated by serving content with random links using the server supplied in that issue and changing the spider's parse method to follow links:

for link in self.link_extractor.extract_links(response):
    yield scrapy.Request(link.url, callback=self.parse)
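
For reference, a version of MySpider.parse from the spider.py above that both logs the response and follows extracted links would look roughly like this (a sketch combining the two snippets, not tested here):

    def parse(self, response):
        """Log the response and follow every link the LinkExtractor finds."""
        self.logger.info("Parsing complete for %s", response.url)
        self.logger.debug({"encoding": response.encoding, "text": response.text})
        for link in self.link_extractor.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)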

I leave this (mostly unrelated) comment here in case anyone reads this and wants to pick up on that thread as well.
