SitemapSpider will ignore sitemap with URLs like https://website.com/filename.xml?from=7155352010944&to=7482320519360 #6293
Comments
It might be worth it to find out why the earlier |
I am not able to reproduce this locally on plain Scrapy v2.11.0.

script.py

import scrapy
from scrapy.crawler import CrawlerProcess as Cp


class SitemapTestSpider(scrapy.spiders.sitemap.SitemapSpider):
    name = "quotes"
    custom_settings = {"DOWNLOAD_DELAY": 1}
    sitemap_urls = [
        'https://hwpartstore.com/sitemap_products_8.xml?from=7155352010944&to=7482320519360',
        'https://tornadoparts.com/sitemap_products_1.xml?from=1734178111555&to=1734707675203',
        'https://www.mycnhistore.com/medias/sitemap-product-newhollandag-us-en-main.xml?context=bWFzdGVyfHJvb3R8NTA4OTY0MXx0ZXh0L3htbHxhREppTDJneVpTODVOVGc1TVRVMk9ETTVORFUwTDNOcGRHVnRZWEF0Y0hKdlpIVmpkQzF1Wlhkb2IyeHNZVzVrWVdjdGRYTXRaVzR0YldGcGJpNTRiV3d8ZWM0NDFlMDgwZmYzNTlkYjkzZWIwNGFhYzM0NGNlOWFmMjUzYjBhZWFjYTY3MDg5YjY5NWY1OTE2ODM2MTJjYQ'
    ]

    def _get_sitemap_body(self, response):
        # self.logger.info(f"data for {response.url}")
        # headers = '\n\t\t'.join([f"{k}:{v}" for k,v in response.headers.items()])
        # self.logger.info(f"{headers}")
        self.logger.info(
            f"{'!!!' if isinstance(response, scrapy.http.XmlResponse) else ''}"
            f"{response.url} \n identified as {response.__class__} ")


if __name__ == "__main__":
    proc = Cp()
    proc.crawl(SitemapTestSpider)
    proc.start()

log output
In this case response objects from all mentioned urls that reached to scrapy/scrapy/spiders/sitemap.py Lines 88 to 93 in 2f1d345
before scrapy/scrapy/spiders/sitemap.py Lines 117 to 118 in 2f1d345
Originally - scrapy create scrapy/scrapy/downloadermiddlewares/httpcompression.py Lines 138 to 150 in 02b97f9
Is it possible that the original problem happens on an older Scrapy version or with some SitemapSpider methods overridden? @seagatesoft
Description
Some sitemaps have URLs with query parameters, examples:
The current implementation of _get_sitemap_body fails to detect those URLs as sitemaps because it performs the following check:
if response.url.endswith(".xml") or response.url.endswith(".xml.gz"):
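To make the failure mode concrete, here is a small standard-library sketch (no Scrapy needed) that applies the quoted suffix check to the URL from the issue title and compares it with a check on the path component only. The variable names are illustrative and not taken from Scrapy.

```python
from urllib.parse import urlsplit

# URL from the issue title: the ".xml" suffix is followed by a query string.
url = "https://website.com/filename.xml?from=7155352010944&to=7482320519360"

# The quoted check looks at the whole URL, query string included.
matches_full_url = url.endswith(".xml") or url.endswith(".xml.gz")

# Looking only at the path component ignores the query string.
path = urlsplit(url).path
matches_path = path.endswith((".xml", ".xml.gz"))

print(matches_full_url, matches_path)  # prints: False True
```

The full URL ends with `to=7482320519360`, so the suffix test is false, while the parsed path is `/filename.xml`.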
So far I fixed the issue by overriding _get_sitemap_body to:
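The reporter's override is not shown here. What follows is only a minimal sketch of one possible workaround, not the reporter's code. It assumes the inherited `_get_sitemap_body(response)` returns `None` when it does not recognise a response as a sitemap; `QueryAwareSitemapMixin` is a hypothetical name introduced for illustration.

```python
from urllib.parse import urlsplit


class QueryAwareSitemapMixin:
    """Hypothetical mixin: list it before SitemapSpider in the base classes."""

    def _get_sitemap_body(self, response):
        # Let the inherited implementation decide first
        # (assumed to return None for responses it rejects).
        body = super()._get_sitemap_body(response)
        if body is not None:
            return body
        # Fall back to checking only the URL path, so a query string
        # after ".xml" or ".xml.gz" no longer hides the extension.
        path = urlsplit(response.url).path
        if path.endswith((".xml", ".xml.gz")):
            return response.body
        return None
```

With Scrapy installed, this could be combined as `class MySpider(QueryAwareSitemapMixin, SitemapSpider)`. The mixin has to come first so that its method wraps the inherited one rather than being shadowed by it.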