Feed URL has been added to the stats. #5982

Open

wants to merge 1 commit into master

Conversation

haideriqbal1

In this PR I have added the feed URL to the stats of the spider's job. Now anybody can easily get the feed URL from a job's stats instead of searching for it in the job's logs.

Testing:
I have uploaded these changes to the internal PyPI server. You can install this version of Scrapy by running the command below.
pip install scrapy (Add remaining command from internal PYPI docs)

After installing this version of Scrapy, you can run the spider below with the following command.
scrapy crawl feed_exporter_tester -o testing/test.json

After the run completes, you can check the job's stats; you should see a stat like this: 'feedexport/feed_url/test.json': 'testing/test.json'.

from scrapy import Field, Item, Spider
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst


class BookItem(Item):
    book_title = Field()
    book_price = Field()
    book_image_URL = Field()
    book_detail_page_URL = Field()


class BookItemLoader(ItemLoader):
    default_item_class = BookItem
    default_output_processor = TakeFirst()


class FeedExporterTesterSpider(Spider):
    name = "feed_exporter_tester"
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        # Each <li> in the product grid is one book.
        books = response.xpath('//section/div/ol[contains(@class, "row")]//li')
        for book in books:
            book_il = BookItemLoader(selector=book)
            book_il.add_xpath('book_title', './/h3/a/@title')
            book_il.add_xpath('book_price', './/p[@class="price_color"]/text()')
            # The image and detail-page URLs are relative, so resolve them
            # against the response URL before storing them.
            book_img_url_rel = book.xpath('.//div[@class="image_container"]/a/img/@src').get()
            book_il.add_value('book_image_URL', response.urljoin(book_img_url_rel))
            detail_page_url_rel = book.xpath('.//h3/a/@href').get()
            book_il.add_value('book_detail_page_URL', response.urljoin(detail_page_url_rel))
            yield book_il.load_item()

@codecov

codecov bot commented Jul 22, 2023

Codecov Report

Merging #5982 (165e058) into master (8055a94) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 165e058 differs from pull request most recent head 504f5c8. Consider uploading reports for the commit 504f5c8 to get more accurate results

@@           Coverage Diff           @@
##           master    #5982   +/-   ##
=======================================
  Coverage   89.19%   89.19%           
=======================================
  Files         162      162           
  Lines       11287    11289    +2     
  Branches     1833     1833           
=======================================
+ Hits        10067    10069    +2     
  Misses        928      928           
  Partials      292      292           
Impacted Files Coverage Δ
scrapy/extensions/feedexport.py 95.84% <100.00%> (+0.02%) ⬆️

@@ -480,6 +480,8 @@ def _handle_store_error(self, f, logmsg, spider, slot_type):

     def _handle_store_success(self, f, logmsg, spider, slot_type):
         logger.info("Stored %s", logmsg, extra={"spider": spider})
+        feed_uri = logmsg.split(":")[-1].strip()
+        self.crawler.stats.set_value(f"feedexport/feed_url/{feed_uri.split('/')[-1]}", feed_uri)
Contributor

Maybe this should be a Scrapy extension that users can opt into.
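
(For illustration, a minimal sketch of what such an opt-in extension could look like, assuming the feed_slot_closed signal added in Scrapy 2.9 and a uri attribute on the slot object it passes; none of this is part of the PR.)

from scrapy import signals


class FeedUrlStatsExtension:
    """Opt-in extension that records each stored feed URI in the stats."""

    def __init__(self, crawler):
        self.crawler = crawler
        # feed_slot_closed fires after a feed slot has finished storing its file.
        crawler.signals.connect(self.feed_slot_closed, signal=signals.feed_slot_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def feed_slot_closed(self, slot):
        # slot.uri is an assumption about the slot object's attributes.
        self.crawler.stats.set_value(
            f"feedexport/feed_url/{slot.uri.rsplit('/', 1)[-1]}", slot.uri
        )

Users would then opt in through the EXTENSIONS setting, e.g. EXTENSIONS = {"myproject.extensions.FeedUrlStatsExtension": 500}.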

Contributor

What would you prefer? @Gallaecio @wRAR

@@ -480,6 +480,8 @@ def _handle_store_error(self, f, logmsg, spider, slot_type):

     def _handle_store_success(self, f, logmsg, spider, slot_type):
         logger.info("Stored %s", logmsg, extra={"spider": spider})
+        feed_uri = logmsg.split(":")[-1].strip()
Contributor

Avoid this type of text transformation; it is quite possible it will break in the future if the format changes. If you need slot.uri you can pass it as an additional argument.
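
(Illustrative only: one way to follow that suggestion, assuming the callback site can pass the slot through; the extra uri parameter is an assumed change, not Scrapy's actual signature.)

# Sketch: receive the URI explicitly instead of parsing the log message.
def _handle_store_success(self, f, logmsg, spider, slot_type, uri):
    logger.info("Stored %s", logmsg, extra={"spider": spider})
    self.crawler.stats.set_value(f"feedexport/feed_url/{uri.rsplit('/', 1)[-1]}", uri)

# The call site would pass slot.uri as the extra argument, e.g.:
# d.addCallback(self._handle_store_success, logmsg, spider, slot_type, slot.uri)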

@@ -480,6 +480,8 @@ def _handle_store_error(self, f, logmsg, spider, slot_type):

     def _handle_store_success(self, f, logmsg, spider, slot_type):
         logger.info("Stored %s", logmsg, extra={"spider": spider})
+        feed_uri = logmsg.split(":")[-1].strip()
+        self.crawler.stats.set_value(f"feedexport/feed_url/{feed_uri.split('/')[-1]}", feed_uri)
Contributor

If there are multiple feeds exported to the same top-level path - for example, to the same bucket - only the last one will be kept, because the stat key is just the last path segment (e.g. both a/items.json and b/items.json map to feedexport/feed_url/items.json). That behavior could be OK-ish depending on your use case, but at the very least it should be clearly documented.

@VMRuiz
Contributor

VMRuiz commented Jul 24, 2023

Apart from the concerns in my previous comments, you will also need to add some docs and unit tests for this PR.

@Gallaecio
Member

To be honest, I am not sure it makes sense to add this to the stats at all.

What’s the scenario we want to address here? Not being sure that the right values are being used? Maybe logging the contents of FEEDS (if not done already) would be a better approach.

@wRAR
Member

wRAR commented Jul 24, 2023

Yeah, I'm not sure this is useful either.

@VMRuiz
Contributor

VMRuiz commented Jul 25, 2023

> To be honest, I am not sure it makes sense to add this to the stats at all.
>
> What’s the scenario we want to address here? Not being sure that the right values are being used? Maybe logging the contents of FEEDS (if not done already) would be a better approach.

The idea is to make it easier for Spidermon monitors to check that feeds have been uploaded, and also to have a quick way to find the uploaded feed URL rather than searching across millions of log lines.
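
(For illustration, a rough sketch of the kind of Spidermon monitor this stat would enable; the stat key comes from this PR, while the monitor itself is an assumption about how a user might consume it.)

from spidermon import monitors
from spidermon.contrib.scrapy.monitors import BaseScrapyMonitor


@monitors.name("Feed export")
class FeedUploadedMonitor(BaseScrapyMonitor):
    @monitors.name("A feedexport/feed_url/* stat was set")
    def test_feed_url_stat_present(self):
        # Collect every feed URL the job recorded under the new stat prefix.
        feed_urls = {
            key: value
            for key, value in self.data.stats.items()
            if key.startswith("feedexport/feed_url/")
        }
        self.assertTrue(feed_urls, msg="No feed URL found in stats; was the feed stored?")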

@Gallaecio
Member

> To have a quick way to find the uploaded feed URL rather than searching across millions of log lines.

The log message would be at the beginning, so the number of log messages should not be an issue.

> The idea is to make it easier for Spidermon monitors to check that feeds have been uploaded.

This makes sense, and I get that a log message does not work for this, but I wonder if there is a better way, or at least if the stat should be defined differently.

Given there are feedexport/failed_count/<slot_type> and feedexport/success_count/<slot_type>, and assuming those do not work here because you want to detect cases where the slot type is the same but a specific feed URL is failing (e.g. maybe read access was revoked), what about making counterparts to those two stats that indicate the feed URL template instead of the slot type? So, similar to the suggestion here, but:

  • 2 stats, not only the success one.
  • The number of successes and failures is counted; this is not just about setting the stat to a fixed value (see the sketch after this list).
  • The prefix is feedexport/failed_count/, or maybe feedexport/failed_count/slot/ (or slots, by_slot, per_slot) to differentiate it more from the existing stat.
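
(Illustrative sketch of that counter-based variant, assuming the feed URI template is reachable from both handlers; uri_template and the feed/ prefix segment are placeholder names, not settled API.)

def _handle_store_success(self, f, logmsg, spider, slot_type, uri_template):
    logger.info("Stored %s", logmsg, extra={"spider": spider})
    # Count successes per feed URI template, alongside the existing per-slot-type stat.
    self.crawler.stats.inc_value(f"feedexport/success_count/feed/{uri_template}")

def _handle_store_error(self, f, logmsg, spider, slot_type, uri_template):
    logger.error("Error storing %s", logmsg, extra={"spider": spider})
    # Failure counterpart, so a broken feed URL shows up in the stats too.
    self.crawler.stats.inc_value(f"feedexport/failed_count/feed/{uri_template}")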

@wRAR
Member

wRAR commented Jul 25, 2023

I feel like this is not the only time something has been added to the stats just so Spidermon can take it from there, but I don't think we have better ways to pass this.
