Feed URL has been added to the stats. #5982

Open

wants to merge 1 commit into master

Conversation

haideriqbal1

In this PR I have added the feed URL to the stats of the spider's job. Now anybody can easily get the feed URL from a job's stats instead of searching for it in the job's logs.

Testing:
I have uploaded these changes to the internal PyPI server. You can install this version of Scrapy by running the command below.
pip install scrapy (Add remaining command from internal PYPI docs)

After installing this version of Scrapy, you can run the spider below with the following command.
scrapy crawl feed_exporter_tester -o testing/test.json

After the run completes, you can check the job's stats; you should see a stat like this: 'feedexport/feed_url/test.json': 'testing/test.json'.

from scrapy import Field, Item, Spider
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst


class BookItem(Item):
    book_title = Field()
    book_price = Field()
    book_image_URL = Field()
    book_detail_page_URL = Field()


class BookItemLoader(ItemLoader):
    default_item_class = BookItem
    default_output_processor = TakeFirst()


class FeedExporterTesterSpider(Spider):
    name = "feed_exporter_tester"
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        # Each <li> in the product grid is one book.
        books = response.xpath('//section/div/ol[contains(@class, "row")]//li')
        for book in books:
            book_il = BookItemLoader(selector=book)
            book_il.add_xpath('book_title', './/h3/a/@title')
            book_il.add_xpath('book_price', './/p[@class="price_color"]/text()')
            # The image and detail-page URLs are relative, so resolve them
            # against the response URL before storing them.
            book_img_url_rel = book.xpath('.//div[@class="image_container"]/a/img/@src').get()
            book_il.add_value('book_image_URL', response.urljoin(book_img_url_rel))
            detail_page_url_rel = book.xpath('.//h3/a/@href').get()
            book_il.add_value('book_detail_page_URL', response.urljoin(detail_page_url_rel))
            yield book_il.load_item()

@codecov

codecov bot commented Jul 22, 2023

Codecov Report

Merging #5982 (165e058) into master (8055a94) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 165e058 differs from pull request most recent head 504f5c8. Consider uploading reports for the commit 504f5c8 to get more accurate results

@@           Coverage Diff           @@
##           master    #5982   +/-   ##
=======================================
  Coverage   89.19%   89.19%           
=======================================
  Files         162      162           
  Lines       11287    11289    +2     
  Branches     1833     1833           
=======================================
+ Hits        10067    10069    +2     
  Misses        928      928           
  Partials      292      292           
Impacted Files Coverage Δ
scrapy/extensions/feedexport.py 95.84% <100.00%> (+0.02%) ⬆️

@@ -480,6 +480,8 @@ def _handle_store_error(self, f, logmsg, spider, slot_type):

     def _handle_store_success(self, f, logmsg, spider, slot_type):
         logger.info("Stored %s", logmsg, extra={"spider": spider})
+        feed_uri = logmsg.split(":")[-1].strip()
+        self.crawler.stats.set_value(f"feedexport/feed_url/{feed_uri.split('/')[-1]}", feed_uri)
Contributor

Maybe this should be a Scrapy extension that users can opt into.
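
(For illustration, a minimal sketch of what such an opt-in extension could look like, assuming the feed_slot_closed signal added in Scrapy 2.9 and a uri attribute on the slot object it passes; none of this is part of the PR.)

from scrapy import signals


class FeedUrlStatsExtension:
    """Opt-in extension that records each stored feed URI in the stats."""

    def __init__(self, crawler):
        self.crawler = crawler
        # feed_slot_closed fires after a feed slot has finished storing its file.
        crawler.signals.connect(self.feed_slot_closed, signal=signals.feed_slot_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def feed_slot_closed(self, slot):
        # slot.uri is an assumption about the slot object's attributes.
        self.crawler.stats.set_value(
            f"feedexport/feed_url/{slot.uri.rsplit('/', 1)[-1]}", slot.uri
        )

Users would then opt in through the EXTENSIONS setting, e.g. EXTENSIONS = {"myproject.extensions.FeedUrlStatsExtension": 500}.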

Contributor

What would you prefer? @Gallaecio @wRAR

@@ -480,6 +480,8 @@ def _handle_store_error(self, f, logmsg, spider, slot_type):

     def _handle_store_success(self, f, logmsg, spider, slot_type):
         logger.info("Stored %s", logmsg, extra={"spider": spider})
+        feed_uri = logmsg.split(":")[-1].strip()
Contributor

Avoid this type of text transformation; it is quite possible it will break in the future if the format changes. If you need slot.uri you can pass it as an additional argument.
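
(Illustrative only: one way to follow that suggestion, assuming the callback site can pass the slot through; the extra uri parameter is an assumed change, not Scrapy's actual signature.)

# Sketch: receive the URI explicitly instead of parsing the log message.
def _handle_store_success(self, f, logmsg, spider, slot_type, uri):
    logger.info("Stored %s", logmsg, extra={"spider": spider})
    self.crawler.stats.set_value(f"feedexport/feed_url/{uri.rsplit('/', 1)[-1]}", uri)

# The call site would pass slot.uri as the extra argument, e.g.:
# d.addCallback(self._handle_store_success, logmsg, spider, slot_type, slot.uri)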

@@ -480,6 +480,8 @@ def _handle_store_error(self, f, logmsg, spider, slot_type):

     def _handle_store_success(self, f, logmsg, spider, slot_type):
         logger.info("Stored %s", logmsg, extra={"spider": spider})
+        feed_uri = logmsg.split(":")[-1].strip()
+        self.crawler.stats.set_value(f"feedexport/feed_url/{feed_uri.split('/')[-1]}", feed_uri)
Contributor

If there are multiple feeds exported to the same top-level path - for example, to the same bucket - only the last one will be kept, because the stat key is just the last path segment (e.g. both a/items.json and b/items.json map to feedexport/feed_url/items.json). That behavior could be OK-ish depending on your use case, but at the very least it should be clearly documented.

@VMRuiz
Contributor

VMRuiz commented Jul 24, 2023

Apart from the concerns in my previous comments, you will also need to add some docs and unit tests for this PR.

@Gallaecio
Member

To be honest, I am not sure it makes sense to add this to the stats at all.

What’s the scenario we want to address here? Not being sure that the right values are being used? Maybe logging the contents of FEEDS (if not done already) would be a better approach.

@wRAR
Member

wRAR commented Jul 24, 2023

Yeah, I'm not sure this is useful either.

@VMRuiz
Contributor

VMRuiz commented Jul 25, 2023

> To be honest, I am not sure it makes sense to add this to the stats at all.
>
> What’s the scenario we want to address here? Not being sure that the right values are being used? Maybe logging the contents of FEEDS (if not done already) would be a better approach.

The idea is to make it easier for Spidermon monitors to check that feeds have been uploaded, and also to have a quick way to find the uploaded feed URL rather than searching across millions of log lines.
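
(For illustration, a rough sketch of the kind of Spidermon monitor this stat would enable; the stat key comes from this PR, while the monitor itself is an assumption about how a user might consume it.)

from spidermon import monitors
from spidermon.contrib.scrapy.monitors import BaseScrapyMonitor


@monitors.name("Feed export")
class FeedUploadedMonitor(BaseScrapyMonitor):
    @monitors.name("A feedexport/feed_url/* stat was set")
    def test_feed_url_stat_present(self):
        # Collect every feed URL the job recorded under the new stat prefix.
        feed_urls = {
            key: value
            for key, value in self.data.stats.items()
            if key.startswith("feedexport/feed_url/")
        }
        self.assertTrue(feed_urls, msg="No feed URL found in stats; was the feed stored?")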

@Gallaecio
Member

> To have a quick way to find the uploaded feed URL rather than searching across millions of log lines.

The log message would be at the beginning, so the number of log messages should not be an issue.

> The idea is to make it easier for Spidermon monitors to check that feeds have been uploaded.

This makes sense, and I get that a log message does not work for this, but I wonder if there is a better way, or at least if the stat should be defined differently.

Given there are feedexport/failed_count/<slot_type> and feedexport/success_count/<slot_type>, and assuming those do not work here because you want to detect cases where the slot type is the same but a specific feed URL is failing (e.g. maybe read access was revoked), what about making counterparts to those two stats that indicate the feed URL template instead of the slot type? So, similar to the suggestion here, but:

  • 2 stats, not only the success one.
  • The number of successes and failures is counted; this is not just about setting the stat to a fixed value (see the sketch after this list).
  • The prefix is feedexport/failed_count/, or maybe feedexport/failed_count/slot/ (or slots, by_slot, per_slot) to differentiate it more from the existing stat.
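
(Illustrative sketch of that counter-based variant, assuming the feed URI template is reachable from both handlers; uri_template and the feed/ prefix segment are placeholder names, not settled API.)

def _handle_store_success(self, f, logmsg, spider, slot_type, uri_template):
    logger.info("Stored %s", logmsg, extra={"spider": spider})
    # Count successes per feed URI template, alongside the existing per-slot-type stat.
    self.crawler.stats.inc_value(f"feedexport/success_count/feed/{uri_template}")

def _handle_store_error(self, f, logmsg, spider, slot_type, uri_template):
    logger.error("Error storing %s", logmsg, extra={"spider": spider})
    # Failure counterpart, so a broken feed URL shows up in the stats too.
    self.crawler.stats.inc_value(f"feedexport/failed_count/feed/{uri_template}")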

@wRAR
Member

wRAR commented Jul 25, 2023

I feel like this is not the only time something has been added to the stats just so Spidermon can take it from there, but I don't think we have better ways to pass this.
