
max_active_size gives no warning when queue processing blocked (can cause deadlock when deferring items in a pipeline) #6206

Open
djay opened this issue Jan 15, 2024 · 11 comments


djay commented Jan 15, 2024

Description

When you hit SCRAPER_SLOT_MAX_ACTIVE_SIZE, requests silently stop being processed, with no warning.
If you are deferring items in a pipeline that depend on other requests finishing before they complete, you can end up in a deadlock with little explanation why.
This is made worse by the DownloaderAwarePriorityQueue, since it prioritizes less active sites, so the bottleneck site causing the deadlock may never be processed before the limit is reached.


Steps to Reproduce

  1. Run a crawl with a pipeline that will defer items (a minimal sketch follows this list). This means you hit max_active_size fast.
  2. Notice there is no warning or error.
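
A minimal sketch of such a pipeline (hypothetical illustration; the class and attribute names are made up), holding each item on a Deferred that only fires once some dependent request completes:

from twisted.internet.defer import Deferred


class DeferringPipeline:
    """Hypothetical repro: hold every item on a pending Deferred.

    While the Deferred is pending, the item keeps counting toward
    scraper.slot.active_size, so the crawl can hit
    SCRAPER_SLOT_MAX_ACTIVE_SIZE and stall with no warning.
    """

    def __init__(self):
        self.pending = []  # fired elsewhere, when dependent requests finish

    def process_item(self, item, spider):
        d = Deferred()
        self.pending.append(d)
        d.addCallback(lambda _: item)
        return d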

Expected behavior:

An error when you hit max_active_size, and the stats addon showing the number of active requests.

Actual behavior:

Nothing is shown, so it is hard to debug.

Reproduces how often:

100%

Versions

2.11.0, 2.7.0

Output of scrapy version --verbose:

Scrapy       : 2.11.0
lxml         : 4.9.1.0
libxml2      : 2.9.14
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 2.0.1
Twisted      : 22.8.0
Python       : 3.9.18 (main, Oct  3 2023, 08:56:37) - [Clang 12.0.0 (clang-1200.0.32.2)]
pyOpenSSL    : 22.1.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 38.0.1
Platform     : macOS-10.16-x86_64-i386-64bit


@djay djay changed the title max_active_size can cause deadlock when deferring items in a pipeline max_active_size gives no warning when queue processing blocked (can cause deadlock when deferring items in a pipeline) Jan 15, 2024
Gallaecio (Member) commented Jan 15, 2024

I guess we could add a warning the first time the limit is reached, in case users want to consider increasing it (to improve performance if they can spare the memory). But we should also allow disabling that warning.

djay (Author) commented Jan 16, 2024

@Gallaecio I'd suggest having the stats addon show an indication only when over max_active_size. Currently there is nothing to show that there is even anything in the queue, so it wasn't obvious whether the problem was items being dropped via errors, the queue being paused, or something external slowing requests way down.
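
For illustration, a rough sketch of such an indication as a custom extension (hypothetical code, not an existing Scrapy feature; it pokes at engine.scraper.slot internals, which are not a stable API):

from twisted.internet import task

from scrapy import signals


class ActiveSizeMonitor:
    """Hypothetical extension: periodically warn when the scraper slot
    is over SCRAPER_SLOT_MAX_ACTIVE_SIZE and backing out new requests."""

    def __init__(self, crawler, interval=60.0):
        self.crawler = crawler
        self.interval = interval
        self.loop = None
        crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_opened(self, spider):
        self.loop = task.LoopingCall(self.check, spider)
        self.loop.start(self.interval, now=False)

    def check(self, spider):
        slot = self.crawler.engine.scraper.slot
        if slot is not None and slot.active_size > slot.max_active_size:
            spider.logger.warning(
                "Scraper paused: active_size %(active)d exceeds "
                "SCRAPER_SLOT_MAX_ACTIVE_SIZE %(limit)d",
                {"active": slot.active_size, "limit": slot.max_active_size},
            )

    def spider_closed(self, spider):
        if self.loop and self.loop.running:
            self.loop.stop()

Enabling it would be the usual EXTENSIONS = {"myproject.extensions.ActiveSizeMonitor": 0} (path hypothetical).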

GeorgeA92 (Contributor) commented

When you hit SCRAPER_SLOT_MAX_ACTIVE_SIZE, requests silently stop being processed, with no warning.

Expected behavior:
An error when you hit max_active_size, and the stats addon showing the number of active requests.

This is a core feature by design, not a bug. If the application accumulates a certain amount of unprocessed responses, it stops sending new requests (to prevent excessive RAM usage).
If the application really got stuck because of SCRAPER_SLOT_MAX_ACTIVE_SIZE, that most likely means it got stuck on a CPU bottleneck (the response processing rate is slower than the request sending rate), in which case this is most likely a duplicate of #5302.

If you are deferring items in a pipeline that depend on other requests finishing before they complete, you can end up in a deadlock with little explanation why.

It depends on how exactly the new requests are issued inside the pipeline (if I'm interpreting "deferring items in a pipeline that depend on other requests.." correctly).
For example, image requests from the Scrapy ImagesPipeline don't affect scraper.slot.active_size.
The scraper slot is only expected to track responses that are to be processed by spider callback methods. Responses to other requests (issued from a pipeline) are not expected to reach the scraper (or to increase its active_size value).
Also, requests from the ImagesPipeline do not go through the scheduler, which means the DownloaderAwarePriorityQueue is not applied to them.

Can you please provide more detailed info, at least the stats dump log output (the ending log entry starting with [scrapy.statscollectors] INFO: Dumping Scrapy stats:)?

since it prioritizes less active sites, so the bottleneck site causing...

Is this a broad crawl case?

GeorgeA92 (Contributor) commented

def _needs_backout(self) -> bool:
    assert self.slot is not None  # typing
    assert self.scraper.slot is not None  # typing
    return (
        not self.running
        or bool(self.slot.closing)
        or self.downloader.needs_backout()
        or self.scraper.slot.needs_backout()
    )

This code fragment defines the conditions under which the application sends (or stops sending) new requests:

  1. not self.running or bool(self.slot.closing): the close-spider signal was triggered (not this case)
  2. self.downloader.needs_backout(): the downloader queue has reached the size of CONCURRENT_REQUESTS
  3. self.scraper.slot.needs_backout(): the application has accumulated unprocessed responses; this is the SCRAPER_SLOT_MAX_ACTIVE_SIZE check (see the sketch after this list)
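
For reference, the scraper-slot check in condition 3 boils down to a simple size comparison; a paraphrase of the relevant bits of scrapy/core/scraper.py (details may vary between versions):

# Paraphrased from scrapy/core/scraper.py (Scrapy 2.11); not the full class
class Slot:
    def __init__(self, max_active_size: int = 5000000):  # SCRAPER_SLOT_MAX_ACTIVE_SIZE
        self.max_active_size = max_active_size
        self.active_size = 0  # bytes of responses currently being processed

    def needs_backout(self) -> bool:
        # True once the accumulated response size exceeds the limit,
        # which silently pauses the feeding of new requests
        return self.active_size > self.max_active_size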

@djay
How did you conclude that the issue is caused by SCRAPER_SLOT_MAX_ACTIVE_SIZE (and not a busy downloader)?
If it is really a broad crawl, I'd bet on the downloader being bottlenecked by timeout exceptions (not scraper slot active size), as mentioned on https://docs.scrapy.org/en/latest/topics/broad-crawls.html?highlight=broad-crawls#reduce-download-timeout

But this will only become clear after at least reviewing the [scrapy.statscollectors] INFO: Dumping Scrapy stats log entry I requested.

djay (Author) commented Jan 16, 2024

Can you please provide more detailed info, at least the stats dump log output

...
2024-01-17 00:16:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:17:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:18:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:19:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:20:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:21:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:22:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:23:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:24:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)

For example, image requests from the Scrapy ImagesPipeline don't affect scraper.slot.active_size

I will look into how that is done, because this is the issue I'm having. To collect more information to include in an item, I have a pipeline that generates a lot of requests that do increase active_size, and this creates the deadlock. Increasing SCRAPER_SLOT_MAX_ACTIVE_SIZE staves off the problem for longer.

But the point of the ticket was more that it was very hard to work out what was going on, and a warning would have helped a lot.

GeorgeA92 (Contributor) commented

2024-01-17 00:16:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:17:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min)
2024-01-17 00:18:58..

As I mentioned earlier, it can still be caused by timeouts. A request that raised a timeout exception (no Response returned) doesn't count as crawled. During the period that looks idle in the provided log entries, the application may send hundreds or even thousands of requests (with no responses). That is why I asked for the [scrapy.statscollectors] INFO: Dumping Scrapy stats log entry, not log entries from [scrapy.extensions.logstats].
Alternatively, you can apply the PeriodicLog extension, which logs the required stats periodically (and post its output here), as described at https://docs.scrapy.org/en/latest/topics/extensions.html#periodic-log-extension.
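
For example, a minimal settings snippet to enable it (per the linked docs; the extension is available since Scrapy 2.11):

# settings.py
EXTENSIONS = {
    "scrapy.extensions.periodic_log.PeriodicLog": 0,
}
PERIODIC_LOG_STATS = True   # periodically log stat values
PERIODIC_LOG_DELTA = True   # also log per-interval deltas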

To collect more information to include in an item, I have a pipeline that generates a lot of requests that do increase active_size, and this creates the deadlock

We still don't know this for sure. Issuing new requests from a pipeline is not very good practice. And as I mentioned earlier, requests originating from a pipeline may not reach the scheduler or the scraper (they do not directly affect its slot's active size). It'd be good to share the pipeline code here too.

djay (Author) commented Jan 17, 2024

We still don't know this for sure.

99% sure. If I increase SCRAPER_SLOT_MAX_ACTIVE_SIZE, it solves the problem. I also used the debugger and AUTOTHROTTLE_DEBUG=True to show that no requests are happening.

Issuing new requests from a pipeline is not very good practice.

That may be, but my main point is that there is no warning about this. It's not in the documentation. It's not in the log messages. And really, if you are stopping the queue from being processed because some limit was hit, I think it's reasonable to assume there should be some log message for this, right?

The documentation only says a pipeline can return a Deferred, and that seems like a reasonable way to combine information from two different crawls into a single item. I don't actually spawn a request in the pipeline; I do it in another middleware. That middleware sets a special value in the item, which includes a Deferred for when the history crawl has been completed, and a callback that replaces the special values with the real values from the history.
This seemed to be what returning a Deferred was designed for. What is unexpected is that doing so keeps the original request active, despite neither the request nor the response being involved in the pipeline. I would not have expected that, and that is partly why it was so confusing that using a Deferred in the pipeline caused a problem.

from itemadapter import ItemAdapter
from twisted.internet.defer import Deferred


class DeferHistoryPipeline:
    """ Add in history once history async is complete """
    def process_item(self, item, spider):
        i = ItemAdapter(item)
        d = None
        callbacks = []
        # Find placeholder field values of the form (Deferred, callback)
        for value in i.values():
            if isinstance(value, tuple) and isinstance(value[0], Deferred):
                d, callback = value
                callbacks.append(callback)
        if not callbacks:
            return item

        def do_callbacks(history):
            # Replace the placeholder values with real history data
            for needs_item in callbacks:
                needs_item(i, history)
            return item

        # Returning a Deferred holds the item at this pipeline stage
        return d.addBoth(do_callbacks)

Note I also tried to solve this by giving my history requests a higher priority, but it doesn't seem to be respected, so that didn't work. If priority worked, the history requests would be completed before new items are generated and deferred waiting for even more history requests, so active_size wouldn't grow as fast.
Or the requests could be forgotten by the pipeline step, so they don't increase active_size.
Either of these would be nice. But really, all I'm asking for is a warning, so others don't waste as much time as I did trying to debug something that is apparently not catered for.

djay (Author) commented Jan 17, 2024

Something like this would be helpful.

2024-01-17 00:17:58 [scrapy.extensions.logstats] INFO: Crawled 2678 pages (at 0 pages/min), scraped 1636 items (at 0 items/min) - paused (213433 > SCRAPER_SLOT_MAX_ACTIVE_SIZE)

djay (Author) commented Jan 17, 2024

Also, requests from the ImagesPipeline do not go through the scheduler, which means the DownloaderAwarePriorityQueue is not applied to them.

It seems like the quoted comment isn't the case, according to the documentation below and the code: https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/media.py

It also seems like it is standard practice to use a Deferred and spawn requests that the item depends on from the pipeline; see https://docs.scrapy.org/en/latest/topics/media-pipeline.html

When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finished downloading (or fail for some reason).

What this isn't telling you is that if the files are from a single domain and the other URLs aren't, priority won't really help: with any kind of scheduling that limits requests per domain, requests from other domains will skip ahead of the priority requests, create more file requests, and you can end up in a deadlock. And have no idea why.

djay (Author) commented Jan 17, 2024

Also, requests from the ImagesPipeline do not go through the scheduler, which means the DownloaderAwarePriorityQueue is not applied to them.

OK, I think I see that self.crawler.engine.download, instead of engine.crawl, is the special sauce needed to prevent this:

https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/media.py#L144

self.slot.remove_request(request)
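
A minimal sketch of that approach (hypothetical; the URL and field names are made up, and engine.download() is an internal API whose signature has changed across Scrapy versions):

from scrapy import Request


class HistoryDownloadPipeline:
    """Hypothetical: fetch extra data via engine.download(), which sends
    the request straight to the downloader, bypassing the scheduler and
    the scraper slot (like MediaPipeline does)."""

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        request = Request("https://example.com/history")  # hypothetical URL
        d = self.crawler.engine.download(request)
        d.addCallback(self._merge_history, item)
        return d

    def _merge_history(self, response, item):
        item["history"] = response.text  # hypothetical field
        return item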

djay (Author) commented Jan 18, 2024

crawler.engine.download doesn't seem to help. Even if it avoids adding to the active size, the active size still grows and eventually blocks all requests, thus preventing it from shrinking.

Why is a deferred item in a pipeline still considered part of the active requests? That is the problem here. All that's left in the pipeline is the item; the request and response aren't accessible. They are done. They can't be reversed.
