max_active_size gives no warning when queue processing blocked (can cause deadlock when deferring items in a pipeline) #6206
I guess we could add a warning the first time the limit is reached, in case users want to consider increasing it (to improve performance if they can spare the memory). But we should also allow disabling that warning.
@Gallaecio I'd suggest the stats addon show an indication only when over max_active_size. Currently there is nothing to show that anything is even in the queue, so it wasn't obvious whether the problem was items being dropped via errors, the queue being paused, or something external slowing requests way down.
This is a core feature by design, not a bug. If the application accumulates a certain amount of unprocessed responses, it stops sending new requests (to prevent RAM overhead).
It depends on exactly how the new requests are issued inside the pipeline (if I correctly interpret the meaning of "deferring items in a pipeline that depend on other requests.."). Can you please provide more detailed info, at least the dumped stats log output (the final log entry starting with
Is this a broad crawl case?
Lines 197 to 205 in d5233bb
This code fragment defines the conditions under which the application sends (or doesn't send) a new request:
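Roughly, that check amounts to comparing the slot's accumulated active size against the limit. A simplified, illustrative model (names are hypothetical; the real logic lives in scrapy/core/scraper.py around the lines cited):

```python
class ScraperSlotModel:
    """Simplified model of the scraper slot's backpressure check.

    Illustrative only -- not the actual Scrapy source.
    """

    def __init__(self, max_active_size: int = 5_000_000) -> None:
        # Mirrors the SCRAPER_SLOT_MAX_ACTIVE_SIZE setting (default 5 MB).
        self.max_active_size = max_active_size
        self.active_size = 0

    def add_response(self, body_size: int) -> None:
        # Each in-flight response contributes its body size (with a small
        # floor) while spider callbacks and pipelines -- including any
        # returned Deferreds -- haven't finished with it.
        self.active_size += max(body_size, 1024)

    def finish_response(self, body_size: int) -> None:
        self.active_size -= max(body_size, 1024)

    def needs_backout(self) -> bool:
        # While this is True, the engine stops pulling new requests from
        # the scheduler -- silently, which is the complaint in this issue.
        return self.active_size > self.max_active_size
```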
@djay But it will only be clear after at least a review of
I will look into how that is done, because this is the issue I'm having. To collect more information to include in an item, I have a pipeline that generates a lot of requests that do increase active_size, and this creates the deadlock. Increasing SCRAPER_SLOT_MAX_ACTIVE_SIZE prevents the problem for longer. But the point of the ticket was more that it was very hard to work out what was going on, and a warning would have helped a lot.
As I mentioned earlier, it can still be caused by timeouts. A request that returns a timeout exception (no Response returned) doesn't count as crawled. During that period, which from the provided log entries looks like idle time, the application may send hundreds or even thousands of requests (with no responses). That's why I've asked for
We still don't know this for sure. Issuing new requests from a pipeline is not very good practice. And as I mentioned earlier, requests originating from a pipeline may not reach the scheduler or scraper (and so do not directly affect its slot's active size). It'd be good to share the pipeline code here too.
99% sure. If I increase SCRAPER_SLOT_MAX_ACTIVE_SIZE, it solves the problem. I also used the debugger and AUTOTHROTTLE_DEBUG=True to show that no requests are happening.
That may be, but my main point is that there is no warning about this. It's not in the documentation. It's not in the log messages. And really, if you are stopping the queue from being processed by hitting some limit, I think it's reasonable to assume there should be some log message for this, right? The documentation only says a pipeline can return a Deferred, and that seems like a reasonable way to combine information from two different crawls into a single item. I don't actually spawn a request in the pipeline; I do it in another middleware. That then sets a special value in the item which includes a deferred for when the history crawl has been completed, and a callback that replaces the special values on the item with the real values from the history.
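A minimal sketch of that pattern (hypothetical names; modern Scrapy also accepts a coroutine `process_item`, which is equivalent to returning a Deferred here):

```python
import asyncio


class HistoryMergePipeline:
    """Illustrative sketch of the pattern described above.

    'history_ready' is a hypothetical item key set by other middleware;
    it holds an awaitable that fires once the related history crawl is done.
    """

    async def process_item(self, item, spider):
        fut = item.pop("history_ready", None)
        if fut is None:
            return item
        # While we wait here, the item is still "active" in the scraper
        # slot, so it keeps counting toward SCRAPER_SLOT_MAX_ACTIVE_SIZE.
        item["history"] = await fut
        return item
```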
Note I also tried to solve this by setting my history requests to a higher priority, but it doesn't seem to be respected, so that didn't work. If priority worked, the history requests would be completed before new items are generated and deferred waiting for even more history requests, so the active_size wouldn't increase as fast.
Something like this would be helpful.
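For instance, a one-shot warning when backpressure first engages (a hypothetical sketch, not actual Scrapy code):

```python
import logging

logger = logging.getLogger("scrapy.core.scraper")


class BackoutWarner:
    """Hypothetical helper: warn once when the scraper slot starts
    pausing request scheduling."""

    def __init__(self, max_active_size: int) -> None:
        self.max_active_size = max_active_size
        self._warned = False

    def check(self, active_size: int) -> bool:
        over = active_size > self.max_active_size
        if over and not self._warned:
            self._warned = True
            logger.warning(
                "active_size (%d) exceeded SCRAPER_SLOT_MAX_ACTIVE_SIZE "
                "(%d); no new requests will be scheduled until responses "
                "and items drain.",
                active_size,
                self.max_active_size,
            )
        return over
```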
It seems like the comment below isn't the case according to the documentation and the code: https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/media.py. It also seems like it is standard practice to use a Deferred and spawn requests that depend on it from the pipeline; see https://docs.scrapy.org/en/latest/topics/media-pipeline.html
What this isn't telling you is that if the files are from a single domain and the other URLs aren't, then priority won't really help if you have any kind of scheduling that limits requests per domain, since requests from other domains will skip ahead of the priority requests, create more file requests, and you can end up in a deadlock, with no idea why.
ok. I think I see that self.crawler.engine.download instead of engine.crawl is the special sauce needed to prevent this: https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/media.py#L144 Line 312 in d5233bb
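The difference between the two paths can be sketched with a toy model (illustrative only, not Scrapy internals):

```python
class ToyEngine:
    """Toy model contrasting the two request paths discussed above."""

    def __init__(self) -> None:
        self.scraper_active_size = 0

    def crawl(self, body_size: int) -> bytes:
        # engine.crawl() path: the response is routed back through the
        # spider/scraper, inflating active_size until processing finishes.
        self.scraper_active_size += body_size
        return b"x" * body_size

    def download(self, body_size: int) -> bytes:
        # engine.download() path (the one the media pipelines use): the
        # response goes straight back to the caller's Deferred and never
        # touches the scraper slot.
        return b"x" * body_size
```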
Crawler.download doesn't seem to help. Even if it avoids adding to the active size, the active size still grows and eventually blocks all requests, thus preventing it from shrinking. Why is a deferred item in a pipeline still considered part of the active requests? That is the problem here. All that's left in the pipeline is the item. The request and response aren't accessible. They are done. They can't be reversed.
Description
When you hit SCRAPER_SLOT_MAX_ACTIVE_SIZE, requests silently stop being processed, with no warning.
If you are deferring items in a pipeline that depend on other requests finishing before completing, you can get into a deadlock with little explanation why. This is made worse by the DownloaderAwarePriorityQueue, as it prefers less active sites, so the bottleneck site causing the deadlock might never get processed before the limit is reached.
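The deadlock mechanism can be illustrated with a toy calculation (assumed numbers, not Scrapy code):

```python
def simulate(max_active_size: int, item_size: int, deferred_items: int) -> str:
    """Toy model: each deferred item pins roughly item_size bytes in the
    scraper slot until one more follow-up request completes."""
    active_size = deferred_items * item_size
    if active_size > max_active_size:
        # Backpressure engages: the follow-up requests the items are
        # waiting on never get scheduled, so active_size can never
        # shrink -- a silent deadlock.
        return "deadlock"
    return "progress"
```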
Steps to Reproduce
Expected behavior:
An error when you hit max_active_size. Stats addon to show number of active requests.
Actual behavior:
Nothing is shown, so it is hard to debug.
Reproduces how often:
100%
Versions
2.11.0, 2.7.0