scrapy.item.Field memory leak. #5995
Comments
The obvious next step is to disable your pipelines. I don't think this report is actionable as-is.
And, on one hand, having a lot of …
@wRAR Moreover, I understand that it's normal for a long-running spider to occupy around 2 GB of memory. However, my concern is that since my spider is scheduled through schedule.py, it is not running continuously. I think that when the lifecycle of each spider ends, all Items should be destroyed, along with their Fields. According to my logs, it appears they are not being destroyed and instead accumulate to 200+120 MB after half a day. I have also used scrapy.utils.trackref.print_live_refs to confirm that no spiders were running at that time. Additionally, you mentioned the possibility of dynamically created Fields being cached somewhere, and I will investigate this in my code. Once again, thank you for your attention to this matter; I will share the logs as soon as I have them.
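(For reference, a minimal sketch of the trackref check mentioned above; the class name passed to `get_oldest` is only an example, not something from this issue.)

```python
from scrapy.utils.trackref import print_live_refs, get_oldest

# Print counts of live tracked objects (spiders, requests, responses, items,
# selectors) grouped by class; no spider entry means no spider instance is
# still alive.
print_live_refs()

# Optionally fetch the oldest live object of a given class for inspection
# (returns None when there are none); the class name here is illustrative.
oldest = get_oldest("HtmlResponse")
```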
Makes sense. As you have a long-lived CrawlerRunner instance, you can try checking whether any large data is reachable from it (manually or using tools).
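(One possible way to do that check, sketched with the third-party objgraph package; `runner` stands for the long-lived CrawlerRunner instance and is an assumed name, and rendering the graph to a file requires Graphviz.)

```python
import objgraph

# Show which object types are currently most numerous in the heap
objgraph.show_most_common_types(limit=20)

# Draw the graph of objects referenced by the long-lived runner, i.e. data
# still reachable from it ("runner" is the CrawlerRunner created in schedule.py)
objgraph.show_refs([runner], max_depth=3, filename="runner_refs.png")
```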
I actually meant the Scrapy code; I don't know where it could happen, if at all, but it's possible.
I disabled all the pipelines, but nothing changed. Even the output from muppy was almost the same. Here is the log:
I tried using …
Can you recommend any ways or tools that can help me locate those "large data" objects? And could you try to reproduce this issue?
You haven't provided a minimal reproducible example, so no.
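(For what it's worth, a heap summary of the kind shown in such logs can be produced with Pympler's muppy; this sketch only illustrates the general approach, not the exact logging used in this project.)

```python
from pympler import muppy, summary

all_objects = muppy.get_objects()        # snapshot of every object in the heap
heap_summary = summary.summarize(all_objects)
summary.print_(heap_summary)             # per-type counts and total sizes
```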
In the previous step, we ruled out problems with the pipelines. Therefore, I believe that pasting the provided schedule.py into the same directory as scrapy.cfg, creating a YAML file as shown below, and finally taking a few spiders and changing `yield Item` to `yield dict_to_item("ItemClass", {**data})` should be enough as an MRE. Do you need me to provide more code? Please let me know.
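(A minimal sketch of what such a schedule.py might look like; the spider name, interval, and overall structure are assumptions, since the real script reads its configuration from the YAML file mentioned above.)

```python
# schedule.py — illustrative long-lived scheduler built on CrawlerRunner
from twisted.internet import reactor, task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging()
runner = CrawlerRunner(get_project_settings())

def run_spiders():
    # crawl() schedules a spider inside the running reactor and returns a
    # Deferred; errors are printed so the loop keeps running.
    d = runner.crawl("example_spider")  # spider name is an assumption
    d.addErrback(lambda failure: failure.printTraceback())

task.LoopingCall(run_spiders).start(3600)  # re-run every hour
reactor.run()
```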
I want you to provide a minimal reproducible example. What you are suggesting is not minimal and not complete, and once you have a minimal, complete one you should reproduce the problem with it yourself before offering it.
OK, this will take some time, and in the process I will also find some way to locate that "large data". It looks like it's going to be a big project. I will let you know if there is any progress, or give you an MRE repo.
I've located the leak point using …
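(The tool used here is not named in the thread; for illustration only, tracemalloc is one way to obtain this kind of allocation-site output.)

```python
import tracemalloc

tracemalloc.start(25)          # keep up to 25 frames per allocation traceback

# ... run the spiders for a while ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("traceback")[:10]:
    print(stat)                 # size, count and allocation location
    for line in stat.traceback.format():
        print(line)             # the full call stack that allocated the memory
```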
Sorry, which middleware? Can you point to code that does this as it's not clear from the output you provided?
No. Specifically, spider middlewares receive the callback results in `process_spider_output()`.
I apologize for not being rigorous enough. At this stage, I cannot be sure whether the middleware cached something or Twisted did, although I suspect the middleware is the cause. So my next step is to disable the 4 spider middlewares that appear in the call stack. Can I disable the spider middlewares like this?
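(The usual way to do this is to map the middleware classes to None in settings.py; which four middlewares appeared in the call stack is not shown here, so the entries below are just some of the built-in defaults as an example.)

```python
# settings.py — disabling built-in spider middlewares by mapping them to None
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": None,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    "scrapy.spidermiddlewares.referer.RefererMiddleware": None,
    "scrapy.spidermiddlewares.depth.DepthMiddleware": None,
}
```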
Yes.
I turned off the middlewares, but the leak still occurred. It seems likely that Twisted or the Scrapy core might be doing something. Here is the output for reference:
I don't know how to interpret this output, but it looks like it just shows where the objects were created?
@BillYu811
In the general case we expect that one or several item classes are defined up front; I don't remember anything like dynamic items (in the sense that multiple item classes are created at runtime) being mentioned in the docs.

I don't think that … You need to return items as instances of a cached item class, something like:

```python
import scrapy
from scrapy.utils.project import get_project_settings

def dict_to_item(class_name: str, dictionary: dict):
    # The cache may be shared between multiple spider runs
    cache = get_project_settings().get("DICT_TO_ITEM_CACHE", None)
    if cache is not None:
        if cache.get(class_name) is not None:
            item_cls = cache[class_name]
        else:
            # Create the Item subclass once and reuse it afterwards
            item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
            cache[class_name] = item_cls
        item = item_cls()
        item.update(dictionary)
        return item
    # No cache configured: a new Item subclass is created on every call
    item = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})()
    item.update(dictionary)
    return item
```

Inside … And running multiple spiders in a single process using CrawlerRunner …
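(For illustration, calling the helper above from a spider callback might look like this; the spider, URL, and field names are made up and `dict_to_item` is assumed to be importable in the project.)

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Mirrors the `yield dict_to_item("ItemClass", {**data})` pattern
        # from the issue description, with made-up field names.
        data = {"title": response.css("title::text").get(), "url": response.url}
        yield dict_to_item("PageItem", {**data})
```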
Thank you very much for your response. It seems using a …
Yes, you're right. The documentation doesn't mention anything about dynamic items. The reason I attempted to use dynamic items is, as mentioned before, that for some simple crawlers I can use explicitly defined items, but for more complex ones I can't determine all field names in advance (I need to define the item dynamically). I wanted only one standardized pipeline to handle all data, hence I explored this approach based on input from ChatGPT.
The first time, I encountered the leak when using `dict_to_item` as:
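(The original, uncached version isn't quoted in this thread; reconstructed from the last branch of the cached implementation above, it would look roughly like this, creating a new Item subclass and new Field objects on every call.)

```python
import scrapy

def dict_to_item(class_name: str, dictionary: dict):
    # A new Item subclass (and new Field objects) is created for every item
    item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
    item = item_cls()
    item.update(dictionary)
    return item
```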
I thought the leak might be due to repetitive definitions of item_cls, so I implemented dict_to_item with the caching you mentioned. However, the leak still persisted, leading me to believe that the cache (or the duplicate field definitions) might not be causing the issue.
As I mentioned regarding the spider lifecycle, I believe that when a spider's lifecycle ends, the items it uses should be released, and consequently the fields should be released as well. Therefore, I think dynamically defined items and fields should be destroyed at the end of a TaskUnit; that would make sense.
Description
I use `yield dict_to_item("ItemClass", {**data})` (implemented by the following code) to convert a data dict into an item. In the pipeline, data is inserted into different Mongo collections according to `item.__class__.__name__`.
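(A rough sketch of the kind of pipeline described above; the connection settings, database name, and class details are assumptions rather than the actual project code.)

```python
import pymongo
from itemadapter import ItemAdapter

class MongoPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.db = self.client["scrapy_data"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # The item's class name selects the target collection
        collection = self.db[item.__class__.__name__]
        collection.insert_one(ItemAdapter(item).asdict())
        return item
```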
When the project is run with `python schedule.py`, it takes 2 GB of memory after one day, and then my Docker kills it.
Additional
Versions
python 3.8
scrapy 2.7.0 / 2.5.1 (tried both)