scrapy.item.Field memory leak. #5995

Open

BillYu811 opened this issue Aug 2, 2023 · 17 comments

BillYu811 commented Aug 2, 2023

Description

  • I create a CrawlerRunner as the official docs suggest (https://docs.scrapy.org/en/latest/topics/practices.html?highlight=in%20script#run-scrapy-from-a-script). On top of that, I schedule tasks by reading a scheduled.yml file with the code below.
  • I usually use yield dict_to_item("ItemClass", {**data}) (implemented in the code below) to convert a data dict into an item. In the pipeline, data is inserted into different Mongo collections depending on item.__class__.__name__.
  • When I run python schedule.py, it takes 2 GB of memory after one day, and then my Docker kills it.
  • I used muppy to check what was hogging the memory and got the following results. It seems that only scrapy.item.Field is not released, so I ruled out a memory leak in the Spider class.
  • I checked all the documentation and Stack Overflow and didn't find a similar question, so I'm asking here. Please let me know if my description is confusing or if I've made a low-level mistake in my code. At the moment I think there may be something wrong with the way I implement the scheduled jobs, or with dict_to_item(), and I can't rule out a Python garbage collection problem or a Scrapy bug.
    import logging

    from pympler import muppy, summary
    from twisted.internet.task import LoopingCall

    def debug_memory_leak():
        logging.warning("=" * 50 + "MEMORY LEAK DEBUG START" + "=" * 50)
        all_objects = muppy.get_objects()
        suml = summary.summarize(all_objects)
        summary.print_(suml)
        logging.warning("=" * 50 + "MEMORY LEAK DEBUG END" + "=" * 50)

    schedule_from_yml()
    LoopingCall(debug_memory_leak).start(60 * 30)
    
    # RESULTS (after about half a day of running):
                      types |   # objects |   total size
          scrapy.item.Field |      824855 |    195.09 MB
                       dict |      352841 |    126.81 MB
       scrapy.item.ItemMeta |      117544 |    119.27 MB
                  frozenset |      118478 |     25.48 MB
                      tuple |      320417 |     19.94 MB
                        set |       59906 |     12.59 MB
                    weakref |      182900 |     12.56 MB
                        str |       77587 |      9.03 MB
          function (remove) |       59033 |      7.66 MB
                  _abc_data |      117928 |      5.40 MB
                       code |       22258 |      3.98 MB
                        int |      126113 |      3.84 MB
                       type |        3954 |      3.66 MB
                       list |       61629 |      3.63 MB
  weakref.WeakKeyDictionary |       58999 |      2.70 MB
    
    
    
# schedule.py
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor, defer
from scrapy.utils.log import configure_logging
from croniter import croniter
import datetime as dt
import logging, yaml, sys, os

configure_logging(get_project_settings())
runner = CrawlerRunner(get_project_settings())


def schedule_next_crawl_cron(null, spider, expression):
    now = dt.datetime.now()
    _next = croniter(expression, now).get_next(dt.datetime)
    sleep_time = int(_next.timestamp() - now.timestamp())
    logging.warning('<{}> cron -{}- next run at {} (after {} seconds).'.format(spider, expression, _next, sleep_time))
    reactor.callLater(sleep_time, crawl_cron, spider, expression, True)


def crawl_job(spider, run_at_scheduled, kwargs):
    if run_at_scheduled:
        return runner.crawl(spider, **kwargs)
    else:
        return defer.succeed([])

def crawl_cron(spider, expression, run_at_scheduled=False, kwargs=None):
    if kwargs is None:
        kwargs = {}
    d = crawl_job(spider, run_at_scheduled, kwargs)
    d.addCallback(schedule_next_crawl_cron, spider, expression)

def schedule_from_yml(file='scheduled.yml'):
    file = sys.path[0] + '/' + file
    with open(file) as f:
        scheduled = yaml.safe_load(f).get('scheduled', [])
    _allowed_method = ['interval', 'at_time', 'cron']
    for s in scheduled:
        logging.warning('add schedule {} from {}'.format(s, file))
        if s['method'] in _allowed_method:
            globals()['crawl_' + s.pop('method')](**s)


schedule_from_yml()
reactor.run()
# item.py
import scrapy


def dict_to_item(class_name: str, dictionary: dict):
    item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
    return item_cls(**dictionary)
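
For reference, each call to dict_to_item creates at runtime a class roughly equivalent to an explicitly defined item such as the following (the class and field names here are purely illustrative):

import scrapy


class ApplePieItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()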

Additional

  • I've tried setting JOB_DIR and running schedule.py, but nothing changed.
  • At first I thought it was caused by the repeated definition of scrapy.Field(), so I changed dict_to_item() to cache the defined item_cls in a settings dict (which is essentially like defining the item explicitly), but it didn't work:
    def dict_to_item(class_name: str, dictionary: dict):
        cache = get_project_settings().get("DICT_TO_ITEM_CACHE", None)
        if cache:
            if cache.get(class_name,None):
                item_cls = cache[class_name]
            else:
                item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
                cache[class_name] = item_cls
            return item_cls().update(dictionary)
        return type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})().update(dictionary)
    
  • I know I should probably define items explicitly, but in some of my spiders I can't be sure of every field name in advance; in those cases I use dict_to_item(). I have only one pipeline to process all crawler data; it inserts data into different Mongo collections based on item.__class__.__name__ (for example, inserting ApplePieItem into the apple_pies collection). A sketch of such a pipeline follows below.
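
A minimal sketch of that single routing pipeline, assuming pymongo and made-up setting names (MONGO_URI, MONGO_DB); this illustrates the idea described above, not the project's actual pipeline code:

import re

import pymongo


class MongoRouterPipeline:
    """Insert every item into a collection derived from its class name."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DB"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # e.g. "ApplePieItem" -> "apple_pie" (pluralization omitted for brevity)
        name = type(item).__name__
        if name.endswith("Item"):
            name = name[: -len("Item")]
        collection = re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()
        self.db[collection].insert_one(dict(item))
        return item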

Versions

Python 3.8
Scrapy 2.7.0 / 2.5.1 (tried both)

wRAR (Member) commented Aug 2, 2023

The obvious next step is to disable your pipelines.

I don't think this report is actionable as-is.
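
For reference, pipelines can be disabled for such a test by emptying the ITEM_PIPELINES setting, e.g.:

# settings.py (or per-spider custom_settings)
ITEM_PIPELINES = {}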

wRAR (Member) commented Aug 2, 2023

And, on one hand, having a lot of Field and ItemMeta objects may suggest that all of those dynamically created item objects are cached somewhere; on the other hand, using 2 GB in a long-running spider is really not much, and finding that some things use 200+120 MB may even be a red herring.

BillYu811 (Author) commented:

And, on one hand, having a lot of Field and ItemMeta objects may suggest that all of those dynamically created item objects are cached somewhere; on the other hand, using 2 GB in a long-running spider is really not much, and finding that some things use 200+120 MB may even be a red herring.

@wRAR
Thank you for your response, and I truly appreciate you taking the time to look into my issue. I will disable the pipelines, let my spider run for 0.5~1 day, and then post the logs here.

Moreover, I understand that it's normal for a long-running spider to occupy around 2 GB of memory. However, my concern is that, since my spiders are scheduled through schedule.py, they are not running continuously. I think when the lifecycle of each spider ends, all Items should be destroyed, along with their Fields. According to my logs, they are not being destroyed, and they accumulate to 200+120 MB after half a day. I have also used scrapy.utils.trackref.print_live_refs to confirm that no spiders were running at that time.
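
For reference, that check looks like this (scrapy.utils.trackref tracks live instances of classes derived from object_ref, such as spiders, requests, responses and items):

from scrapy.utils.trackref import print_live_refs

print_live_refs()  # prints live object counts per tracked class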

Additionally, you mentioned the possibility of dynamically created Fields being cached somewhere, and I will investigate this in my code.

Once again, thank you for your attention to this matter, and I will keep you updated with the logs as soon as I have them.

wRAR (Member) commented Aug 2, 2023

I think when the lifecycle of each spider ends, all Items should be destroyed, along with their Fields.

Makes sense. As you have a long-lived CrawlerRunner instance you can try looking if there is any large data reachable from it (manually or using tools).
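
For example, one rough way to do that with pympler, which is already in use here ("runner" being the long-lived CrawlerRunner from schedule.py):

from pympler import asizeof

# Recursively sized view of everything reachable from the runner
print(asizeof.asized(runner, detail=2).format())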

Additionally, you mentioned the possibility of dynamically created Fields being cached somewhere, and I will investigate this in my code.

I actually meant the Scrapy code. I don't know where it could happen, if at all, but it's possible.

BillYu811 (Author) commented:

I think when the lifecycle of each spider ends, all Items should be destroyed, along with their Fields.

Makes sense. As you have a long-lived CrawlerRunner instance you can try looking if there is any large data reachable from it (manually or using tools).

Additionally, you mentioned the possibility of dynamically created Fields being cached somewhere, and I will investigate this in my code.

I actually meant the Scrapy code. I don't know where it could happen, if at all, but it's possible.

I disabled all the pipelines, but nothing changed. Even the output from muppy was almost the same. Here is the log:

          scrapy.item.Field |      497407 |    117.64 MB
                       dict |      235383 |     80.38 MB
       scrapy.item.ItemMeta |       70698 |     71.74 MB
                  frozenset |       71632 |     15.83 MB
                      tuple |      203166 |     12.60 MB
                        str |       79152 |      9.15 MB
                        set |       39306 |      8.32 MB
                    weakref |      121063 |      8.31 MB
          function (remove) |       35611 |      4.62 MB
                       code |       22176 |      3.97 MB
                       type |        3939 |      3.65 MB
                  _abc_data |       71082 |      3.25 MB
                        int |       79246 |      2.41 MB
                       list |       38087 |      2.39 MB
  weakref.WeakKeyDictionary |       35577 |      1.63 MB

I tried using pympler.classtracker to get the call stack for these objects (code below), but an error occurred. I don't think I can debug it in a short time.

from pympler.classtracker import ClassTracker

tracker = ClassTracker()
tracker.track_class(scrapy.item.Field, resolution_level=3)
tracker.track_class(scrapy.item.ItemMeta, resolution_level=3)
tracker.create_snapshot()
tracker.stats.print_summary()

Can you recommend any methods or tools that could help me locate those "large data" objects? And could you try to reproduce this issue?

wRAR (Member) commented Aug 3, 2023

And could you try to reproduce this issue?

You haven't provided a minimal reproducible example so no.

BillYu811 (Author) commented:

And could you try to reproduce this issue?

You haven't provided a minimal reproducible example so no.

In the previous step, we ruled out problems with the pipelines. Therefore, I believe that pasting the provided "schedule.py" into the same directory as "scrapy.cfg", creating a YAML file as shown below, and finally taking a few spiders and changing yield Item to yield dict_to_item("ItemClass", {**data}) should be enough as an MRE. Do you need me to provide more code? Please let me know.

# scheduled.yml

scheduled:
  - method: interval
    spider: spider1
    seconds: 1800
    run_at_scheduled: True
  - method: cron
    spider: spider2
    expression: '0 19 * * 4,1'
  - method: cron
    spider: spider3
    expression: '0 */3 * * *'
# schedule.py
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor, defer
from twisted.internet.task import LoopingCall
from scrapy.utils.log import configure_logging
from croniter import croniter
from pympler import muppy, summary
import datetime as dt
import logging, yaml, sys

configure_logging(get_project_settings())
runner = CrawlerRunner(get_project_settings())

def schedule_next_crawl_interval(null, spider, sleep_time):
    logging.warning('<{}> next run after {} seconds.'.format(spider, sleep_time))
    reactor.callLater(sleep_time, crawl_interval, spider, sleep_time, True)


def schedule_next_crawl_at_time(null, spider, hour, minute, second=0):
    next_time = dt.datetime.now().replace(hour=hour, minute=minute, second=second)
    if dt.datetime.now() > next_time:
        next_time = next_time + dt.timedelta(days=1)
    sleep_time = (next_time - dt.datetime.now()).total_seconds()
    logging.warning('<{}> next run at {} (after {} seconds).'.format(spider, next_time, sleep_time))
    reactor.callLater(sleep_time, crawl_at_time, spider, hour, minute, second, True)


def schedule_next_crawl_cron(null, spider, expression):
    now = dt.datetime.now()
    _next = croniter(expression, now).get_next(dt.datetime)
    sleep_time = int(_next.timestamp() - now.timestamp())
    logging.warning('<{}> cron -{}- next run at {} (after {} seconds).'.format(spider, expression, _next, sleep_time))
    reactor.callLater(sleep_time, crawl_cron, spider, expression, True)


def crawl_job(spider, run_at_scheduled, kwargs):
    if run_at_scheduled:
        return runner.crawl(spider, **kwargs)
    else:
        return defer.succeed([])


# -minute hour day(of_month) month days(of_week)-
def crawl_cron(spider, expression, run_at_scheduled=False, kwargs=None):
    if kwargs is None:
        kwargs = {}
    d = crawl_job(spider, run_at_scheduled, kwargs)
    d.addCallback(schedule_next_crawl_cron, spider, expression)


def crawl_interval(spider, seconds=60, run_at_scheduled=False, kwargs=None):
    if kwargs is None:
        kwargs = {}
    d = crawl_job(spider, run_at_scheduled, kwargs)
    d.addCallback(schedule_next_crawl_interval, spider, seconds)


def crawl_at_time(spider, hour=0, minute=0, second=0, run_at_scheduled=False, kwargs=None):
    if kwargs is None:
        kwargs = {}
    d = crawl_job(spider, run_at_scheduled, kwargs)
    d.addCallback(schedule_next_crawl_at_time, spider, hour, minute, second)


def schedule_from_yml(file='scheduled.yml'):
    file = sys.path[0] + '/' + file
    with open(file) as f:
        scheduled = yaml.safe_load(f).get('scheduled', [])
    _allowed_method = ['interval', 'at_time', 'cron']
    for s in scheduled:
        logging.warning('add schedule {} from {}'.format(s, file))
        if s['method'] in _allowed_method:
            globals()['crawl_' + s.pop('method')](**s)


def debug_memory_leak():
    logging.warning("=" * 50 + "MEMORY LEAK DEBUG START" + "=" * 50)
    all_objects = muppy.get_objects()
    suml = summary.summarize(all_objects)
    summary.print_(suml)
    logging.warning("=" * 50 + "MEMORY LEAK DEBUG END" + "=" * 50)


schedule_from_yml()
LoopingCall(debug_memory_leak).start(60 * 30)
reactor.run()

wRAR (Member) commented Aug 3, 2023

I want you to provide a minimal reproducible example. What you are suggesting is not minimal and not complete, and after you get a minimal, complete one, you should reproduce the problem with it yourself before offering it.

BillYu811 (Author) commented:

I want you to provide a minimal reproducible example. What you are suggesting is not minimal and not complete, and after you get a minimal, complete one, you should reproduce the problem with it yourself before offering it.

OK, this will take some time, and in the process I will also look for a way to locate that "large data". It looks like it's going to be a big project. I will let you know if there is any progress, or give you an MRE repo.

BillYu811 (Author) commented Aug 6, 2023

I've located the leak point using tracemalloc. The following is the code and the results. @wRAR
It seems that when the spider yields a dynamic Item, it's being intercepted by middlewares, and a middleware is caching all iterable objects. This has left me puzzled, as middlewares should ideally only affect Requests, right?
I'm not very familiar with the Scrapy source code. Could you provide me with more information? Or is the current information enough for you to proceed?

import tracemalloc


def display_top_memory_usage(snapshot, key_type='lineno', limit=5, num_frames=10):
    top_stats = snapshot.statistics(key_type)
    print("Top {} memory usage:".format(limit))
    for index, stat in enumerate(top_stats[:limit], 1):
        size = stat.size / (1024 * 1024) 
        print("#{}: {} MB".format(index, size))
        for line in stat.traceback.format():
            print(f"  > {line}")
        print("*" * 120)


def display_memory_usage():
    snapshot = tracemalloc.take_snapshot()
    display_top_memory_usage(snapshot, key_type='traceback', limit=10)
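
(Note: tracebacks as deep as the ones below require tracemalloc to be started with a frame limit greater than the default of 1, e.g.:)

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocated block
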
Top 10 memory usage:
#1: 119.09843444824219 MB
  >   File "scheduled.py", line 125
  >     reactor.run()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1318
  >     self.mainLoop()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1328
  >     reactorBaseSelf.runUntilCurrent()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 994
  >     call.func(*call.args, **call.kw)
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 682
  >     taskObj._oneWorkUnit()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 528
  >     result = next(self._iterator)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 91
  >     work = (callable(elem, *args, **named) for elem in iterable)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 240
  >     yield next(it)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29
  >     return (r for r in result or () if self._filter(r, spider))
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 336
  >     return (self._set_referer(r, response) for r in result or ())
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 28
  >     return (r for r in result or () if self._filter(r, spider))
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 32
  >     return (r for r in result or () if self._filter(r, response, spider))
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/app/crawlers/crawlers/spiders/my_spider.py", line 45
  >     yield dict_to_item("MyItem", {
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
************************************************************************************************************************
#2: 93.45801544189453 MB
  >   File "scheduled.py", line 125
  >     reactor.run()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1318
  >     self.mainLoop()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1328
  >     reactorBaseSelf.runUntilCurrent()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 994
  >     call.func(*call.args, **call.kw)
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 682
  >     taskObj._oneWorkUnit()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 528
  >     result = next(self._iterator)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 91
  >     work = (callable(elem, *args, **named) for elem in iterable)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 240
  >     yield next(it)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29
  >     return (r for r in result or () if self._filter(r, spider))
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 336
  >     return (self._set_referer(r, response) for r in result or ())
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 28
  >     return (r for r in result or () if self._filter(r, spider))
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 32
  >     return (r for r in result or () if self._filter(r, response, spider))
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/app/crawlers/crawlers/spiders/my_spider.py", line 45
  >     yield dict_to_item("MyItem", {
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/item.py", line 29
  >     _class = super().__new__(mcs, 'x_' + class_name, new_bases, attrs)
  >   File "/usr/local/lib/python3.8/abc.py", line 85
  >     cls = super().__new__(mcls, name, bases, namespace, **kwargs)
************************************************************************************************************************
#3: 68.70204162597656 MB
  >   File "scheduled.py", line 125
  >     reactor.run()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1318
  >     self.mainLoop()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1328
  >     reactorBaseSelf.runUntilCurrent()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 994
  >     call.func(*call.args, **call.kw)
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 682
  >     taskObj._oneWorkUnit()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 528
  >     result = next(self._iterator)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 91
  >     work = (callable(elem, *args, **named) for elem in iterable)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 240
  >     yield next(it)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29
  >     return (r for r in result or () if self._filter(r, spider))
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 336
  >     return (self._set_referer(r, response) for r in result or ())
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 28
  >     return (r for r in result or () if self._filter(r, spider))
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 32
  >     return (r for r in result or () if self._filter(r, response, spider))
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/app/crawlers/crawlers/spiders/my_spider.py", line 45
  >     yield dict_to_item("MyItem", {
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/item.py", line 44
  >     return super().__new__(mcs, class_name, bases, new_attrs)
  >   File "/usr/local/lib/python3.8/abc.py", line 85
  >     cls = super().__new__(mcls, name, bases, namespace, **kwargs)

wRAR (Member) commented Aug 6, 2023

a middleware is caching all iterable objects

Sorry, which middleware? Can you point to code that does this as it's not clear from the output you provided?

middlewares should ideally only affect Requests, right?

No.

Specifically, spider middlewares receive the callback results in process_spider_output().
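
For illustration, a spider middleware's process_spider_output() iterates over everything the callback yields, items included (a minimal, do-nothing example; the class name is made up):

class ExampleSpiderMiddleware:
    def process_spider_output(self, response, result, spider):
        # "result" is the iterable of requests and items yielded by the callback
        for obj in result:
            yield obj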

BillYu811 (Author) commented:

a middleware is caching all iterable objects

Sorry, which middleware? Can you point to code that does this as it's not clear from the output you provided?

middlewares should ideally only affect Requests, right?

No.

Specifically, spider middlewares receive the callback results in process_spider_output().

I apologize for not being rigorous enough. At this stage, I cannot be sure whether a middleware cached something or Twisted did it, although I suspect the middleware is the cause.

So my next step is to disable the 4 spider middlewares that appear in the call stack.

Can I disable the SPIDER MIDDLEWARES like this?

SPIDER_MIDDLEWARES_BASE = {
    "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 50,
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    "scrapy.spidermiddlewares.referer.RefererMiddleware": None,
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": None,
    "scrapy.spidermiddlewares.depth.DepthMiddleware": None,
}

wRAR (Member) commented Aug 6, 2023

Yes.

BillYu811 (Author) commented Aug 9, 2023

Yes.

I turned off the middlewares, but the leak still occurred. It seems likely that Twisted or the Scrapy core might be doing something. Here is the output for reference:

Top 10 memory usage:
#1: 161.06756114959717 MB
  >   File "scheduled.py", line 125
  >     reactor.run()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1318
  >     self.mainLoop()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1328
  >     reactorBaseSelf.runUntilCurrent()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 994
  >     call.func(*call.args, **call.kw)
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 680
  >     taskObj._oneWorkUnit()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 526
  >     result = next(self._iterator)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 91
  >     work = (callable(elem, *args, **named) for elem in iterable)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 240
  >     yield next(it)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/app/crawlers/crawlers/spiders/my_spider.py", line 45
  >     yield dict_to_item("MyItem", {
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
************************************************************************************************************************
#2: 123.46784210205078 MB
  >   File "scheduled.py", line 125
  >     reactor.run()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1318
  >     self.mainLoop()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1328
  >     reactorBaseSelf.runUntilCurrent()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 994
  >     call.func(*call.args, **call.kw)
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 680
  >     taskObj._oneWorkUnit()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 526
  >     result = next(self._iterator)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 91
  >     work = (callable(elem, *args, **named) for elem in iterable)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 240
  >     yield next(it)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/app/crawlers/crawlers/spiders/my_spider.py", line 45
  >     yield dict_to_item("MyItem", {
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/item.py", line 29
  >     _class = super().__new__(mcs, 'x_' + class_name, new_bases, attrs)
  >   File "/usr/local/lib/python3.8/abc.py", line 85
  >     cls = super().__new__(mcls, name, bases, namespace, **kwargs)
************************************************************************************************************************
#3: 91.13088989257812 MB
  >   File "scheduled.py", line 125
  >     reactor.run()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1318
  >     self.mainLoop()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 1328
  >     reactorBaseSelf.runUntilCurrent()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/base.py", line 994
  >     call.func(*call.args, **call.kw)
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 680
  >     taskObj._oneWorkUnit()
  >   File "/usr/local/lib/python3.8/site-packages/twisted/internet/task.py", line 526
  >     result = next(self._iterator)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 91
  >     work = (callable(elem, *args, **named) for elem in iterable)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/defer.py", line 240
  >     yield next(it)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/utils/python.py", line 338
  >     return next(self.data)
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 79
  >     for r in iterable:
  >   File "/app/crawlers/crawlers/spiders/my_spider.py", line 45
  >     yield dict_to_item("MyItem", {
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
  >   File "/usr/local/lib/python3.8/site-packages/scrapy/item.py", line 44
  >     return super().__new__(mcs, class_name, bases, new_attrs)
  >   File "/usr/local/lib/python3.8/abc.py", line 85
  >     cls = super().__new__(mcls, name, bases, namespace, **kwargs)

wRAR (Member) commented Aug 9, 2023

I don't know how to interpret this output, but it looks like it just shows where the objects were created?

GeorgeA92 (Contributor) commented:

@BillYu811
These lines from the traceback clearly indicate that this leak is caused by your custom code (not Scrapy code; I don't see any kind of bug or unexpected behaviour on the Scrapy side):

  >   File "/app/crawlers/crawlers/spiders/my_spider.py", line 45
  >     yield dict_to_item("MyItem", {
  >   File "/app/crawlers/crawlers/items.py", line 11
  >     item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})

In the general case, we expect one or several scrapy.Item classes with a predefined (fixed) set of scrapy.Field attributes to exist in a Scrapy project.

I don't remember anything like a dynamic item (in the sense of multiple scrapy.Item classes generated with dynamically defined scrapy.Field names at runtime) being mentioned in the documentation.

I usually use yield dict_to_item("ItemClass", {**data}) (implemented in the code below) to convert a data dict into an item. In the pipeline, data is inserted into different Mongo collections depending on item.__class__.__name__.

I don't think that you need to return items as scrapy.Item class objects. Did you try to return just dict objects as items (without conversion to scrapy.Item)?
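
For illustration, a minimal sketch of that suggestion (the "collection" key, field names and URL are made up, not existing project code):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Yield a plain dict; a pipeline can route on an explicit key
        # instead of on the item class name.
        yield {"collection": "apple_pies", "title": "...", "price": 9.99}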

def dict_to_item(class_name: str, dictionary: dict):
    cache = get_project_settings().get("DICT_TO_ITEM_CACHE", None)  # <-- may be shared between multiple spider runs
    if cache:
        if cache.get(class_name,None):
            item_cls = cache[class_name]
        else:
            item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
            cache[class_name] = item_cls
        return item_cls().update(dictionary)
    return type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})().update(dictionary)

Inside dict_to_item, your application dynamically increases the size of the custom DICT_TO_ITEM_CACHE setting.

And running multiple spiders in a single process using CrawlerRunner means that each newly scheduled spider will use the same instance of the DICT_TO_ITEM_CACHE setting (it will be bigger for each newly scheduled spider, as it may contain cached values from previous runs). I'd recommend periodically logging/tracking the size of the DICT_TO_ITEM_CACHE setting during runtime.
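
For example, a sketch of such tracking, reusing the LoopingCall already present in schedule.py (names are illustrative):

import logging

from scrapy.utils.project import get_project_settings
from twisted.internet.task import LoopingCall


def log_cache_size():
    cache = get_project_settings().get("DICT_TO_ITEM_CACHE") or {}
    logging.warning("DICT_TO_ITEM_CACHE holds %d item classes", len(cache))


LoopingCall(log_cache_size).start(60 * 30)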

BillYu811 (Author) commented:

@GeorgeA92 @wRAR

Thank you very much for your response. It seems that yielding dicts from the spider might indeed help avoid this issue. Refactoring the spiders to return dicts is an option I'm considering. However, I believe the possibility of a leak still remains.

I don't remember anything like a dynamic item (in the sense of multiple scrapy.Item classes generated with dynamically defined scrapy.Field names at runtime) being mentioned in the documentation.

Yes, you're right. The documentation doesn't mention anything about dynamic items. The reason I attempted to use dynamic items is as mentioned before: for some simple crawlers I can use explicitly defined items, but for more complex ones I can't determine all field names in advance (I need to define the item dynamically). I wanted only one standardized pipeline to handle all data, hence I explored this approach based on input from ChatGPT.

Inside dict_to_item, your application dynamically increases the size of the custom DICT_TO_ITEM_CACHE setting.
And running multiple spiders in a single process using CrawlerRunner means that each newly scheduled spider will use the same instance of the DICT_TO_ITEM_CACHE setting (it will be bigger for each newly scheduled spider, as it may contain cached values from previous runs). I'd recommend periodically logging/tracking the size of the DICT_TO_ITEM_CACHE setting during runtime.

The first time, I encountered the leak when using dict_to_item as:

def dict_to_item(class_name: str, dictionary: dict):
    item_cls = type(class_name, (scrapy.Item,), {k: scrapy.Field() for k in dictionary.keys()})
    return item_cls(**dictionary)

I thought the leak might be due to the repeated definition of item_cls, so I implemented dict_to_item with the caching (as you mentioned). However, the leak still persisted, leading me to believe that it might not be the cache (or the duplicate field definitions) causing the issue.

These lines from the traceback clearly indicate that this leak is caused by your custom code (not Scrapy code; I don't see any kind of bug or unexpected behaviour on the Scrapy side)

As I mentioned regarding the spider lifecycle, I believe that when a spider's lifecycle ends, the items it uses should be released, and consequently the fields should be released as well. Therefore, I think dynamically defined items and fields should be destroyed at the end of a task unit; that should make sense.
