Splash memory leak #312

Open
Ethan353 opened this issue Jan 13, 2024 · 0 comments
I use scrapy-splash for requests in my crawling service. After some time, my services' RAM usage increases continuously, and eventually they consume all the RAM of the VM. The weird thing is that the Splash service itself works properly, but the services that use Splash for requests have a memory leak. For more detail, here is my code snippet and the Splash config I use:
code:

if condition_to_use_splash:
    return SplashRequest(url, errback=self.errback, callback=self.parse, meta=metadata, args={'wait': 7})
else:
    return FormRequest(url, dont_filter=True, errback=self.errback, method=method, formdata=parameter, meta=metadata)

config:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'solaris_scrapy.solaris_scrapy.middlewares.ProxyMiddleware': 100,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
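
Since the leak shows up in the crawling services rather than in Splash, one diagnostic step (a sketch, not part of the original config; the megabyte values are example assumptions) is Scrapy's built-in memory-usage extension, which logs RSS and can close the spider cleanly before the VM is exhausted:

```python
# Sketch: Scrapy MemoryUsage extension settings (Unix-only; values are
# examples). Add these to the same settings module as the middleware config.
MEMUSAGE_ENABLED = True      # enable the MemoryUsage extension
MEMUSAGE_WARNING_MB = 1024   # log a warning once RSS passes ~1 GiB
MEMUSAGE_LIMIT_MB = 2048     # close the spider once RSS passes ~2 GiB
```

The warning threshold helps correlate the growth with spider activity (e.g. whether it tracks the number of SplashRequests issued) before deciding on a hard limit.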

I use Splash 3.1 as the Splash image, and this is my Splash service's docker-compose file:

services:
  splash:
    image: scrapinghub/splash:3.1
    ports:
      - "port:port"
    networks:
      - net

Note that I run my code in a Docker container on a VM.
What do you think I should do about this? I'm also aware of the memory limit, maxrss, and slots options for preventing Splash from using lots of RAM, but that approach causes my crawling service to miss a bunch of websites. How should I handle it in my code?
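
For context, one common pattern that bounds Splash without permanently losing sites (a sketch only, assuming Splash's standard `--maxrss` CLI flag; the 8050 port mapping and 3000 MB cap are example assumptions, not from the issue) is to let Splash exit at a memory cap and have Docker restart it, so requests fail only briefly instead of the whole VM running dry:

```yaml
services:
  splash:
    image: scrapinghub/splash:3.1
    # Exit once resident memory passes ~3000 MB; Docker then restarts
    # the container automatically.
    command: --maxrss 3000
    restart: unless-stopped
    ports:
      - "8050:8050"   # example mapping; substitute the real ports
    networks:
      - net
```

Requests that fail during a restart can then be retried on the Scrapy side (in the config above, `scrapy_fake_useragent.middleware.RetryUserAgentMiddleware` stands in for the disabled stock RetryMiddleware) rather than being dropped.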
