
Changes to reduce RAM usage #51

Open

wants to merge 5 commits into base: next
Conversation

wordtracker

Hi,

We found that with large, responsive sites (e.g. Wikipedia), the page_queue can grow quickly and continuously until the process runs out of RAM.

I believe this is because the thread that processes crawled pages does not get adequate time to run when crawling a responsive site -- there is little idle time spent waiting for HTTP responses. We could use the 'sleep' option, but that unnecessarily slows crawling on smaller or less responsive sites.

Changing the PageStore option has no effect here, as the page_queue does not live in the PageStore.

Secondly, we run multiple concurrent crawls, so we can't use any of the PageStore alternatives, which assume one crawl at a time. I therefore added an option to skip retaining the processed pages, since that was using RAM for a feature (after_crawl) we don't currently need.
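A rough sketch of the idea behind that opt-out (the names here, such as `retain_pages`, are hypothetical stand-ins, not the PR's actual API): per-page callbacks still fire, but the page is dropped afterwards instead of being kept in memory for after_crawl.

```ruby
# Hypothetical sketch of "process, then discard" vs. "process, then retain".
# `retain_pages` stands in for the opt-out option this PR adds; the real
# option name in the patch may differ.
class PageProcessor
  attr_reader :retained

  def initialize(retain_pages: true, &on_page)
    @retain_pages = retain_pages
    @on_page = on_page
    @retained = [] # what after_crawl would later iterate over
  end

  def process(page)
    @on_page.call(page)                # user callbacks still run
    @retained << page if @retain_pages # skipping this saves RAM
  end
end

seen = []
processor = PageProcessor.new(retain_pages: false) { |p| seen << p }
1000.times { |n| processor.process("page-#{n}") }

seen.size          # => 1000 -- every page was still processed
processor.retained # => []   -- but nothing is held for after_crawl
```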

I'm happy to split the changes up if that would improve their chances of acceptance. I appreciate that the way I implemented the second change is not ideal.

Thanks,
Jamie

chriskite and others added 5 commits January 19, 2012 22:04
This alleviates a problem we experienced with very responsive sites (e.g. Wikipedia)
and a moderate per-page processing time: the page_queue would grow much faster
than it could be drained, using more and more RAM.

This change means that the queue grows until full, at which point the
Tentacles will block until the queue shrinks.

This can slow the crawl in some cases.
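The blocking behaviour the commit describes can be sketched with Ruby's standard-library SizedQueue. This is an illustration of the technique, not Anemone's actual code; the capacity and variable names are made up for the example.

```ruby
# Illustrative capacity; the real limit used by the patch may differ.
MAX_QUEUED_PAGES = 100

# A SizedQueue blocks producers on #push once it holds MAX_QUEUED_PAGES
# items, so fast fetcher threads ("Tentacles") can no longer outrun the
# single page-processing thread and exhaust RAM.
page_queue = SizedQueue.new(MAX_QUEUED_PAGES)

# Four producers stand in for the Tentacles, pushing 1000 pages total.
producers = 4.times.map do |i|
  Thread.new do
    250.times do |n|
      page_queue.push("page-#{i}-#{n}") # blocks while the queue is full
    end
  end
end

# One consumer stands in for the page-processing thread.
processed = 0
consumer = Thread.new do
  while processed < 1000
    page_queue.pop # simulate per-page processing
    processed += 1
  end
end

producers.each(&:join)
consumer.join
processed # => 1000
```

The trade-off noted above follows directly: when the queue is full, producers sit blocked in `push` instead of fetching, so the crawl slows to the consumer's pace rather than growing the queue without bound.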
@leehambley

Any movement here?

I haven't fully diagnosed it yet, but on a site with 28,000,000 pages indexed by Google, I'm expecting a memory problem, having watched a 20-thread process grow from nominal memory usage at the start of the run to more than 1 GB in 1.5 hours (having crawled 626,419 pages, according to `echo 'KEYS anemone:pages:*' | redis-cli | wc -l`).

I'm thinking about trying this patch to rein in memory usage, as I also don't need the after_crawl feature. I would have expected Anemone to store that page list in Redis and retrieve it after the crawl via the backend store, rather than persisting it in memory.
