
Changes to reduce RAM usage #51

Open

wants to merge 5 commits into base: next
Conversation

wordtracker

Hi,

We found that with large, responsive sites (e.g. Wikipedia), the page_queue can grow quickly and continuously until the process runs out of RAM.

I believe this is because the thread that processes crawled pages does not get adequate time to run when crawling a responsive site -- there is little idle time spent waiting for HTTP responses. We could use the 'sleep' option, but that unnecessarily slows crawling on smaller or less responsive sites.

Changing the PageStore option has no effect here, as the page_queue does not live in the PageStore.

Secondly, we run multiple concurrent crawls, so we can't use any of the PageStore alternatives, which assume one crawl at a time. I therefore added an option to skip retaining the processed pages, since that was using RAM for a feature (after_crawl) we don't currently need.
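A rough sketch of the idea behind that opt-out (the names here, such as `retain_pages`, are hypothetical stand-ins, not the PR's actual API): per-page callbacks still fire, but the page is dropped afterwards instead of being kept in memory for after_crawl.

```ruby
# Hypothetical sketch of "process, then discard" vs. "process, then retain".
# `retain_pages` stands in for the opt-out option this PR adds; the real
# option name in the patch may differ.
class PageProcessor
  attr_reader :retained

  def initialize(retain_pages: true, &on_page)
    @retain_pages = retain_pages
    @on_page = on_page
    @retained = [] # what after_crawl would later iterate over
  end

  def process(page)
    @on_page.call(page)                # user callbacks still run
    @retained << page if @retain_pages # skipping this saves RAM
  end
end

seen = []
processor = PageProcessor.new(retain_pages: false) { |p| seen << p }
1000.times { |n| processor.process("page-#{n}") }

seen.size          # => 1000 -- every page was still processed
processor.retained # => []   -- but nothing is held for after_crawl
```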

I'm happy to split the changes up if that would improve their chances of acceptance. I appreciate that the way I implemented the second change is not ideal.

Thanks,
Jamie

chriskite and others added 5 commits January 19, 2012 22:04
This alleviates a problem we experienced with very responsive sites (e.g. Wikipedia)
and a moderate per-page processing time: the page_queue would grow much faster
than it could be drained, using more and more RAM.

This change means that the queue grows until full, at which point the
Tentacles will block until the queue shrinks.

This can slow the crawl in some cases.
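The blocking behaviour the commit describes can be sketched with Ruby's standard-library SizedQueue. This is an illustration of the technique, not Anemone's actual code; the capacity and variable names are made up for the example.

```ruby
# Illustrative capacity; the real limit used by the patch may differ.
MAX_QUEUED_PAGES = 100

# A SizedQueue blocks producers on #push once it holds MAX_QUEUED_PAGES
# items, so fast fetcher threads ("Tentacles") can no longer outrun the
# single page-processing thread and exhaust RAM.
page_queue = SizedQueue.new(MAX_QUEUED_PAGES)

# Four producers stand in for the Tentacles, pushing 1000 pages total.
producers = 4.times.map do |i|
  Thread.new do
    250.times do |n|
      page_queue.push("page-#{i}-#{n}") # blocks while the queue is full
    end
  end
end

# One consumer stands in for the page-processing thread.
processed = 0
consumer = Thread.new do
  while processed < 1000
    page_queue.pop # simulate per-page processing
    processed += 1
  end
end

producers.each(&:join)
consumer.join
processed # => 1000
```

The trade-off noted above follows directly: when the queue is full, producers sit blocked in `push` instead of fetching, so the crawl slows to the consumer's pace rather than growing the queue without bound.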
@leehambley

Any movement here?

I haven't fully diagnosed it yet, but on a site with 28,000,000 pages indexed by Google, I'm expecting a memory problem, having watched a 20-thread process grow from nominal memory usage at the start of the run to more than 1 GB in 1.5 hours (having crawled 626,419 pages, according to `echo 'KEYS anemone:pages:*' | redis-cli | wc -l`).

I'm thinking about trying this patch to rein in memory usage, as I also don't need the after_crawl feature. I would have expected Anemone to store that page list in Redis and retrieve it after the crawl via the backend store, rather than persisting it in memory.
