
Fixes OutOfMemory error for large sites #31

Open · wants to merge 4 commits into master
Conversation

@pokey909 pokey909 commented Sep 4, 2011

  • Added support for external queues via the :large_scale_crawl option (requires R/W permission for the working dir).
  • Improved thread handling. All threads now properly start working on the crawl.
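The PR does not show the ExtQueue implementation inline, but the idea it describes (a thread-safe queue that spills to disk once memory usage gets too high) can be sketched roughly as below. The class name `DiskBackedQueue`, the `max_in_memory` threshold, and the hex-encoded Marshal spill format are all illustrative assumptions, not Anemone's actual ExtQueue API:

```ruby
require 'tempfile'

# Sketch of a queue that keeps a bounded number of items in RAM and
# spills the rest to a temp file. Ordering is only approximate (hot
# items may be popped before older spilled ones), which is acceptable
# for a crawl frontier. NOT the real ExtQueue from this PR.
class DiskBackedQueue
  def initialize(max_in_memory = 10_000)
    @max_in_memory = max_in_memory
    @memory  = []                          # hot items stay in RAM
    @spill   = Tempfile.new('queue_spill') # overflow goes to disk
    @spilled = 0
    @mutex   = Mutex.new
  end

  def push(item)
    @mutex.synchronize do
      if @memory.size < @max_in_memory
        @memory << item
      else
        # Hex-encode the marshalled item so each entry fits on one line.
        @spill.puts(Marshal.dump(item).unpack1('H*'))
        @spilled += 1
      end
    end
  end

  def pop
    @mutex.synchronize do
      refill if @memory.empty? && @spilled > 0
      @memory.shift
    end
  end

  def size
    @mutex.synchronize { @memory.size + @spilled }
  end

  private

  # Read spilled items back into RAM once the in-memory buffer drains.
  def refill
    @spill.rewind
    @spill.each_line do |line|
      @memory << Marshal.load([line.chomp].pack('H*'))
    end
    @spilled = 0
    @spill.truncate(0)
    @spill.rewind
  end
end
```

With a small threshold you can watch the spill kick in: pushing five items into `DiskBackedQueue.new(2)` keeps two in memory and writes three to disk, yet `size` still reports five and all five come back out via `pop`.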

Alexander Lenhardt added 4 commits August 31, 2011 21:47
Occurs when crawling large sites.

Issue: link_queue grows faster than the worker threads consume links.

Fix: Wait until the threads have consumed enough links, then continue adding more to the queue.
- OutOfMemory caused by large link/page queues. Added a thread-safe ExtQueue class that swaps to disk when too much memory is consumed
- Improved threading. Most worker threads kept idling when launched simultaneously

Signed-off-by: Alexander Lenhardt <alenhard@techfak.uni-bielefeld.de>
External queue storage can be activated via new option :large_scale_crawl

Signed-off-by: Alexander Lenhardt <alenhard@techfak.uni-bielefeld.de>
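The "wait until the threads have consumed enough links" fix in the first commit is a form of producer backpressure. In stdlib Ruby the same effect can be had with `SizedQueue`, whose `push` blocks once the queue is full, so the producer can never outrun the consumers. A minimal sketch (the capacity of 100, the worker count, and the example URLs are arbitrary illustrations, not values from this PR):

```ruby
MAX_PENDING = 100
link_queue = SizedQueue.new(MAX_PENDING) # push blocks when full
processed  = Queue.new                   # thread-safe tally of handled links

consumers = 4.times.map do
  Thread.new do
    while (link = link_queue.pop)  # nil acts as the stop signal
      processed << link            # a real crawler would fetch/parse here
    end
  end
end

# Producer side: blocks automatically whenever MAX_PENDING links are
# queued, so the queue cannot grow faster than the workers drain it.
1_000.times { |i| link_queue.push("http://example.com/page#{i}") }

4.times { link_queue.push(nil) }   # one poison pill per worker
consumers.each(&:join)
```

This bounds peak memory to roughly `MAX_PENDING` queued links regardless of how large the site is, which is the same goal the PR's throttling achieves.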
@RonnieOnRails

Do you have a 0.7.1 version for this pull request?
