
Give focus_crawl a chance to access page body before discarding it #83

Open · wants to merge 1 commit into base: next
Conversation

@lankz commented Jan 27, 2014

For site-specific crawlers, it's fair enough to use focus_crawl like this:

anemone.focus_crawl do |page|
  if page.doc
    page.doc.search('.//a[@href]').map { |a| URI.parse(a[:href]) }
  else
    page.links
  end
end

However, when using the discard_page_bodies option, page.doc is nil by the time we enter this block. In this pull request I've moved the call to discard_doc! until after focus_crawl has been called.

@tmaier commented May 8, 2014

+1

@brutuscat (Contributor)
@lankz I don't quite follow... I'm happy to accept this PR in the Medusa fork if you could please explain the use case a bit better and re-post it there :)

@lankz (Author) commented Dec 15, 2014

@brutuscat I stopped using Anemone a while ago, and can't seem to access the original documentation — but I believe the suggested use case for #focus_crawl is something like this:

anemone.focus_crawl do |page|
  page.links \
      .select { |uri| uri.to_s =~ /productId=\d+/ }
end

which works just fine for simple crawls of well-structured sites. I needed to crawl a few large, messy sites, and the only way I could find to keep Anemone under control (crawl only the pages I was interested in, and keep it from blowing memory) was to focus only on links that appear under certain elements on the page, using XPath and CSS selectors:

anemone.focus_crawl do |page|
  if page.doc
    # crawl only links found in the primary navigation bar
    page.doc \
        .search('.//nav/a[@href]') \
        .map { |a| URI.parse(a[:href]) }
  else
    # sometimes +page.doc+ is nil, like when we get a redirect
    page.links
  end
end

The problem I ran into is that, when using the discard_page_bodies option, the page.doc object has already been discarded by the time the #focus_crawl block is called.

The change in this pull request is simple — delay the call to discard the page body (discard_doc!) until after we've both extracted all the links (default Anemone functionality) and given #focus_crawl a chance to run.
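The reordering can be sketched like this. This is a simplified stand-in, not Anemone's actual source: the CrawlStep class and its links_to_follow method are hypothetical, standing in for the relevant part of Anemone's crawl loop, and Page here is a bare-bones version of Anemone's Page.

```ruby
# Minimal Page stand-in: holds a parsed document until discarded.
class Page
  attr_reader :url, :doc

  def initialize(url, doc)
    @url = url
    @doc = doc
  end

  # Mirrors Anemone's Page#discard_doc!: frees the parsed body to save memory.
  def discard_doc!
    @doc = nil
  end

  def links
    [] # placeholder: Anemone extracts these from the body during processing
  end
end

# Hypothetical stand-in for the part of the crawl loop that decides
# which links to follow from a page.
class CrawlStep
  def initialize(discard_page_bodies:, focus_crawl: nil)
    @discard_page_bodies = discard_page_bodies
    @focus_crawl = focus_crawl
  end

  # Before the patch: the body was discarded before the focus_crawl block
  # ran, so page.doc was already nil inside the block.
  # After the patch: the block runs first, then the body is discarded.
  def links_to_follow(page)
    links = @focus_crawl ? @focus_crawl.call(page) : page.links
    page.discard_doc! if @discard_page_bodies
    links
  end
end
```

With this ordering, a focus_crawl block can still inspect page.doc even when discard_page_bodies is enabled, and the memory saving is preserved because the body is dropped immediately afterwards.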
