
Give focus_crawl a chance to access page body before discarding it #83

Open · wants to merge 1 commit into base: next
Conversation

@lankz commented Jan 27, 2014

For site-specific crawlers, it's fair enough to use focus_crawl like this:

anemone.focus_crawl do |page|
  if page.doc
    page.doc.search('.//a[@href]').map { |a| URI.parse(a[:href]) }
  else
    page.links
  end
end

However, when using the discard_page_bodies option, page.doc is nil by the time we enter this block. In this pull request I've moved the call to discard_doc! until after focus_crawl has been called.

@tmaier commented May 8, 2014

+1

@brutuscat (Contributor)
@lankz I don't quite follow... I'm happy to accept this PR in the Medusa fork if you could please explain the use case a bit better and re-post it there :)

@lankz (Author) commented Dec 15, 2014

@brutuscat I stopped using Anemone a while ago, and can't seem to access the original documentation — but I believe the suggested use case for #focus_crawl is something like this:

anemone.focus_crawl do |page|
  page.links \
      .select { |uri| uri.to_s =~ /productId=\d+/ }
end

which works just fine for simple crawls of well-structured sites. I needed to crawl a few large, messy sites, and the only way I could find to keep Anemone under control (crawl only the pages I was interested in, and keep it from blowing memory) was to focus only on links that appear under certain elements on the page, using XPath and CSS selectors:

anemone.focus_crawl do |page|
  if page.doc
    # crawl only links found in the primary navigation bar
    page.doc \
        .search('.//nav/a[@href]') \
        .map { |a| URI.parse(a[:href]) }
  else
    # sometimes +page.doc+ is nil, like when we get a redirect
    page.links
  end
end

The problem I ran into is that, when using the discard_page_bodies option, the page.doc object has already been discarded by the time the #focus_crawl block is called.

The change in this pull request is simple — delay the call to discard the page body (discard_doc!) until after we've both extracted all the links (default Anemone functionality) and given #focus_crawl a chance to run.
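The reordering can be sketched like this. This is a simplified stand-in, not Anemone's actual source: the CrawlStep class and its links_to_follow method are hypothetical, standing in for the relevant part of Anemone's crawl loop, and Page here is a bare-bones version of Anemone's Page.

```ruby
# Minimal Page stand-in: holds a parsed document until discarded.
class Page
  attr_reader :url, :doc

  def initialize(url, doc)
    @url = url
    @doc = doc
  end

  # Mirrors Anemone's Page#discard_doc!: frees the parsed body to save memory.
  def discard_doc!
    @doc = nil
  end

  def links
    [] # placeholder: Anemone extracts these from the body during processing
  end
end

# Hypothetical stand-in for the part of the crawl loop that decides
# which links to follow from a page.
class CrawlStep
  def initialize(discard_page_bodies:, focus_crawl: nil)
    @discard_page_bodies = discard_page_bodies
    @focus_crawl = focus_crawl
  end

  # Before the patch: the body was discarded before the focus_crawl block
  # ran, so page.doc was already nil inside the block.
  # After the patch: the block runs first, then the body is discarded.
  def links_to_follow(page)
    links = @focus_crawl ? @focus_crawl.call(page) : page.links
    page.discard_doc! if @discard_page_bodies
    links
  end
end
```

With this ordering, a focus_crawl block can still inspect page.doc even when discard_page_bodies is enabled, and the memory saving is preserved because the body is dropped immediately afterwards.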
