
Provide more help for spidering/crawling #193

Open
jeroenjanssens opened this issue Apr 20, 2017 · 7 comments
Labels
feature (a feature request or enhancement)

Comments

@jeroenjanssens

Oftentimes, content is paginated into multiple HTML documents. When the number of documents is known and their URLs can be generated beforehand, selecting the desired nodes is a matter of combining xml2::read_html() and rvest::html_nodes() with, say, purrr::map().
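
For the known-URL case, a minimal sketch might look like the following (the URL template and selector are hypothetical):

library(purrr)
library(rvest)
library(xml2)

# Generate the page URLs up front, then map over them
urls <- paste0("https://example.com/stories?page=", 1:5)

urls %>%
  map(read_html) %>%                       # download and parse each page
  map(html_nodes, css = ".storylink") %>%  # select the desired nodes per page
  map(html_text) %>%                       # extract the text of each node
  flatten_chr()                            # combine into one character vector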

When the number of HTML documents is unknown, or when their URLs cannot be generated beforehand, we need a different approach. One approach is to "click" the More button using rvest::follow_link() and recursion. I recently implemented this approach as follows:

library(rvest)

html_more_nodes <- function(session, css, more_css) {
  xml2:::xml_nodeset(c(
    # Nodes matched on the current page
    html_nodes(session, css),
    # Recurse by following the "more" link; stop when it no longer exists
    tryCatch({
      html_more_nodes(follow_link(session, css = more_css),
                      css, more_css)
    }, error = function(e) NULL)
  ))
}

# Follow "More" link to get all stories on Hacker News
html_session("https://news.ycombinator.com") %>%
  html_more_nodes(".storylink", ".morelink") %>%
  html_text()

I asked @hadley whether it makes sense to make this functionality part of rvest (see https://twitter.com/jeroenhjanssens/status/854390989942919170). I think at least the following questions need answering:

  • Is this functionality common enough?
  • Is the above code the right approach?
  • Do we need to extend the above code such that a maximum number of nodes or documents can be specified?

I'd be happy to draft up a PR, but first I'm curious to hear your thoughts. Many thanks.

@hadley added the feature label on Mar 17, 2019
@hadley
Member

hadley commented Dec 15, 2020

I think it might make sense to start with a function that returns a list of pages, rather than flattening into a single list of nodes. Would definitely want to supply maximum number of pages.
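
As a rough illustration of that shape, here is a minimal sketch built on the existing html_session()/follow_link() API; follow_pages() is a hypothetical name, not something rvest provides:

library(rvest)

# Hypothetical sketch: repeatedly follow a "more" link, returning a list of
# pages and stopping at max_pages or when the link no longer exists
follow_pages <- function(session, more_css, max_pages = 10) {
  pages <- list(session)
  while (length(pages) < max_pages) {
    session <- tryCatch(
      follow_link(session, css = more_css),
      error = function(e) NULL
    )
    if (is.null(session)) break
    pages <- c(pages, list(session))
  }
  pages
}

pages <- html_session("https://news.ycombinator.com") %>%
  follow_pages(".morelink", max_pages = 3)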

@hadley
Member

hadley commented Dec 20, 2020

Seems like maybe it should be a function that works with an html_session(). Maybe session_spider()? What happens if there are multiple matches to the link? Should it be a breadth first or depth first search?

Or would it be better to provide a new html_pages() class that you could apply html_nodes() to and have it automatically flatten the output?

@hadley
Member

hadley commented Jan 16, 2021

In general, this issue is really about "crawling", i.e. making a queue of URLs and systematically visiting them, parsing data, and adding more links to the queue. In an ideal world this would be done as asynchronously as possible, so that one web page could be downloading while another is being parsed (while still incorporating rate limiting to avoid hammering a single site).

See https://github.com/salimk/Rcrawler for related work.
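
A rough sketch of that queue-based shape, kept synchronous and with only crude rate limiting (crawl() and its arguments are hypothetical; the asynchronous piece is left out):

library(rvest)
library(xml2)

# Hypothetical sketch: breadth-first crawl driven by a URL queue
crawl <- function(start_url, link_css, max_pages = 20, delay = 1) {
  queue <- start_url
  seen  <- character()
  pages <- list()
  while (length(queue) > 0 && length(pages) < max_pages) {
    url   <- queue[1]
    queue <- queue[-1]
    if (url %in% seen) next
    seen <- c(seen, url)
    Sys.sleep(delay)  # crude rate limiting
    page <- tryCatch(read_html(url), error = function(e) NULL)
    if (is.null(page)) next
    pages[[url]] <- page
    # Add newly discovered links to the back of the queue (breadth first)
    links <- html_attr(html_nodes(page, link_css), "href")
    queue <- c(queue, url_absolute(links[!is.na(links)], url))
  }
  pages
}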

@jeroenjanssens
Author

> I think it might make sense to start with a function that returns a list of pages, rather than flattening into a single list of nodes. Would definitely want to supply maximum number of pages.

That indeed makes sense, because you could always pass the list of pages to html_nodes() if you want to.
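
For example (a sketch only, assuming pages is the list returned by such a function and that .storylink exists on those pages):

library(purrr)
library(rvest)
library(tibble)

# Flatten the list of pages into one data frame in user code
map_dfr(pages, function(page) {
  tibble(
    title = html_text(html_nodes(page, ".storylink")),
    url   = html_attr(html_nodes(page, ".storylink"), "href")
  )
})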

> Maybe session_spider()?

Or perhaps session_crawl() so that it's a verb?

> What happens if there are multiple matches to the link?

I think all matching links should be visited by default. Depending on the HTML, there might be a CSS selector or XPath expression that matches only one link. Of course this is not always the case, so I'm wondering whether it makes sense to allow the css/xpath argument to be a function that returns a node or a list of one or more nodes? Something like:

session_crawl(s, ~ html_nodes(.x, "a")[1]) %>% html_node("title")

> Should it be a breadth first or depth first search?

My gut feeling tells me that most often you'd want to do a breadth-first search, because the length of a page is usually shorter than the potential number of links the crawler may wander off to. I guess it really depends on the situation, so concrete examples might be useful here. Or would it make sense to let the user specify this?
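
For illustration, in a queue-based sketch like the one above, the difference mostly comes down to which end of the queue the next URL is taken from:

# Breadth-first: take the next URL from the front of the queue
url   <- queue[1]
queue <- queue[-1]

# Depth-first: take from the back, i.e. treat the queue as a stack
url   <- queue[length(queue)]
queue <- queue[-length(queue)]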

@hadley
Member

hadley commented Feb 27, 2024

Another place to look for API inspiration is scrapy.

@hadley changed the title from "Select nodes from paginated content" to "Provide more help for spidering/crawling" on Feb 27, 2024