
Provide more help for spidering/crawling #193

Open
jeroenjanssens opened this issue Apr 20, 2017 · 7 comments
Labels
feature (a feature request or enhancement)

Comments

@jeroenjanssens

Oftentimes, content is paginated into multiple HTML documents. When the number of documents is known and their URLs can be generated beforehand, selecting the desired nodes is a matter of combining xml2::read_html() and rvest::html_nodes() with, say, purrr::map().
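
For the known-URL case, a minimal sketch might look like the following (the URL template and selector are hypothetical):

library(purrr)
library(rvest)
library(xml2)

# Generate the page URLs up front, then map over them
urls <- paste0("https://example.com/stories?page=", 1:5)

urls %>%
  map(read_html) %>%                       # download and parse each page
  map(html_nodes, css = ".storylink") %>%  # select the desired nodes per page
  map(html_text) %>%                       # extract the text of each node
  flatten_chr()                            # combine into one character vector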

When the number of HTML documents is unknown, or when their URLs cannot be generated beforehand, we need a different approach. One approach is to "click" the More button using rvest::follow_link() and recursion. I recently implemented this approach as follows:

library(rvest)

html_more_nodes <- function(session, css, more_css) {
  xml2:::xml_nodeset(c(
    # Nodes matched on the current page
    html_nodes(session, css),
    # Recurse by following the "more" link; stop when it no longer exists
    tryCatch({
      html_more_nodes(follow_link(session, css = more_css),
                      css, more_css)
    }, error = function(e) NULL)
  ))
}

# Follow "More" link to get all stories on Hacker News
html_session("https://news.ycombinator.com") %>%
  html_more_nodes(".storylink", ".morelink") %>%
  html_text()

I asked @hadley whether it makes sense to make this functionality part of rvest (see https://twitter.com/jeroenhjanssens/status/854390989942919170). I think at least the following questions need answering:

  • Is this functionality common enough?
  • Is the above code the right approach?
  • Do we need to extend the above code such that a maximum number of nodes or documents can be specified?

I'd be happy to draft up a PR, but first I'm curious to hear your thoughts. Many thanks.

@hadley added the feature label on Mar 17, 2019
@hadley
Member

hadley commented Dec 15, 2020

I think it might make sense to start with a function that returns a list of pages, rather than flattening into a single list of nodes. Would definitely want to supply maximum number of pages.
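
As a rough illustration of that shape, here is a minimal sketch built on the existing html_session()/follow_link() API; follow_pages() is a hypothetical name, not something rvest provides:

library(rvest)

# Hypothetical sketch: repeatedly follow a "more" link, returning a list of
# pages and stopping at max_pages or when the link no longer exists
follow_pages <- function(session, more_css, max_pages = 10) {
  pages <- list(session)
  while (length(pages) < max_pages) {
    session <- tryCatch(
      follow_link(session, css = more_css),
      error = function(e) NULL
    )
    if (is.null(session)) break
    pages <- c(pages, list(session))
  }
  pages
}

pages <- html_session("https://news.ycombinator.com") %>%
  follow_pages(".morelink", max_pages = 3)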

@hadley
Member

hadley commented Dec 20, 2020

Seems like maybe it should be a function that works with an html_session(). Maybe session_spider()? What happens if there are multiple matches to the link? Should it be a breadth first or depth first search?

Or would it be better to provide a new html_pages() class that you could apply html_nodes() to and have it automatically flatten the output?

@hadley
Member

hadley commented Jan 16, 2021

In general, this issue is really about "crawling", i.e. making a queue of URLs and systematically visiting them, parsing data, and adding more links to the queue. In an ideal world this would be done as asynchronously as possible, so that one web page could be downloading while another is being parsed (while still incorporating rate limiting to avoid hammering a single site).

See https://github.com/salimk/Rcrawler for related work.
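
A rough sketch of that queue-based shape, kept synchronous and with only crude rate limiting (crawl() and its arguments are hypothetical; the asynchronous piece is left out):

library(rvest)
library(xml2)

# Hypothetical sketch: breadth-first crawl driven by a URL queue
crawl <- function(start_url, link_css, max_pages = 20, delay = 1) {
  queue <- start_url
  seen  <- character()
  pages <- list()
  while (length(queue) > 0 && length(pages) < max_pages) {
    url   <- queue[1]
    queue <- queue[-1]
    if (url %in% seen) next
    seen <- c(seen, url)
    Sys.sleep(delay)  # crude rate limiting
    page <- tryCatch(read_html(url), error = function(e) NULL)
    if (is.null(page)) next
    pages[[url]] <- page
    # Add newly discovered links to the back of the queue (breadth first)
    links <- html_attr(html_nodes(page, link_css), "href")
    queue <- c(queue, url_absolute(links[!is.na(links)], url))
  }
  pages
}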

@jeroenjanssens
Author

> I think it might make sense to start with a function that returns a list of pages, rather than flattening into a single list of nodes. Would definitely want to supply maximum number of pages.

That indeed makes sense, because you could always pass the list of pages to html_nodes() if you want to.
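
For example (a sketch only, assuming pages is the list returned by such a function and that .storylink exists on those pages):

library(purrr)
library(rvest)
library(tibble)

# Flatten the list of pages into one data frame in user code
map_dfr(pages, function(page) {
  tibble(
    title = html_text(html_nodes(page, ".storylink")),
    url   = html_attr(html_nodes(page, ".storylink"), "href")
  )
})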

> Maybe session_spider()?

Or perhaps session_crawl() so that it's a verb?

> What happens if there are multiple matches to the link?

I think all matching links should be visited by default. Depending on the HTML, there might be a CSS selector or XPath expression that matches only one link. Of course this is not always the case, so I'm wondering whether it makes sense to allow the css/xpath argument to be a function that returns a node or a list of one or more nodes? Something like:

session_crawl(s, ~ html_nodes(.x, "a")[1]) %>% html_node("title")

> Should it be a breadth first or depth first search?

My gut feeling tells me that most often you'd want to do a breadth-first search, because the length of a page is usually shorter than the potential number of links the crawler may wander off to. I guess it really depends on the situation, so concrete examples might be useful here. Or would it make sense to let the user specify this?
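
For illustration, in a queue-based sketch like the one above, the difference mostly comes down to which end of the queue the next URL is taken from:

# Breadth-first: take the next URL from the front of the queue
url   <- queue[1]
queue <- queue[-1]

# Depth-first: take from the back, i.e. treat the queue as a stack
url   <- queue[length(queue)]
queue <- queue[-length(queue)]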

@hadley
Member

hadley commented Feb 27, 2024

Another place to look for API inspiration is scrapy.

@hadley changed the title from "Select nodes from paginated content" to "Provide more help for spidering/crawling" on Feb 27, 2024