
crawlUrlfilter #55

Open
mustaszewski opened this issue Mar 25, 2019 · 0 comments

Thank you for developing this very useful package. However, I have a problem with the crawlUrlfilter argument.
From a large website, I would like to crawl and scrape only those URLs that match a specific pattern. According to the documentation, the crawlUrlfilter argument does exactly what I am looking for.

When the pattern passed to crawlUrlfilter covers only one level of the URL path, as in the following call,

Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/")

I get the desired results, i.e. only those URLs that match the pattern "/article/", e.g.

https://www.somewebsite.org/article/sample-article-217 or
https://www.somewebsite.org/article/2019-01-20-another-example

However, when I want to filter URLs based on a pattern that spans two levels of the URL path, such as:

https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers or
https://www.somewebsite.org/article/news/review-of-meetup

the following code does not find any matches:

Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/news")

Is this a bug, or am I getting something wrong?
Judging by the example given in the documentation, dataUrlfilter = "/[0-9]{4}/[0-9]{2}/[0-9]{2}/", it should be no problem at all to pass an argument that contains several "/" characters.
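In case it is relevant, here is a minimal sketch of the workaround I am experimenting with in the meantime. It assumes, based on my reading of the documentation, that crawlUrlfilter restricts which links the crawler follows while dataUrlfilter restricts which pages are actually collected; if that is correct, keeping crawlUrlfilter at one level should leave the deeper pages reachable, while a narrower dataUrlfilter limits collection to the two-level URLs:

library(Rcrawler)

# Follow everything under /article/ so that pages like
# /article/news/... remain reachable from the index pages,
# but only collect pages whose URL matches the two-level pattern.
Rcrawler(Website = "https://www.somewebsite.org/",
         crawlUrlfilter = "/article/",
         dataUrlfilter = "/article/news/")

This still does not explain why crawlUrlfilter = "/article/news" alone finds no matches, so I would appreciate clarification on the intended behaviour.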
