
crawlUrlfilter #55

Open
mustaszewski opened this issue Mar 25, 2019 · 0 comments

Thank you for developing this very useful package. However, I have a problem with the crawlUrlfilter argument.
From a large website, I would like to crawl and scrape only those URLs that match a specific pattern. According to the documentation, the crawlUrlfilter argument does exactly what I am looking for.

When the pattern passed to crawlUrlfilter covers only one level of the URL path, as in the following call,

Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/")

I get the desired results, i.e. only those URLs that match the pattern "/article/", e.g.

https://www.somewebsite.org/article/sample-article-217 or
https://www.somewebsite.org/article/2019-01-20-another-example

However, when I want to filter URLs based on a pattern that spans two levels of the URL path, such as:

https://www.somewebsite.org/article/news/january-2019-meeting_of_trainers or
https://www.somewebsite.org/article/news/review-of-meetup

the following code does not find any matches:

Rcrawler(Website = "https://www.somewebsite.org/", crawlUrlfilter = "/article/news")

Is this a bug, or am I getting something wrong?
Judging by the example given in the documentation, dataUrlfilter = "/[0-9]{4}/[0-9]{2}/[0-9]{2}/", it should be no problem at all to pass an argument that contains several "/" characters.
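In case it is relevant, here is a minimal sketch of the workaround I am experimenting with in the meantime. It assumes, based on my reading of the documentation, that crawlUrlfilter restricts which links the crawler follows while dataUrlfilter restricts which pages are actually collected; if that is correct, keeping crawlUrlfilter at one level should leave the deeper pages reachable, while a narrower dataUrlfilter limits collection to the two-level URLs:

library(Rcrawler)

# Follow everything under /article/ so that pages like
# /article/news/... remain reachable from the index pages,
# but only collect pages whose URL matches the two-level pattern.
Rcrawler(Website = "https://www.somewebsite.org/",
         crawlUrlfilter = "/article/",
         dataUrlfilter = "/article/news/")

This still does not explain why crawlUrlfilter = "/article/news" alone finds no matches, so I would appreciate clarification on the intended behaviour.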
