
Crawlers and heuristics

This section explains the different crawlers and heuristics implemented in news-please.

Crawlers

news-please provides multiple crawlers; this part explains what each of them does and when to use it. The crawler can be set globally in newscrawler.cfg or per website in input_data.hjson.

Download crawler

  • Class name: "Download"

  • Functionality:
    All this crawler does is crawl the specified URLs. For this crawler, the input file (input_data.hjson) can contain a list of URLs:

    "url":["http://example.com/1", "http://example.com/2"]
    
  • Requirements:
    This spider should work on any given URL.

  • Use case:
    Only when an exact page is needed, mostly for testing purposes; a conceptual sketch of this crawler follows below.
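
Conceptually, this crawler does little more than fetch the listed URLs and hand each response to the extraction pipeline. The following stand-alone sketch (using the requests library, not news-please's actual spider code) illustrates the idea:

    import requests

    # URLs as they would appear in the "url" list of the input file
    urls = ["http://example.com/1", "http://example.com/2"]

    for url in urls:
        response = requests.get(url, timeout=30)
        if response.ok:
            # news-please would now hand the response to the article
            # extraction pipeline; here we only report the fetch.
            print(url, "fetched,", len(response.text), "characters of HTML")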

RSS crawler

  • Class name: "RssCrawler"

  • Functionality:
    This spider starts on the given URL, extracts the site's RSS feed, parses the feed, and crawls every link within it, meaning the crawler tests every link in the feed on whether it is an article and passes it to the pipeline if the result is positive (the feed-discovery step is sketched below).

  • Requirements:
    This spider should work on any given webpage that contains a valid href to a valid XML feed (RSS).

  • Reliability:
    The spider finds only those articles that are listed in the XML feed.

  • Use case:
    This spider is about as efficient as the sitemap crawler, but it usually crawls a much smaller XML feed that contains only the latest articles. Thus, this crawler should be used whenever it is important to keep an existing database up to date with the latest articles.

    Since only the latest articles are listed in the RSS feed, this crawler should be executed at a high frequency (e.g., daemonized).
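
The feed handling described above consists of two steps: discover the RSS feed advertised on the start page, then collect the article links the feed lists. The following stand-alone sketch (using lxml and feedparser, not the spider's actual code) illustrates both steps; the start URL is made up:

    import feedparser
    import requests
    from lxml import html
    from urllib.parse import urljoin

    start_url = "http://example.com/"  # hypothetical start page

    # 1. Discover the RSS feed advertised in the page's <head>.
    page = html.fromstring(requests.get(start_url, timeout=30).text)
    feed_href = page.xpath('//link[@type="application/rss+xml"]/@href')[0]
    feed_url = urljoin(start_url, feed_href)

    # 2. Parse the feed and collect the article candidates it lists.
    feed = feedparser.parse(feed_url)
    candidate_urls = [entry.link for entry in feed.entries]

    # Each candidate would then be tested by the heuristics before
    # being passed to the pipeline.
    print(candidate_urls)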

Sitemap crawler

  • Class name: "SitemapCrawler"

  • Functionality:
    This spider extracts the domain's sitemap from its robots.txt and crawls it, meaning the crawler tests every link in the sitemap on whether it is an article and passes it to the pipeline if the result is positive (the sitemap-discovery step is sketched below).

  • Requirements:
    This spider should work on any given webpage whose domain has a robots.txt that lists a valid link to a valid sitemap.

  • Reliability:
    The spider finds every article that is listed in the sitemap. There's no guarantee though that every published article is listed in the sitemap.

  • Use case:
    This spider is pretty fast for crawling an entire domain. Thus, it should be used whenever possible.
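
The sitemap handling described above boils down to reading the Sitemap: entries from robots.txt and collecting the <loc> URLs from the referenced sitemap. A stand-alone sketch of the idea (ignoring sitemap index files and compressed sitemaps, which a real crawler has to handle):

    import requests
    from lxml import etree

    domain = "http://example.com"  # hypothetical domain

    # 1. Find the sitemap URL(s) announced in robots.txt.
    robots_txt = requests.get(domain + "/robots.txt", timeout=30).text
    sitemap_urls = [line.split(":", 1)[1].strip()
                    for line in robots_txt.splitlines()
                    if line.lower().startswith("sitemap:")]

    # 2. Collect every <loc> entry from the first sitemap.
    sitemap_xml = requests.get(sitemap_urls[0], timeout=30).content
    tree = etree.fromstring(sitemap_xml)
    locs = tree.xpath("//*[local-name()='loc']/text()")

    # Each URL would then be tested on being an article.
    print(len(locs), "URLs found in the sitemap")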

Recursive crawler

  • Class name: "RecursiveCrawler"

  • Functionality:
    This spider starts at a given URL and then recursively crawls all hrefs that do not match the ignore_regex set in the input .hjson file or any of the ignore_file_extensions set in the .cfg file (this filtering step is sketched below).

    Finally, it tests each response on whether it is an article and passes it to the pipeline if the result is positive.

  • Requirements:
    This spider should work on any given webpage.

  • Reliability:
    The spider finds every article that can be accessed by following links from the given URL that do not point to off-domain pages. This spider obviously does not find articles that aren't linked anywhere on the domain (so it may miss some articles listed in the page's sitemap).

  • Use case:
    This spider takes a long time, since it crawls a lot of hrefs that point to invalid pages, off-domain pages, and already crawled pages. Thus, it should only be used when the SitemapCrawler fails.
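
The filtering described in the functionality section can be sketched as a simple breadth-first crawl. The regular expression and the extension list below stand in for ignore_regex and ignore_file_extensions and are made up; the sketch is an illustration, not the spider's actual code:

    import re
    from collections import deque
    from urllib.parse import urljoin, urlsplit

    import requests
    from lxml import html

    start_url = "http://example.com/"              # hypothetical start page
    ignore_regex = re.compile(r"/(tag|login)/")    # stands in for ignore_regex
    ignore_extensions = (".pdf", ".jpg", ".png")   # stands in for ignore_file_extensions
    domain = urlsplit(start_url).hostname

    seen, queue = {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        page = html.fromstring(requests.get(url, timeout=30).text)
        # Here the response would be tested on being an article.
        for href in page.xpath("//a/@href"):
            link = urljoin(url, href)
            if (urlsplit(link).hostname == domain        # stay on the domain
                    and not ignore_regex.search(link)
                    and not link.lower().endswith(ignore_extensions)
                    and link not in seen):
                seen.add(link)
                queue.append(link)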

Recursive sitemap crawler

  • Class name: "RecursiveSitemapCrawler"

  • Functionality:
    This spider extracts the domain's sitemap from its robots.txt and crawls it, then recursively crawls all hrefs that do not match the ignore_regex set in the input .hjson file or any of the ignore_file_extensions set in the .cfg file.

    Finally, it tests each response on whether it is an article and passes it to the pipeline if the result is positive.

  • Requirements:
    This spider should work on any given webpage whose domain has a robots.txt that lists a valid link to a valid sitemap.

  • Reliability:
    The spider finds every article that is listed in the sitemap, as well as every article that can be accessed from any of those pages by following links that do not point to off-domain pages.

    This crawler might find the most articles of all the crawlers.

  • Use case:
    This spider is about as slow as the recursive crawler, since it crawls a lot of hrefs that point to invalid pages, off-domain pages, and already crawled pages. It might find more articles than any other crawler, so it should only be used when completeness is the most important criterion.

Heuristics

Heuristics are used to detect whether a webpage contains an article and should be passed to the pipeline. It is possible to set the heuristics globally in newscrawler.cfg, but overriding these defaults in the input_data.hjson file for each target should produce better results.

og type

  • Heuristic name: "og_type"

  • Assumption:
    On every news site, og:type is set to article if the page is an article or something similar to an article.

  • Idea:
    A website must contain <meta property="og:type" content="article">.

  • Implementation:
    Return True if og:type is article, otherwise False (see the sketch below).

  • Outcome:
    In fact, every news website uses this "tagging", so this is a minimum requirement. The problem is that some websites also tag news category pages as single articles. For example, http://www.zeit.de/kultur/literatur/index is tagged as an article but is not one. These pages must still be filtered out.
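
The check itself is a one-liner once the HTML is parsed. A minimal sketch of the idea (not the heuristic's actual code):

    from lxml import html

    def og_type_is_article(raw_html: str) -> bool:
        """Return True if the page declares <meta property="og:type" content="article">."""
        tree = html.fromstring(raw_html)
        og_types = tree.xpath('//meta[@property="og:type"]/@content')
        return bool(og_types) and og_types[0].strip().lower() == "article"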

linked headlines

  • Heuristic name: "linked_headlines"

  • Assumption:
    If a site mostly contains linked headlines, it is just an aggregation of multiple articles and thus not a real article itself.

  • Idea:
    Check how many <h1>, <h2>, <h3>, <h4>, <h5> and <h6> are on a site and how many of them contain an <a href>.

  • Implementation:
    Return a ratio: linked headlines divided by all headlines (see the sketch below). A setting in newscrawler.cfg disables the heuristic if a site doesn't contain enough headlines.

  • Outcome:
    News-aggregation sites normally have a linked-headline ratio near 1, so they are successfully filtered out. Some sites will still remain. This heuristic still needs testing.
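
A stand-alone sketch of the ratio computation; the minimum-headline threshold mirrors the newscrawler.cfg setting mentioned above, but the value used here is made up:

    from lxml import html

    MIN_HEADLINES = 5  # hypothetical threshold; the real value comes from newscrawler.cfg

    def linked_headlines_ratio(raw_html: str):
        """Return the share of <h1>-<h6> elements that contain an <a href>, or None."""
        tree = html.fromstring(raw_html)
        headlines = tree.xpath("//h1 | //h2 | //h3 | //h4 | //h5 | //h6")
        if len(headlines) < MIN_HEADLINES:
            return None  # too few headlines, the heuristic is disabled
        linked = [h for h in headlines if h.xpath(".//a[@href]")]
        return len(linked) / len(headlines)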

self linked headlines

  • Heuristic name: "self_linked_headlines"

  • Assumption:
    Links to other sites in headlines are mostly editorial, so a page is only a news-aggregation site if its linked headlines mostly link to the site's own subpages.

  • Idea:
    Same as linked_headlines, but only count headlines that link to the same domain.

  • Implementation:
    Return a ratio: headlines linked to the same domain divided by all headlines (see the sketch below).

  • Outcome:
    Not tested.
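
The sketch below varies the previous one: a linked headline is only counted if its link resolves to the page's own domain. Again an illustration of the idea, not the heuristic's actual code:

    from urllib.parse import urljoin, urlsplit

    from lxml import html

    def self_linked_headlines_ratio(raw_html: str, page_url: str) -> float:
        """Return the share of <h1>-<h6> elements linking to the page's own domain."""
        tree = html.fromstring(raw_html)
        domain = urlsplit(page_url).hostname
        headlines = tree.xpath("//h1 | //h2 | //h3 | //h4 | //h5 | //h6")
        if not headlines:
            return 0.0
        self_linked = sum(
            1 for h in headlines
            if any(urlsplit(urljoin(page_url, href)).hostname == domain
                   for href in h.xpath(".//a/@href")))
        return self_linked / len(headlines)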

is not from subdomain

  • Heuristic name: "is_not_from_subdomain"

  • Assumption:
    Subdomains mostly host blogs or login pages. Blogs may still contain og:type=article, but are not real "articles".

  • Idea:
    Do not download pages from subdomains other than the starting domain.

  • Implementation:
    Return True if the URL does not come from a different subdomain than the starting domain (see the sketch below).

  • Outcome:
    If the site heavily uses subdomains, for example for categories, this heuristic will fail. It should therefore only be used on websites where one is sure that subdomains do not contain articles.
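
A simple sketch of the idea, comparing the hostname of a candidate URL with the hostname of the configured start URL; the real implementation may normalize domains differently (e.g. treat www. as equivalent):

    from urllib.parse import urlsplit

    def is_not_from_subdomain(candidate_url: str, start_url: str) -> bool:
        """Return True if the candidate URL is served from the same host as the start URL."""
        return urlsplit(candidate_url).hostname == urlsplit(start_url).hostname

    # A blog on a subdomain is rejected, an article on the start domain passes.
    print(is_not_from_subdomain("http://blog.example.com/post", "http://example.com/"))      # False
    print(is_not_from_subdomain("http://example.com/news/article-1", "http://example.com/")) # True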