
news-please user guide

This guide explains how to use and configure news-please in CLI mode (with full crawling and extraction). If you want to use news-please programmatically within your Python project, or if you want to extract articles from commoncrawl.org, please refer to the README.md.

Basic setup

news-please is a registered PyPI package and can be installed using pip. While news-please runs on both Python 2.7+ and 3.x, we recommend Python 3.5 and explain the setup for this version.

Windows systems

Users of Windows systems may experience problems installing news-please with pip due to missing requirements. Therefore we have to install the required packages manually:

  • lxml:

    1. Go to Christoph Gohlke's Python page and download the wheel compatible with your system.
      (32bit : "lxml-X.X.X-cp35-cp35m-win32.whl"; 64bit: "lxml-X.X.X-cp35-cp35m-win_amd64.whl")

    2. Open the Windows console and navigate to your Python installation:

       C:\Users\USERNAME>cd  C:\Python35  
      
    3. Install the wheel with the following command:

       C:\Python35> pip install lxml-X.X.X-cp35-cp35m-win32.whl  
      
  • pywin32:

    1. Download the latest build of pywin32.
      Make sure you select the correct version (matching your Python version and 32-bit/64-bit architecture).

    2. Execute the installer

Install news-please

news-please is a registered PyPI package and can be installed via pip:

sudo pip install news-please

Minimal configuration

Before we can start a simple test run we have to check the configuration. news-please will automatically generate a config directory and files if the directory does not exist. The default location is ~/news-please/config, which can be changed by providing a custom location using the -c parameter.
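
For example, to use a configuration directory in a custom location (the path below is purely illustrative), start news-please like this:

news-please -c /path/to/my/config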

For our first test run we only look at the [Elasticsearch] section.

This section handles the connection to the Elasticsearch database. If you freshly installed Elasticsearch on your system, you probably won't need to change the configuration. Otherwise, you should review the default settings.

Address of the Elasticsearch database and the used port:

host = localhost
port = 9200

The indices used to store the extracted meta-data:

index_current = 'news-please'
index_archive = 'news-please-archive'

Credentials used for authentication (CA certificates are supported):

use_ca_certificates = False           # if True, authentication is performed
ca_cert_path = '/path/to/cacert.pem'  
client_cert_path = '/path/to/client_cert.pem'  
client_key_path = '/path/to/client_key.pem'  
username = 'root'  
secret = 'password'  

While not strictly necessary, it is highly recommended to change the user agent. Otherwise, it is likely that the crawler will be blocked by many sites sooner or later.

USER_AGENT = 'news-please (+http://www.example.com)'

First test run

Make sure your Elasticsearch server is running. Open a terminal and enter the following command:

news-please

If you did not install news-please with pip but checked out the source code, you can also go into the source code directory and run python __main__.py. Let the program run for a minute and terminate it by pressing CTRL+C once. Wait for news-please to terminate gracefully instead of pressing CTRL+C multiple times.

Inspect results stored in Elasticsearch

While it is possible to retrieve data stored in Elasticsearch without any specific tools, we recommend ElasticHQ. In order to use ElasticHQ, follow these simple steps:

  1. Ensure Elasticsearch is not running!

  2. Open the configuration file elasticsearch.yml located at either /etc/elasticsearch/ or
    at ./elasticsearch/conf/ if downloaded as archive.

  3. Add the following lines at the bottom of the file:

     http.cors.enabled : true  
     http.cors.allow-origin : "*"
     http.cors.allow-methods : OPTIONS, HEAD, GET, POST, PUT, DELETE
     http.cors.allow-headers : X-Requested-With,X-Auth-Token,Content-Type, Content-Length
    
  4. Save the configuration file and start Elasticsearch again.

  5. Go to ElasticHQ and choose your preferred version of the tool (Cloud/Plugin/Download).

  6. Enter the address of your database and press Connect. Now you should be able to see the previously defined indices and the number of articles stored within them.
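
Alternatively, you can query Elasticsearch directly without any extra tools, for example with plain HTTP requests. The commands below assume the default host, port, and index name from the configuration above; the first lists the available indices, the second returns a few of the stored articles:

curl "http://localhost:9200/_cat/indices?v"
curl "http://localhost:9200/news-please/_search?pretty&size=5"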

Optional arguments

news-please supports optional arguments that can be passed when starting the crawler. Start news-please with the -h parameter to see them.
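
For example:

news-please -h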

Add own URLs

To add your own websites, you have to either edit sitelist.hjson or create a new file and register it within the configuration. Both files are located in the config directory.

If you want to create a new input file, you have to register it in the [Files] section of config.cfg:

url_input_file_name = sitelist.hjson
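
For example, if you created an input file named my_sites.hjson (the name is just an example) in the config directory, register it like this:

url_input_file_name = my_sites.hjson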

sitelist.hjson

The input file consists of one array called base_urls and each entry represents one website to be crawled:

{
	 "base_urls" : [
		{
			"url": "http://www.faz.net/",
			"crawler": "RecursiveCrawler",
			"overwrite_heuristics": {
			  "meta_contains_article_keyword": true,
			  "og_type": true,
			  "linked_headlines": true,
			  "self_linked_headlines": false
			  },
			"pass_heuristics_condition": "meta_contains_article_keyword or (og_type and linked_headlines)"
		},
		{
			"url": "http://www.nytimes.com/",
			"crawler": "RssCrawler",
			"daemonize": 3600
		},
		...
		
	]
}

Direct URL download and extraction

news-please also supports direct URL download, i.e., you can define a list of URLs each pointing to an actual article that should just be downloaded and extracted.

# Note that this is the actual config file, which by default is just filled with examples.
{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      "crawler": "Download",
      "url": [
        # Cubs win Championship ~03.11.2016
        "http://www.dailymail.co.uk/news/article-3899956/Chicago-Cubs-win-World-Series-epic-Game-7-showdown-Cleveland.html",
        "http://www.mirror.co.uk/sport/other-sports/american-sports/chicago-cubs-win-world-series-9185077",
        "https://www.theguardian.com/sport/2016/nov/03/world-series-game-7-chicago-cubs-cleveland-indians-mlb",
        "http://www.telegraph.co.uk/baseball/2016/11/03/chicago-cubs-break-108-year-curse-of-the-billy-goat-winning-worl/",
        "https://www.thesun.co.uk/sport/othersports/2106710/chicago-cubs-win-world-series-hillary-clinton-bill-murray-and-barack-obama-lead-celebrations-as-cubs-end-108-year-curse/",
        "http://www.bbc.com/sport/baseball/37857919"
      ],

      "overwrite_heuristics": {
        "meta_contains_article_keyword": true,
        "og_type": false,
        "linked_headlines": false,
        "self_linked_headlines": false
      }
    }
  ]
}

Website Object

The entries within base_urls may have several parameters defining the start point, the crawler used, and the heuristics used to detect articles:

  • url: (string, or array of strings for the Download crawler)
    A string defining the root URL to start crawling, e.g. "http://example.com". For the Download crawler, an array of article URLs can be given instead (see the example above).

Optional Parameters:

  • crawler: (string)
    The crawler used to collect the data. For all implemented crawlers see crawlers.

  • overwrite_heuristics: (dictionary, containing mixed types)
    This overwrites the default heuristics used to detect sites containing an article. news-please expects a dict containing heuristic names as keys and, as values, the conditions an article must meet to pass the heuristic.

    Depending on the return value of a heuristic, the condition can be a bool, a string, an int or a float.

    • bool:
      Acceptable conditions are true and false, but false will disable the heuristic!
    • string:
      Acceptable conditions are simple strings: "string_heuristic": "matched_value"
    • float/int:
      Acceptable conditions are strings that may contain one comparison operator (<, >, <=, >=, =) and a number, e.g. "linked_headlines": "<=0.65".
      Do not put spaces between the operator and the number!

    For all implemented heuristics and their supported conditions, see heuristics.

  • pass_heuristics_condition: (string)
    This overwrites the default boolean expression defining the evaluation of the used heuristics. After all heuristics are tested and returned True or False, this expression will be checked.

    It may contain any heuristic name (e.g. og_type, linked_headlines), the boolean operators (and, or, not) and parentheses ((, )).

    To disable a heuristic you can either set its condition to false or omit it from pass_heuristics_condition.

  • daemonize: (int)
    If this parameter is set, the crawler will be started as a daemon. The value defines the number of seconds the crawler waits before scraping the target again. This parameter is only supported by the RssCrawler.

  • additional_rss_daemonize: (int)
    If this parameter is set, an additional RssCrawler is spawned for the same target. The value defines the number of seconds that crawler waits before scraping the target again. This parameter is not supported by the RssCrawler itself (see the sketch below).
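
For illustration, here is a sketch of a base_urls entry that combines several of the optional parameters described above (the URL, interval and threshold values are purely illustrative):

  {
    "url": "http://www.example.com/",
    "crawler": "RecursiveCrawler",
    "overwrite_heuristics": {
      "og_type": true,
      "linked_headlines": "<=0.65",
      "self_linked_headlines": false
    },
    "pass_heuristics_condition": "og_type and linked_headlines",
    "additional_rss_daemonize": 3600
  }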

Advanced Configuration

This guide covers most of the standard use cases. If you are interested in more specialized configurations, visit: