news-please configuration

This guide focuses on the extensive configuration possibilities of news-please and explains all sections of the configuration file config.cfg.

Structure and Syntax

config.cfg holds the settings for all the different scrapers, heuristics and pipelines. The file is located by default at ~/news-please/config/config.cfg. However, it is also possible to pass a different config directory with the -c parameter:

$ news-please -c /my/custom/path

or

$ news-please -c ~/somewhere/in/userdir

The file is divided into different sections:

[section_name]

option_name = value

All values are parsed with ast.literal_eval, so options that are meant to be special data types (list, dict, bool) have to be written in valid Python syntax. Values that cannot be parsed this way are kept as plain strings:

[test_section]

# Booleans
# (bools in Python have to start with an uppercase letter)
bool_string = true # This would become a string, because it is not capitalized.
bool_bool = True   # This would become a bool with value True.

# Dicts
dict_string = {"test_1" : True, "test_2" : true} # This would become a string, because literal_eval raises a ValueError (malformed string): the value true of test_2 is written in lowercase.
dict_dict = {"test_1" : True, "test_2" : 1.1}    # This would become a dict where test_1 is a bool (True) and test_2 is a float (1.1).

# Strings
string_bool = True          # This would become a bool.
string_string = true        # This would become a string.
string_string_True = "True" # This would become a string as well; the quotation marks are stripped by literal_eval.

Sections

Crawler

The crawler section provides the default settings for all sites. These settings can often be overridden per site in input_data.hjson. A sketch of a complete [Crawler] section is shown at the end of this section.

  • default: (string)
    The default crawler to be used. All the implemented crawlers can be found here.

  • fallbacks: (dict containing strings)
    All crawlers check whether a site is compatible. If a site is incompatible, the defined fallback will be checked.
    This variable defines the fallbacks for the crawlers in a dict.

    The key is the failing crawler and the value is the fallback crawler:

     fallbacks = {
         "RssCrawler": None,
         "RecursiveSitemapCrawler": "RecursiveCrawler",
         "SitemapCrawler": "RecursiveCrawler",
         "RecursiveCrawler": None,
         "Download": None
         }
    
  • hours_to_pass_for_redownload_by_rss_crawler: (int)
    RSS crawlers are often run as a daemon. If an article stays in the RSS feed for a long time but should not be downloaded again for a specified number of hours, that interval can be set here.

  • number_of_parallel_crawlers: (int)
    The number of threads to start. Every thread downloads one site. As soon as one thread terminates, the next site will be downloaded, until all sites are finished.

  • number_of_parallel_daemons: (int) The number of daemons to run. Every daemon is another thread, but it runs in a loop. As soon as one daemon terminates, the next site in the queue will be started.

    This is in addition to number_of_parallel_crawlers.

  • ignore_file_extensions: (string)
    URLs which end with any of the following file extensions are ignored for recursive crawling.

    Default: ignore_file_extensions = "(pdf)|(docx?)|(xlsx?)|(pptx?)|(epub)|(jpe?g)|(png)|(bmp)|(gif)|(tiff)|(webp)|(avi)|(mpe?g)|(mov)|(qt)|(webm)|(ogg)|(midi)|(mid)|(mp3)|(wav)|(zip)|(rar)|(exe)|(apk)|(css)"

  • ignore_regex: (string)
    URLs which match the following regex are ignored for recursive crawling.

  • sitemap_allow_subdomains: (bool)
    If True, any SitemapCrawler will crawl the sitemaps of the given domain including its subdomains instead of only the domain's main sitemap.
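
A minimal [Crawler] section combining the options above could look like the following sketch. The crawler names are taken from the fallbacks example above; all other values are purely illustrative and not the shipped defaults:

[Crawler]
# use the sitemap crawler and fall back to recursive crawling if a site has no sitemap
default = SitemapCrawler
fallbacks = {
    "RssCrawler": None,
    "RecursiveSitemapCrawler": "RecursiveCrawler",
    "SitemapCrawler": "RecursiveCrawler",
    "RecursiveCrawler": None,
    "Download": None
    }
# illustrative values
hours_to_pass_for_redownload_by_rss_crawler = 12
number_of_parallel_crawlers = 5
number_of_parallel_daemons = 0
# shortened here; see the full default above
ignore_file_extensions = "(pdf)|(docx?)|(jpe?g)|(png)|(zip)|(css)"
ignore_regex = ""
sitemap_allow_subdomains = False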

Heuristics

This section provides the default settings for the heuristics and how they should be used. These settings can often be overridden per site in input_data.hjson. A sketch of a complete [Heuristics] section is shown at the end of this section.

  • enabled_heuristics: (dict, containing mixed types)
    This option sets the default heuristics used to detect sites containing an article. news-please expects a dict containing heuristic names as keys and, as values, the conditions an article has to meet to pass the respective heuristic.

    Depending on the return value of a heuristic, the condition can be a bool, a string, an int or a float.

    • bool:
      Acceptable conditions are True and False, but False will disable the heuristic!
    • string:
      Acceptable conditions are simple strings: "string_heuristic": "matched_value"
    • float/int:
      Acceptable conditions are strings that may contain one comparison operator (<, >, <=, >=, =) and a number, e.g. "linked_headlines": "<=0.65".
      Do not put spaces between the comparison operator and the number!

    Default: enabled_heuristics = {"og_type": True, "linked_headlines": "<=0.65", "self_linked_headlines": "<=0.56"}

    For all implemented heuristics and their supported conditions, see heuristics.

  • pass_heuristics_condition: (string)
    This string holds a boolean expression defining how the results of the used heuristics are combined. After all heuristics have been tested and returned True or False, this expression is evaluated.

    It may contain any heuristic name (e.g. og_type, overwrite_heuristics), the boolean operators (and, or, not) and parentheses ((, )).

    Default: og_type and (self_linked_headlines or linked_headlines)

  • min_headlines_for_linked_test: (int)
    This option is for the linked_headlines heuristic and cannot be overridden in input_data.hjson. If a site does not contain at least this many headlines, the heuristic is disabled for that site (it returns True).
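
As a sketch, a complete [Heuristics] section built from the defaults quoted above might look like this; the value of min_headlines_for_linked_test is illustrative:

[Heuristics]
enabled_heuristics = {"og_type": True, "linked_headlines": "<=0.65", "self_linked_headlines": "<=0.56"}
pass_heuristics_condition = og_type and (self_linked_headlines or linked_headlines)
# illustrative threshold for the linked_headlines heuristic
min_headlines_for_linked_test = 5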

Files

This section defines the paths to the input file as well as to the output files. A sketch of a complete [Files] section is shown at the end of this section.

  • relative_to_start_processes_file: (bool)
    Toggles whether relative paths are resolved against the location of start_processes.py (True) or against the location of this config file (False).

    This does not apply to this config's 'Scrapy' section, whose paths are always relative to the directory from which the start_processes.py script is called.

  • url_input: (string)
    The location of the input_data.hjson file.

  • local_data_directory: (string)
    The save path of the files which will be downloaded. The save path supports the following interpolation options:

| Interpolation string | Meaning |
| --- | --- |
| %time_downloaded(<code>) | Current time at download. Will be replaced with strftime(<code>), where <code> is a strftime format string. |
| %time_execution(<code>) | The time when the crawler execution was started. Will be replaced with strftime(<code>), where <code> is a strftime format string. |
| %timestamp_download | Current time at download (Unix timestamp). |
| %timestamp_execution | The time when the crawler execution was started (Unix timestamp). |
| %domain | The domain of the crawled file, not containing any subdomains (also not including www). |
| %appendmd5_domain(<size>) | Appends the md5 hash to %domain(<size> - 32 (md5 length) - 1 (_ as separator)) if the domain is longer than <size>. |
| %md5_domain(<size>) | First <size> chars of the md5 hash of %domain. |
| %full_domain | The domain including subdomains. |
| %url_directory_string(<size>) | The first <size> chars of the directories on the server, without the file name (e.g. http://panamapapers.sueddeutsche.de/articles/56f2c00da1bb8d3c3495aa0a/ would evaluate to articles_56f2c00da1bb8d3c3495aa0a, stripped to <size> chars). |
| %md5_url_directory_string(<size>) | First <size> chars of the md5 hash of %url_directory_string(). |
| %url_file_name(<size>) | First <size> chars of the file name (without type) on the server (e.g. http://www.spiegel.de/wirtschaft/soziales/ttip-dokumente-leak-koennte-ende-der-geheimhaltung-markieren-a-1090466.html would evaluate to ttip-dokumente-leak-koennte-ende-der-geheimhaltung-markieren-a-1090466, stripped to <size> chars). URLs without a file name (indexes) evaluate to index. |
| %md5_url_file_name(<size>) | First <size> chars of the md5 hash of %url_file_name. |
| %max_url_file_name | First x chars of %url_file_name, chosen so that the entire save path does not exceed the maximum path length of a Windows file system (260 characters - 1 <NUL>). |

Default: `local_data_directory = ./data/%time_execution(%Y)/%time_execution(%m)/%time_execution(%d)/%appendmd5_full_domain(32)/%appendmd5_url_directory_string(60)_%appendmd5_max_url_file_name_%timestamp_download.html`
  • format_relative_path: (bool)
    Toggles whether the leading ./ or .\ of the above local_data_directory should be removed when saving the path into the database. If True, ./data would become data.
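
A [Files] section using these options could look like the following sketch; the paths and the interpolation string are illustrative and merely combine placeholders from the table above:

[Files]
relative_to_start_processes_file = True
url_input = ./config/input_data.hjson
# e.g. http://www.example.com/politics/article-1.html would be saved under
# ./data/www.example.com/article-1_<timestamp>.html
local_data_directory = ./data/%full_domain/%url_file_name(60)_%timestamp_download.html
format_relative_path = True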

MySQL

  • host: (string)
    The host of the database.

  • port: (int)
    The port the MariaDB / MySQL server is running on.

  • db: (string)
    The database news-please should use.

  • username: (string)
    The username to connect to the database.

  • password: (string)
    The password matching the user to connect to the database. If your password consists of only numbers, enclose it in quotes to ensure it is handled as a string, e.g. password = '123456'.
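
Taken together, a [MySQL] section might look like this sketch; host, port, database name and credentials are placeholders for your own setup:

[MySQL]
host = localhost
port = 3306
db = newsplease
username = newsplease_user
password = '123456'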

Elasticsearch

  • host: (string)
    The host of the database.

  • port: (int)
    The port Elasticsearch is running on.

  • index_current: (string)
    Index used to store the scraped articles.

  • index_archive: (string)
    Index used to store older versions of articles held in index_current.

  • mapping: (string)
    Mapping used for the articles stored in Elasticsearch. The mapping declares the types, format and how the values are indexed for each field. For more information about mapping in Elasticsearch visit this guide.
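
A minimal [Elasticsearch] section might look like the following sketch; the index names are placeholders and the (project-specific) mapping is omitted:

[Elasticsearch]
host = localhost
port = 9200
index_current = news-please
index_archive = news-please-archive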

ArticleMasterExtractor

This section holds the settings for the ArticleMasterExtractor, which is responsible for extracting metadata from the raw HTML responses.

  • extractors: (list of strings)
    A list of extractors used to process the scraped websites and extract the data. For a list of all implemented extractors visit ArticleMasterExtractor.

    Default: extractors = ['newspaper_extractor', 'readability_extractor', 'date_extractor', 'lang_detect_extractor']
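
    For example, to run only a subset of the extractors, the list can simply be shortened (purely illustrative; the names are taken from the default above):

    [ArticleMasterExtractor]
    extractors = ['newspaper_extractor', 'date_extractor', 'lang_detect_extractor']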

DateFilter

This section holds the settings for the DateFilter module, which filters articles based on their publish date and a given time interval. A sketch of a complete [DateFilter] section is shown at the end of this section.

  • start_date: (string)
    A date defining the start of the allowed time interval; the date has to follow the format 'yyyy-mm-dd hh:mm:ss'. It is also possible to set this variable to None, creating a half-bounded interval.

  • end_date: (string)
    A date defining the end of the allowed time interval; the date has to follow the format 'yyyy-mm-dd hh:mm:ss'. It is also possible to set this variable to None, creating a half-bounded interval.

  • strict_mode: (bool)
    Enables strict mode, which filters out all articles without a publishing date.
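
As a sketch, a [DateFilter] section that keeps only articles published in 2017 and drops undated articles could look like this; the interval is illustrative:

[DateFilter]
start_date = '2017-01-01 00:00:00'
end_date = '2018-01-01 00:00:00'
strict_mode = True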

Scrapy

This section is a replacement for the settings.py provided by Scrapy. You can simply paste all Scrapy settings here. A list of options can be found here. These settings will be applied to all crawlers.

  • ITEM_PIPELINES: (dict string:int)
    A dict whose keys are the paths of the pipeline modules to use and whose values are their positions in the pipeline. Possible positions range from 0 to 1000, and the module with the lowest position is executed first.

    A list of all pipeline modules can be found here.
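
    For illustration, an ITEM_PIPELINES value follows the usual Scrapy structure, with the module path as key and the position as value. The module paths below are placeholders only; take the real paths from the list of pipeline modules mentioned above:

    [Scrapy]
    ITEM_PIPELINES = {
        'newsplease.pipeline.pipelines.ArticleMasterExtractor': 100,
        'newsplease.pipeline.pipelines.LocalStorage': 200
        }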