
news-please pipeline

The news-please pipeline offers several modules for processing, filtering and storing the results of the crawlers. This section explains the different pipeline modules and their configuration.

Processing

ArticleMasterExtractor

  • Module path: newscrawler.pipeline.pipelines.ArticleMasterExtractor

  • Functionality:
    The ArticleMasterExtractor bundles several tools into one pipeline module in order to extract metadata from raw articles. Based on the HTML response of the processed pipeline item, it extracts:

    • author
    • date the article was published
    • article title
    • article description
    • article text
    • top image
    • article language
  • Configuration:
    While the module works fine with the default settings, it is possible to reconfigure the tools used in the extraction process. These changes can be made in the ArticleMasterExtractor section of the config file; a short sketch of how a downstream module can consume the extracted fields follows below.

    More detailed information about the module and the incorporated extractors can be found here.
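
    To make this concrete, here is a minimal sketch of a downstream pipeline module that consumes one of the extracted fields. The module name and the item field name 'article_text' are illustrative assumptions, not part of news-please's documented API:

     #!python
     from scrapy.exceptions import DropItem

     class ShortArticleFilter(object):
         # Hypothetical downstream module: drops items whose extracted
         # text is missing or very short ('article_text' is an assumed
         # field name based on the list above).
         def process_item(self, item, spider):
             text = item.get('article_text')
             if not text or len(text) < 200:
                 raise DropItem('extracted text missing or too short')
             return item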

Filter

Date filter

  • Module path: newscrawler.pipeline.pipelines.DateFilter

  • Functionality:
    This module filters the extracted articles based on their publishing date. It can drop all articles published outside a configurable range, i.e. before a start date and/or after an end date. It also implements a strict mode that drops all articles without an extracted publishing date.

  • Requirements:
    Because it relies on extracted metadata (the publishing date), the module only works if placed after a suitable extractor in the pipeline.

  • Configuration:
    The configuration is done in the DateFilter section of newscrawler.cfg:

     #!python
     [DateFilter]
     start_date = '1999-01-01 00:00:00'
     end_date = '2999-12-31 00:00:00'
     strict_mode = False

    Dates can be either None or a date string in the format 'yyyy-mm-dd hh:mm:ss'.
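
    The filtering itself boils down to a date comparison. As a rough sketch of the assumed logic (not the actual implementation):

     #!python
     from datetime import datetime

     FORMAT = '%Y-%m-%d %H:%M:%S'

     def keep_article(publish_date, start_date, end_date, strict_mode):
         # Strict mode drops every article without an extracted date.
         if publish_date is None:
             return not strict_mode
         date = datetime.strptime(publish_date, FORMAT)
         if start_date is not None and date < datetime.strptime(start_date, FORMAT):
             return False
         if end_date is not None and date > datetime.strptime(end_date, FORMAT):
             return False
         return True

    With the default settings above, keep_article('2005-06-15 08:00:00', '1999-01-01 00:00:00', '2999-12-31 00:00:00', False) returns True.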

HTML code handling

  • Module path: newscrawler.pipeline.pipelines.HTMLCodeHandling

  • Functionality:
    This module checks the server responses and drops the processed site if the request was not accepted. As of 2016-06-22 this module is not active, but it serves as an example of a pipeline module.
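
    Since the module is primarily an example, a minimal sketch of the described behaviour may be more instructive than its configuration; the item field name 'spider_response' is an assumption:

     #!python
     from scrapy.exceptions import DropItem

     class HTMLCodeHandling(object):
         # Sketch of the described behaviour, not the actual source.
         def process_item(self, item, spider):
             response = item['spider_response']  # assumed field name
             # 2xx status codes mean the request was accepted
             if not 200 <= response.status < 300:
                 raise DropItem('request not accepted: HTTP %d' % response.status)
             return item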

Storage

Local storage

  • Module path: newscrawler.pipeline.pipelines.LocalStorage

  • Functionality:
    This module stores the downloaded articles on the local file system.

Elasticsearch storage

  • Module path: newscrawler.pipeline.pipelines.ElasticsearchStorage

  • Functionality:
    This module stores the extracted data in a given Elasticsearch database. It manages two separate indices: one for current articles and one for archiving previous versions of updated articles. Both indices use the following default mapping to store the articles and extracted metadata:

     mapping = {
         'url': {'type': 'string', 'index': 'not_analyzed'},
         'sourceDomain': {'type': 'string', 'index': 'not_analyzed'},
         'pageTitle': {'type': 'string'},
         'rss_title': {'type': 'string'},
         'localpath': {'type': 'string', 'index': 'not_analyzed'},
         'ancestor': {'type': 'string'},
         'descendant': {'type': 'string'},
         'version': {'type': 'long'},
         'downloadDate': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
         'modifiedDate': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
         'publish_date': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
         'title': {'type': 'string'},
         'description': {'type': 'string'},
         'text': {'type': 'string'},
         'author': {'type': 'string'},
         'image': {'type': 'string', 'index': 'not_analyzed'},
         'language': {'type': 'string', 'index': 'not_analyzed'}
     }
    
  • Configuration:
    To use this module, enter the host address, the port and, if needed, your user credentials in the Elasticsearch section of newscrawler.cfg. There you can also change the names of the indices and the mapping used to store the article data.
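
    For orientation, this is roughly how a document matching the default mapping could be indexed with the elasticsearch-py client. Host, credentials, index name and field values are placeholders mirroring the newscrawler.cfg settings, not news-please internals:

     #!python
     from elasticsearch import Elasticsearch

     # Placeholder connection values; mirror your Elasticsearch section.
     es = Elasticsearch(['localhost:9200'], http_auth=('user', 'secret'))
     es.index(index='news-please', doc_type='article', body={
         'url': 'http://example.com/story',
         'sourceDomain': 'example.com',
         'title': 'Example story',
         'downloadDate': '2017-10-05 12:00:00',  # format from the mapping above
     })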

MySQL storage

  • Module path: newscrawler.pipeline.pipelines.MySQLStorage

  • Functionality:
    This module stores the extracted data in a given MySQL or MariaDB database. It manages two separate tables: one for current articles and one for archiving previous versions of updated articles:

    [ER diagram of the MySQL tables]

  • Configuration:
    To use this module, enter the host address, the port and, if needed, your user credentials in the MySQL section of newscrawler.cfg. A setup script, init-db.sql, is provided for convenient creation of the required tables.
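
    As a quick, illustrative sanity check, you can connect with the same credentials and count the stored articles. The table name 'CurrentVersions' is an assumption; use whatever names init-db.sql actually creates:

     #!python
     import pymysql

     # Placeholder connection values; mirror your MySQL section settings.
     conn = pymysql.connect(host='localhost', port=3306, user='user',
                            password='secret', db='news-please')
     with conn.cursor() as cursor:
         cursor.execute('SELECT COUNT(*) FROM CurrentVersions')  # assumed table name
         print(cursor.fetchone()[0])
     conn.close()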

RSS crawl compare

  • Module path: newscrawler.pipeline.pipelines.RSSCrawlCompare

  • Functionality:
    Similar to the MySQL storage module, this module works with MySQL or MariaDB databases. Unlike the MySQL module, however, it only processes articles returned by the RSS crawler.

    For every article passed to it, the module looks for an older version in the database and updates its fields if a certain amount of time has passed since the last update/download. This module does not save new articles; it is only meant to keep the database up to date.

  • Configuration:
    To use this module, enter the host address, the port and, if needed, your user credentials in the MySQL section of newscrawler.cfg. To set up the required tables, simply execute the provided setup script init-db.sql. You can also adjust the interval at which articles are updated via the hours_to_pass_for_redownload_by_rss_crawler parameter in the Crawler section.
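
    The redownload decision itself is a simple time comparison; a sketch of the assumed logic:

     #!python
     from datetime import datetime, timedelta

     def needs_update(last_download, hours_to_pass):
         # Update an article only if at least hours_to_pass hours
         # have passed since its last download.
         return datetime.now() - last_download >= timedelta(hours=hours_to_pass)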