internetarchive/scrapy-warcio

Scrapy Warcio

A Web Archive WARC I/O module for Scrapy

Install

$ pip install scrapy-warcio

Usage

  1. Create a project and spider:
    https://docs.scrapy.org/en/latest/intro/tutorial.html
$ scrapy startproject <project>
$ cd <project>
$ scrapy genspider <spider> example.com
  2. Copy the settings.yml distributed with scrapy_warcio and edit it with your configuration settings:
---
warc_spec: https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
max_warc_size: 10000000000  # 10GB

collection: ~ # collection name
description: ~ # collection description
operator: ~ # operator email address
robots: ~  # robots policy (obey or ignore)
user_agent: ~ # your user-agent
warc_prefix: ~ # WARC filename prefix
warc_dest: ~ # WARC files destination
...
  3. Export the path to your settings file as SCRAPY_WARCIO_SETTINGS:
$ export SCRAPY_WARCIO_SETTINGS=/path/to/settings.yml

  4. Add WarcioDownloaderMiddleware (distributed as middlewares.py) to your <project>/<project>/middlewares.py:

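import scrapy_warcio


class WarcioDownloaderMiddleware:

    def __init__(self):
        # One WARC writer shared by every request this middleware sees
        self.warcio = scrapy_warcio.ScrapyWarcIo()

    def process_request(self, request, spider):
        # Stamp the request with a WARC-Date before it is downloaded
        request.meta['WARC-Date'] = scrapy_warcio.warc_date()
        return None  # let Scrapy continue processing the request

    def process_response(self, request, response, spider):
        # Hand the response and its request to the WARC writer
        self.warcio.write(response, request)
        return response  # pass the response through unchanged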
  5. Enable WarcioDownloaderMiddleware in <project>/<project>/settings.py:
DOWNLOADER_MIDDLEWARES = {
    '<project>.middlewares.WarcioDownloaderMiddleware': 543,
}
  6. Validate your WARCs with internetarchive/warctools:
$ warcvalid WARC.warc.gz
  7. Upload your WARC(s) to your favorite web archive!
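
Two of the steps above are easy to get wrong without noticing: leaving settings.yml fields at `~`, and ending a crawl with a file that is not actually a WARC. The following is a minimal stdlib-only sketch of quick checks for both. The helper names (`unset_settings`, `preflight`, `looks_like_warc`) are hypothetical and not part of scrapy_warcio. It assumes that fields left as `~` are meant to be filled in, it scans the file line by line rather than parsing YAML, and the WARC check only looks at the first line of the gzipped file, so it is no substitute for warcvalid.

```python
import gzip
import os
import re
from pathlib import Path

# A top-level "key: ~" (or "key: null", or "key:") line, with an optional comment.
# This is a naive line scan, not a YAML parser.
_UNSET = re.compile(r"^(\w+):\s*(?:~|null)?\s*(?:#.*)?$")


def unset_settings(path):
    """Return the top-level keys in a settings.yml that are still empty."""
    keys = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        match = _UNSET.match(line)
        if match:
            keys.append(match.group(1))
    return keys


def preflight(environ=os.environ):
    """Return a list of problems found before starting a crawl."""
    settings = environ.get("SCRAPY_WARCIO_SETTINGS")
    if not settings or not Path(settings).is_file():
        return ["SCRAPY_WARCIO_SETTINGS does not point to a file"]
    return [f"{key} is not set" for key in unset_settings(settings)]


def looks_like_warc(path):
    """Return True if a gzipped file's first line is a WARC version line."""
    try:
        with gzip.open(path, "rb") as stream:
            return stream.readline().startswith(b"WARC/")
    except (OSError, EOFError):
        # Missing file, not gzip data, or a truncated archive
        return False
```

Calling `preflight()` before `scrapy crawl <spider>` returns an empty list when the environment variable points to a file with every field filled in. Calling `looks_like_warc("WARC.warc.gz")` after the crawl is a cheap sanity check to run before the full warcvalid pass.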

Help

$ pydoc scrapy_warcio

or

>>> help(scrapy_warcio)

TODO

Making this a Scrapy extension may make it more useful:
https://docs.scrapy.org/en/latest/topics/extensions.html

@internetarchive