A Python data collector Cookiecutter template. The scraper is designed to run in a ScraperWiki "box", but it can be deployed in virtually any Unix environment. For detailed documentation on creating and managing scrapers on ScraperWiki, please refer to the official documentation.
Generate a new collector:
cookiecutter https://github.com/reubano/cookiecutter-collector.git
Then:
- Edit `config.py`.
- Edit `app/utils.py`.
- Edit `app/models.py`.
- Run `manage setup` to create the db.
- Run `manage run` to populate the db.
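As a hypothetical illustration of the first step, `config.py` typically centralizes the settings the other modules read. All names and values below are made up for the example, not the template's actual settings:

```python
# config.py -- a hypothetical example; the names and values below are
# illustrative only, not the template's actual settings
BASE_URL = 'http://example.com/api/data'  # source endpoint (made up)
TABLE = 'collector_data'                  # destination table name
CHUNK_SIZE = 1000                         # rows written to the db per batch
```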
By default, ScraperWiki stores data in a SQLite database named `scraperwiki.sqlite` in the user's root directory. This enables a series of features such as an interactive SQL query tool, an HTML table view with filters, and API endpoints for making remote SQL queries.
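On ScraperWiki itself this convention is usually handled by the `scraperwiki` Python library, but the idea can be sketched with the standard library alone. The `save_rows` helper and its schema below are illustrative (only the `scraperwiki.sqlite` filename and the conventional `swdata` table name come from the platform):

```python
import sqlite3

def save_rows(rows, db_path='scraperwiki.sqlite', table='swdata'):
    """Create the table if needed, then upsert rows keyed by 'id'.

    A simplified stand-in for what the scraperwiki library does when
    saving data; the two-column schema here is purely illustrative.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        'CREATE TABLE IF NOT EXISTS %s (id TEXT PRIMARY KEY, value TEXT)'
        % table)
    con.executemany(
        'INSERT OR REPLACE INTO %s (id, value) VALUES (:id, :value)' % table,
        rows)
    con.commit()
    con.close()

save_rows([{'id': '1', 'value': 'first'}, {'id': '2', 'value': 'second'}])
```

Because the data lands in a single well-known SQLite file, the platform's SQL viewer and API endpoints can pick it up without any extra wiring.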
The folder structure is as follows:
collector-skeleton
+---LICENSE
+---Makefile
+---README.md
+---app
| +---__init__.py
| +---models.py
| +---utils.py
+---bin
| +---check-stage
| +---upload
| +---setup
+---config.py
+---dev-requirements.txt
+---http
| +---index.html
+---manage.py
+---requirements.txt
+---setup.cfg
+---setup.py
+---tests
| +---__init__.py
| +---standard.rc
| +---test.sh
- `manage.py` contains the main script commands.
- `config.py` contains the configuration settings.
- `http` generally contains an `index.html` file with the summary of the scraping task and any other files that are intended to be available through an API endpoint, such as a `log.txt` file.
- `app` contains the collector model and initialization.
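To make the role of `manage.py` concrete, here is a hypothetical sketch of how its `setup` and `run` commands could be wired up with `argparse`; the template's actual script may use a CLI framework instead, and the function bodies are placeholders:

```python
# A hypothetical manage.py sketch; command names match the README,
# but the wiring and function bodies are illustrative placeholders.
import argparse

def setup():
    """Create the database tables (placeholder)."""
    return 'db created'

def run():
    """Fetch the source data and populate the database (placeholder)."""
    return 'db populated'

def main(argv=None):
    parser = argparse.ArgumentParser(description='Collector management script')
    sub = parser.add_subparsers(dest='command', required=True)
    sub.add_parser('setup', help='create the db')
    sub.add_parser('run', help='populate the db')
    args = parser.parse_args(argv)
    return {'setup': setup, 'run': run}[args.command]()

if __name__ == '__main__':
    main()
```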
- reubano/hdxscraper-acled: Armed Conflict Location & Event Data Project (ACLED) Realtime Data collector.
- reubano/hdxscraper-fao: Food and Agriculture Organization (FAO) Data collector.
- reubano/hdxscraper-fts: UN Financial Tracking Service (FTS) API collector.
I will gladly accept pull requests that improve the collector development experience.