Austrian Aid Scraper

The scraper extracts information about Austrian development projects since 2010 from the website of the Austrian Development Agency. The automatically extracted information is stored in CSV and JSON files to make further use as easy as possible.

This repository provides the code and documentation and keeps track of bugs as well as feature requests.

Used software

The source code is written in Python 2. It was created using IPython, BeautifulSoup4 and urllib2.

SCRAPER

Description

The scraper fetches the HTML of the overview pages containing the project table, stores it locally and parses out the data with BeautifulSoup4. The scraper then downloads every aid project entry and parses out its description. At the end, the data is stored as JSON and CSV files for easy use later on.

Run scraper

Go into the root folder of this repository and execute the following commands in your terminal:

cd code
python aid-scraper.py

Original source code

Thanks to Christian Goebel for the original source code, which was used as the basis for the final version.

How the scraper works

Configure the Scraper

There are two global variables in aid-scraper.py that you may want to adjust to your needs.

  • DELAY_TIME: To avoid overloading the server or getting blocked because of too many requests, set the delay between fetches to 1-5 seconds, not less.
  • TS: The timestamp string can be set to that of a previous download, so already downloaded data can be reused instead of being fetched again every time. On the first run, set the value to datetime.now().strftime('%Y-%m-%d-%H-%M'), i.e. the timestamp of when the scraper starts (see the sketch after this list).
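
For orientation, a minimal sketch of how these two globals might look near the top of aid-scraper.py (the values shown are only examples):

from datetime import datetime

# Seconds to wait between two requests; keep this between 1 and 5 seconds
# so the server is not overloaded and the scraper does not get blocked.
DELAY_TIME = 2

# Timestamp string used to label the locally stored files. For a fresh run,
# take the current time; to reuse an earlier download, set it to the
# timestamp of that run instead, e.g. '2016-04-19-10-30'.
TS = datetime.now().strftime('%Y-%m-%d-%H-%M')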

Download raw html

Here all the raw HTML data gets downloaded and stored locally, and the basic data gets parsed.

  • Download all overview pages with the tables (HTML). The navigation for the fetching runs through all overview pages by checking for the existence of the "weiter" anchor and counting up a URL variable (a sketch of this loop follows the list).
  • Open the downloaded files.
  • Parse out the basic information about each project from the overview tables. This has to happen at this stage because downloading a project page requires the link from the overview table.
  • Store the parsed data as JSON file.
  • Download all project pages (html).
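
A minimal Python 2 sketch of the pagination loop described above, using urllib2 and BeautifulSoup4 as named under "Used software". The URL, its page parameter and the local file names are placeholders, not the real ADA addresses, and the real script may structure this differently:

# -*- coding: utf-8 -*-
import time
import urllib2
from bs4 import BeautifulSoup

DELAY_TIME = 2
OVERVIEW_URL = 'http://www.example.org/projektliste?page=%d'  # placeholder URL

def download_overview_pages():
    page = 0
    while True:
        html = urllib2.urlopen(OVERVIEW_URL % page).read()
        with open('raw/overview-%d.html' % page, 'w') as f:
            f.write(html)
        soup = BeautifulSoup(html, 'html.parser')
        # Stop as soon as there is no "weiter" (next) anchor on the page.
        if soup.find('a', text='weiter') is None:
            break
        page += 1
        time.sleep(DELAY_TIME)  # be polite to the server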

Parse html

Here the description of each project gets added to the data.

  • Open the JSON data.
  • Open the project-pages files (html).
  • Parse out the additional description information ("Beschreibung") from the project pages (a sketch follows the list).
  • Store the updated data as a JSON file.
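
A sketch of this step in Python 2. The selector for the "Beschreibung" block and the naming scheme of the locally stored project pages are assumptions, since the actual page structure is not documented here:

# -*- coding: utf-8 -*-
import json
from bs4 import BeautifulSoup

def add_descriptions(json_path):
    with open(json_path) as f:
        projects = json.load(f)
    for project in projects:
        # Assumed naming scheme for the locally stored project pages.
        path = 'raw/project-%s.html' % project['contract-number'].replace('/', '-')
        with open(path) as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # Assume the description text follows a "Beschreibung" label on the page.
        label = soup.find(text='Beschreibung')
        sibling = label.parent.find_next_sibling() if label is not None else None
        project['description'] = sibling.get_text(strip=True) if sibling is not None else ''
    with open(json_path, 'w') as f:
        json.dump(projects, f)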

Export as CSV

Here the data gets exported as a CSV file.

  • Open the data (JSON).
  • Save the serialized data as a CSV file (see the sketch below).
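
A sketch of the export in Python 2; the file names are placeholders. Carriage returns inside fields are stripped here because they broke the CSV output in earlier versions (see the changelog):

# -*- coding: utf-8 -*-
import csv
import json

FIELDS = ['contract-number', 'contract-title', 'OEZA-ADA-contract-volume',
          'contract-partner', 'country-region', 'description', 'url']

def export_csv(json_path, csv_path):
    with open(json_path) as f:
        projects = json.load(f)
    with open(csv_path, 'wb') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for project in projects:
            row = {}
            for field in FIELDS:
                value = project.get(field, u'')
                # Strip carriage returns and encode for the Python 2 csv module.
                row[field] = value.replace(u'\r', u' ').encode('utf-8')
            writer.writerow(row)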

DATA INPUT

The original data comes from the project list of the Austrian Development Agency (ADA) published on their website. The data consists of all contracts approved since January 1st, 2010, listed in chronologically descending order. The date of the last update can be found on the first table page as "Datum der letzten Aktualisierung" (date of the last update).

The Tables

The tables are the basic data from which most of the information is parsed. The data is published in the following structure (e.g. the first project):

  • Vertragsnummer: 2325-02/2016
  • Vertragstitel: Programm zum Schutz der MenschenrechtsverteidigerInnen in der westlichen Region Guatemalas
  • Land/Region: Guatemala
  • OEZA/ADA-Vertragssumme: EUR 64.300,00
  • Vertragspartner: HORIZONT3000 - Österreichische Organisation für Entwicklungszusammena

Attributes

  • Vertragsnummer: contract number of the project.
  • Vertragstitel: title of the project.
  • Land/Region: country or region where the project takes place.
  • OEZA/ADA-Vertragssumme: amount of money granted by the contract.
  • Vertragspartner: partner(s) in the project.

The project pages

When you click on the contract title in a table, you get to the project page. It contains the same data as the table view, plus an additional description text (named "Beschreibung").

Soundness

So far we cannot say anything definite about the data quality (completeness, accuracy, etc.), but there is also no reason to doubt it yet.

Data errors found

DATA OUTPUT

raw html

The scraper downloads the raw HTML of each table page and each project page.

aid data JSON

The parsed data is stored in an easy-to-read JSON file for further usage.

[
	{
		"contract-number": "contract number of the project",
		"contract-title": "title of the project",
		"country-region": "country and/or region where the project takes place",
		"OEZA-ADA-contract-volume": "amount of funding by the Austrian Development Agency",
		"contract-partner": "partner organisation(s)",
		"description": "description text of the project",
		"url": "URL of the project page"
	}
]
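
The file can be loaded directly with Python's json module for further analysis; the file name below is a placeholder for whatever name the scraper wrote:

import json

with open('data/aid-data.json') as f:  # placeholder file name
    projects = json.load(f)

for project in projects[:5]:
    print project['contract-number'], project['OEZA-ADA-contract-volume']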

aid data csv

The parsed data is stored in a human-readable CSV file for further usage.

columns (see attribute description above):

  • contract-number
  • contract-title
  • OEZA-ADA-contract-volume
  • contract-partner
  • country-region
  • description
  • url

rows: one project per row.
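
Reading the file back with Python 2's csv module could look like this (the file name is again a placeholder):

import csv

with open('data/aid-data.csv', 'rb') as f:  # placeholder file name
    for row in csv.DictReader(f):
        print row['contract-title'], row['country-region']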

CONTRIBUTION

In the spirit of free software, everyone is encouraged to help improve this project.

Here are some ways you can contribute:

  • by reporting bugs
  • by suggesting new features
  • by translating to a new language
  • by writing or editing documentation
  • by analyzing the data
  • by visualizing the data
  • by writing code (no pull request is too small: fix typos in the user interface, add code comments, clean up inconsistent whitespace)
  • by refactoring code
  • by closing issues
  • by reviewing pull requests
  • by enriching the data with other data sources

When you are ready, submit a pull request.

Submitting an Issue

We use the GitHub issue tracker to track bugs and features. Before submitting a bug report or feature request, check to make sure it hasn't already been submitted. When submitting a bug report, please try to provide a screenshot that demonstrates the problem.

COPYRIGHT

All content is openly licensed under the Creative Commons Attribution 4.0 license, unless otherwise stated.

All source code is free software: you can redistribute it and/or modify it under the terms of the MIT License.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Visit http://opensource.org/licenses/MIT to learn more about the MIT License.

SOURCES

Original Data

Aid

Documentation

REPOSITORY

CHANGELOG

See the whole history. The current version follows.

Version 0.3 - 2016-04-19

extended scraper

  • aid-scraper.py: fixed the CSV output bug caused by carriage return characters.
  • updated the README.md: added a description of the scraper.