parliament-scaper

A public data scraper for parliamentary data from the EU and other parliaments.

Ruby Based Crawler Setup

Install git (if not present already)
Clone the project using git clone https://github.com/fossasia/parliament-scaper.git
Install Ruby (version >= 2.1) and Bundler
Run bundle install to install the required gems
Run the script using ruby eu_scraper.rb or ./eu_scraper.rb
Find the scraped questions in the docs/ folder

Technologies Used in Ruby crawler:

Ruby - The Language
Nokogiri - For HTML Parsing

Scala-based Asynchronous crawler Setup

Install sbt, git and the latest version of Scala (sbt will do the update for you)
git clone https://github.com/DengYiping/parliament-scaper.git
sbt run
sbt first downloads the necessary dependencies automatically and then runs the script.

Technologies Used in Scala crawler:

Scala: a functional programming language on the JVM
Akka: an effective framework for asynchronous, non-blocking and event-driven programming in Scala
Spray-client: a lightweight HTTP client based on the Akka actor model

Python Based Crawler Setup

Install the requirements for this crawler: pip install -r requirements.txt
Run $ python eu-scraper.py

Technologies Used in Python Crawler:

Requests library
lxml library for DOM traversal

Python-async parser setup

Create a virtual environment inside the python-async folder with virtualenv --python=python3.4 venv
Activate your virtual environment with source venv/bin/activate
Install all appropriate requirements with pip install -r requirements.txt
Run the parser with $ python parser.py

Changing the parser behavior

Change YEARS_TO_PARSE in order to parse data from different years
Change FOLDER_TO_DOWNLOAD in order to change the name of the folder to download the data into.
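
As a rough illustration of how these two settings might be used, the sketch below assumes that YEARS_TO_PARSE is a list of integer years and FOLDER_TO_DOWNLOAD is a folder name. The helpers output_dir_for and planned_output_dirs, the example values, and the one-subfolder-per-year layout are hypothetical and are not taken from parser.py.

```python
from pathlib import Path

# Assumed shapes for the two settings named above; the real values live in parser.py.
YEARS_TO_PARSE = [2014, 2015]
FOLDER_TO_DOWNLOAD = "data"


def output_dir_for(year, base=FOLDER_TO_DOWNLOAD):
    """Return the (hypothetical) folder where documents for one year would go."""
    return Path(base) / str(year)


def planned_output_dirs(years=YEARS_TO_PARSE, base=FOLDER_TO_DOWNLOAD):
    """List one output folder per configured year, without touching the disk."""
    return [output_dir_for(year, base) for year in years]


if __name__ == "__main__":
    for folder in planned_output_dirs():
        print(folder)
```

Editing the two constants at the top is then enough to change which years are planned and where the results would land.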

Technologies Used in Python-async parser:

Requests + requests-futures for async requests
threading for async downloading
beautifulsoup4 for DOM parsing
tqdm for progress bar
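
The thread-based downloading mentioned above can be sketched with the standard library alone. This is a minimal sketch, not the parser's actual code: fake_fetch is a hypothetical stand-in for an HTTP request so that the example runs without network access, and download_all, the worker count and the example URLs are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url):
    """Stand-in for an HTTP GET; returns a placeholder body, no network used."""
    return "<html>content of {}</html>".format(url)


def download_all(urls, fetch=fake_fetch, max_workers=4):
    """Fetch every URL on a thread pool and return a {url: body} mapping."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front, remembering which future serves which URL.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # Collect results in completion order rather than submission order.
        for future in as_completed(future_to_url):
            results[future_to_url[future]] = future.result()
    return results


if __name__ == "__main__":
    pages = download_all(["https://example.org/doc/1", "https://example.org/doc/2"])
    print(sorted(pages))
```

In a real run, fetch would presumably be replaced by a function that performs the request through Requests and returns the response body.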

Python-Based Scraper (pol's scraper)

This scraper uses the BeautifulSoup package to parse and extract data from the parliament's site. The script can also calculate how many pages it has to download based on the number of questions to be scraped.

Install the requirements: pip install -r requirements.txt
Run $ python scraper.py
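
The page calculation described for this scraper amounts to a ceiling division. The sketch below shows one way to do it; the function name pages_needed and the page size used in the example are assumptions, since the number of questions per results page is not stated here.

```python
def pages_needed(total_questions, per_page):
    """Number of result pages required to cover total_questions items."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    if total_questions <= 0:
        return 0
    # Integer ceiling division: a partially filled last page still counts.
    return (total_questions + per_page - 1) // per_page


if __name__ == "__main__":
    # Hypothetical example: 95 questions at 10 per page.
    print(pages_needed(95, 10))
```

Using integer arithmetic avoids floating-point rounding when the counts get large.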

Scrape it all - Generic Scraper (pol's scraper 2)

This scraper uses the BeautifulSoup package to parse and extract data from the parliament's site. The script can also calculate how many pages it has to download based on the number of docs to be scraped.

Generic Scraper - all years, all languages. It scrapes the entire database.

Install the requirements: pip install -r requirements.txt
Run $ python scrape_it_all.py
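
Covering all years and all languages means visiting every year and language combination. The sketch below illustrates that idea only; build_jobs, the example years and the language codes are hypothetical and do not come from scrape_it_all.py.

```python
from itertools import product


def build_jobs(years, languages):
    """Return one (year, language) pair for every combination to scrape."""
    return list(product(years, languages))


if __name__ == "__main__":
    # Hypothetical inputs; the real script decides its own years and languages.
    for year, language in build_jobs([2014, 2015], ["EN", "DE"]):
        print(year, language)
```

Each pair could then be handed to the page-by-page download for that year and language.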
