Skip to content

Abaqueues/twfy-scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 

Repository files navigation

twfy-scraper

Description

A Python web scraper that connects to the TheyWorkForYou scraped .xml index for Hansard parliamentary debates. The web scraper collates a list of debates after a given year and proceeds to download and save each of the .xml files in the list.

Development of this web scraper was necessary after numerous unsuccessful attempts to utilise the TheyWorkForYou API to gather parliamentary debate data. Further attempts using ajparsons's Python binding (twfy-python, https://github.com/ajparsons/twfy-python) were also unsuccessful - while the binding could return data using the .getMPs function call, the .getDebates function call failed to garner a response from the server.

Usage

Modify line 13's variable according to limit (by year) the number of files you want to download from the website.

Roadmap

Currently, the web scraper must compile the list of .xml files each time it is run. Ideally, I would like this list to be output to a .txt file - if the first result of the query is the same as the first file in the .txt file, compilation of the list can be skipped.

About

ParliaSent module containing a web scraper gathering Hansard debate transcript in a raw XML data format.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages