StateLegiscraper

A webscraping tool for U.S. state legislature websites that exports and processes committee hearing data for text analysis. State coverage is outlined here.

Mission

The mission of StateLegiscraper is to make accessible text corpora of political, social, and scholarly significance that can build greater public transparency and academic knowledge about public policymaking and state-level politics.

Project Objective

Public oversight of the policymaking process is a cornerstone of democratic governance. As the current U.S. political climate increasingly shifts national politics to the state level, state legislatures are key policy venues to watch. In particular, committee hearings are rich sources of data that capture crucial elements of the policy process, such as interactions between policy actors, strategic use of policy narratives, and issue framing, to name a few.

However, each of the 50 state legislatures has a vastly different website and public documentation protocol. Committee hearings in each state are archived in a variety of formats (e.g., PDF, audio, or video), making it difficult for most users to access data about the work that happens during the hearing process. A systematic examination of within-state and national trends in state legislatures is therefore difficult to execute, due to the challenges of navigating, accessing, and processing the relevant data at scale and across time. While projects such as LegiScan, Civic Eagle, and Open States have APIs that provide data about bills and representatives across all 50 states, there is currently no open-source option that scrapes and processes written and spoken transcripts of state legislature committee hearings for research purposes and public review.

StateLegiscraper is an open-source tool that fills this data access gap by making all publicly available state legislature committee hearing data easily accessible, regardless of its archived format. PDFs are scraped and converted to text, while audio and video are processed through an open-source speech-to-text engine. The final outputs are text transcripts for selected committee hearings and legislative sessions, ready to work with popular Python NLP packages such as nltk and spaCy.

Repository Structure

.
├── data
│   └── dashboard
├── doc
├── examples
├── statelegiscraper
│   ├── assets
│   ├── helpers
│   ├── states
│   └── test
├── LICENSE
├── README.md
└── environment.yml

The statelegiscraper directory includes a states module, unit tests in test, and a helpers module that adds closed-source speech-to-text functionality via Google Cloud. Data relevant to dashboards are included in the data directory. The examples directory provides Jupyter notebooks that help new users learn how StateLegiscraper organizes scraping and processing.

Installation

StateLegiscraper is installed using the command line and is best used with a virtual environment due to its dependencies.

  1. Open your choice of terminal (e.g., Terminal (MacOS) or Ubuntu 20.04 LTS (Windows))
  2. Clone the repository using git clone https://github.com/ka-chang/StateLegiscraper.git
  3. Change to the StateLegiscraper directory using cd StateLegiscraper
  4. Set up a new virtual environment with all necessary packages and their dependencies using conda env create -f environment.yml
  5. Activate the statelegiscraper virtual environment with conda activate statelegiscraper
  6. Deactivate the statelegiscraper virtual environment using conda deactivate

Requirements

StateLegiscraper requires the manual download of two categories of files: (a) Google Chrome and Chrome Driver, and (b) DeepSpeech model files.

Google Chrome and Chrome Driver

StateLegiscraper's webscraping tool uses a Python-based web browser automation tool, Selenium. This requires a specific browser and browser driver to work properly. The package is built using Google Chrome.

To check your installed Chrome version and to download the appropriate Chrome Driver, follow these instructions:

  1. Open Google Chrome
  2. At the top right corner of the browser, click the settings tab (three vertical dots ⋮)
  3. Navigate down to Help > About Google Chrome
  4. Your Google Chrome version is listed at the top of the page
  5. Find the Chrome Driver that corresponds to your version and save it to your local drive. We recommend saving it within the cloned repository directory statelegiscraper/assets for organizational purposes.
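Once the driver is saved, pointing Selenium at it might look like the following sketch. The driver filename and location here are assumptions based on the recommendation above; adjust them to your setup.

```python
from pathlib import Path

# Assumed driver location, following the recommendation above; the exact
# filename ("chromedriver") is an assumption and varies by platform.
driver_path = Path("statelegiscraper") / "assets" / "chromedriver"

# Launching the browser would then look like this (requires Selenium 4
# and a Chrome install matching the driver version):
# from selenium import webdriver
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service(str(driver_path)))

print(driver_path)
```

Keeping the driver inside the repository's assets directory makes the path stable across the example notebooks.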

DeepSpeech Model Files

StateLegiscraper uses an open-source speech-to-text engine called DeepSpeech to process audio files into text transcripts. DeepSpeech requires acoustic model files to run, and StateLegiscraper's audio_to_text functions depend on them. You can read more about DeepSpeech's acoustic models in their release notes published here.

To download the DeepSpeech v0.9.3 model and model scorer, follow these instructions in your terminal of choice:

  1. Navigate to the assets folder in the statelegiscraper package using cd statelegiscraper/assets.
  2. Download the DeepSpeech v0.9.3 model into the assets directory using curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
  3. Download the DeepSpeech v0.9.3 model scorer into the assets directory using curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
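After downloading, a short check like the sketch below can confirm that both files landed where StateLegiscraper expects them. The assets path mirrors the steps above; run it from the repository root.

```python
from pathlib import Path

# Directory where the model files are expected, per the steps above.
assets = Path("statelegiscraper") / "assets"

# File names match the DeepSpeech v0.9.3 release assets.
expected = [
    "deepspeech-0.9.3-models.pbmm",
    "deepspeech-0.9.3-models.scorer",
]

# Report any file that has not been downloaded yet.
missing = [name for name in expected if not (assets / name).exists()]
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("All DeepSpeech model files found.")
```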

Usage

Tool

StateLegiscraper contains U.S. state-specific modules that each contain two classes of functions: a Scrape class and a Process class.

  • The Scrape class bundles functions that scrape U.S. state legislature websites for links to individual committee hearing and floor speech transcripts in PDF, audio, or video form. Users export this raw data to their local drive or a mounted cloud drive.
  • The Process class bundles functions that clean and format the raw scraped data into Python objects ready for popular NLP packages (e.g., nltk, spaCy). Scraped PDF files are converted to dictionary objects, while audio and video files are run through DeepSpeech, an open-source speech-to-text engine, to generate text transcripts of selected meetings. These transcripts can be used as dictionary objects or exported as JSON files.

Example Jupyter notebooks are provided in the examples directory that walk new users through StateLegiscraper's scrape and process functions, including expected behavior from Selenium and file management strategies.
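As a minimal sketch of working with Process-style output, the snippet below builds a transcript dictionary, exports it as JSON, and tokenizes the text. The dictionary keys and sample text are illustrative assumptions, not the package's actual schema, and the split() call is a stdlib stand-in for nltk/spaCy tokenization.

```python
import json

# A transcript dictionary shaped like the Process output described above;
# the key and sample text are illustrative assumptions.
transcripts = {
    "2021-03-15_health_committee": (
        "The committee will come to order. We will begin with public testimony."
    ),
}

# Export to JSON, one of the output options mentioned above.
with open("transcripts.json", "w") as f:
    json.dump(transcripts, f, indent=2)

# The text is then ready for NLP tooling; whitespace tokenization here
# stands in for nltk or spaCy.
tokens = transcripts["2021-03-15_health_committee"].split()
print(len(tokens), "tokens")  # → 12 tokens
```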

Dashboard

StateLegiscraper is also designed to support public-facing dashboards built on the scraped state legislature data. The intention is for these dashboards to give interested users a view of high-level narrative trends within a specific state and/or policy area. The dashboards are currently under development using Streamlit and will be published in this section when ready.

Use Cases

Researchers can gather raw data for nuanced, tailored analysis, while members of the public can engage with our text analysis dashboards to capture high-level trends in the political discourse at the state legislature. Read detailed user stories here.

Requests

The ambition of StateLegiscraper is to one day cover and maintain all 50 state legislature websites. If you'd like to request a state, build a dashboard, or suggest a feature to extend the functionality of StateLegiscraper, please feel free to raise an issue.

Bug Report

Achieving broad and stable coverage for StateLegiscraper is a priority. States in active development are expected to have bugs, but bug reports for states identified as stable on the state coverage documentation would be much appreciated. If you would like to report a bug or issue, please submit a detailed report at this link.

Contributions

If you'd like to expand StateLegiscraper to other states, use the data to add to our dashboard options, or add additional features to the tool, please fork the repository, add your contribution, and generate a pull request. The complete contributing guide can be found at this link. This project operates under the Contributor Code of Conduct.

Acknowledgements

Many thanks to Dr. David Beck and Anant Mittal from the University of Washington for their support, guidance, and feedback in the development of this package.

StateLegiscraper logo adapted from Icon8 icons.
