Scraper used for recording changes to Portland jail database

NguyenDa18/Portland-Jail-Data-Crawler

Multnomah County Jail Crawler

scraper-pdx-jail

Portland Justice

Purpose

Crawl the bookings in the PDX Jail Database for data analysis and data transparency. Scheduled jobs, run by GitHub Actions, keep the data files up to date.

  • Visit the Multnomah County Online Inmate Data website, using the URL that lists all inmates in custody: Link
  • Scrape inmate names and booking dates and update the csvs/inmate_bookings.csv file
  • Visit each inmate's link and update csvs/inmate_details.csv with the inmate's details and the totals for each type of charge against them
  • Update csvs/inmate_charges.csv with the list of charges for all inmates
  • Update the JSON files in the counts folder daily with the counts for each category
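The first two steps above can be sketched roughly as follows. This is only an illustration, not the repository's actual code: the HTML in `SAMPLE_HTML`, the function name `parse_bookings`, and the dictionary keys are all hypothetical, and the real inmate listing page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the inmate listing page.
# The real page's tags, ids, and column order are assumptions here.
SAMPLE_HTML = """
<table id="inmates">
  <tr><th>Name</th><th>Booking Date</th></tr>
  <tr><td><a href="/inmate/1001">Doe, Jane</a></td><td>01/15/2024</td></tr>
  <tr><td><a href="/inmate/1002">Roe, Richard</a></td><td>01/17/2024</td></tr>
</table>
"""


def parse_bookings(html):
    """Return one dict per inmate row: name, booking date, and detail link."""
    soup = BeautifulSoup(html, "html.parser")
    bookings = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # header rows use <th>, so they have no <td> cells
        link = cells[0].find("a")
        bookings.append({
            "name": cells[0].get_text(strip=True),
            "booking_date": cells[1].get_text(strip=True),
            "url": link["href"] if link else None,
        })
    return bookings


bookings = parse_bookings(SAMPLE_HTML)
```

Each resulting `url` could then be followed to collect the per-inmate details and charges described in the later steps.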

Scraper Details

  • Located at inmates_spider/inmates_spider/spiders/inmates.py
  • Generate a DataFrame of inmates and booking dates, sorted by booking date in descending order, and update csvs/inmate_bookings.csv
  • Follow each inmate's URL, generate metadata for that inmate, and update the inmates_charges MongoDB database with the charge totals
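As a rough sketch of the bookings update, the snippet below merges newly scraped rows into existing ones and sorts them by booking date, newest first. The column names (`name`, `booking_date`), the `MM/DD/YYYY` date format, the de-duplication rule, and the function name `merge_bookings` are assumptions made for illustration; the actual spider may handle these differently.

```python
import pandas as pd


def merge_bookings(existing, scraped):
    """Combine old and new bookings, drop repeats, sort newest first."""
    combined = pd.concat([existing, scraped], ignore_index=True)
    # Assumption: a name plus booking date identifies one booking.
    combined = combined.drop_duplicates(subset=["name", "booking_date"])
    # Assumption: dates arrive as MM/DD/YYYY strings.
    combined = combined.assign(
        booking_date=pd.to_datetime(combined["booking_date"], format="%m/%d/%Y")
    )
    return combined.sort_values("booking_date", ascending=False).reset_index(drop=True)


existing = pd.DataFrame({"name": ["Doe, Jane"], "booking_date": ["01/15/2024"]})
scraped = pd.DataFrame({
    "name": ["Doe, Jane", "Roe, Richard"],
    "booking_date": ["01/15/2024", "01/17/2024"],
})
result = merge_bookings(existing, scraped)
```

The sorted frame could then be written back with `result.to_csv("csvs/inmate_bookings.csv", index=False)`.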

Using

  • BeautifulSoup
  • Pandas
  • GitHub Actions (for cron job running scraper)
  • MongoDB (using pymongo Python package)

Enhancements

  • Storing data in a database
  • Optimizing crawling
  • Using a Scrapy spider instead of BeautifulSoup
  • Creating a UI for viewing data
  • Sending a notification when a "red flag" is released

Running It Yourself

Prerequisite: Python 3 must be installed.

  1. Clone the repo
  2. Activate the virtual environment:
     source venv/bin/activate
  3. Install dependencies in the virtual environment:
     pip install -r requirements.txt
  4. The best way to experiment is with Jupyter Notebook:
     jupyter notebook

Then run experimental code in Sandbox Notebook.ipynb
