Multnomah County Jail Crawler

Purpose

Crawl through the bookings in the PDX Jail Database for data analysis and data transparency purposes. Scheduled GitHub Actions jobs keep the data files up to date.

Visit the Multnomah County Online Inmate Data website, using the URL that lists all inmates in custody: Link
Scrape inmate names and booking dates, and update the csvs/inmate_bookings.csv file
Visit each inmate's link and update csvs/inmate_details.csv with the inmate's details and the totals for each type of charge against them
Update csvs/inmate_charges.csv with the list of charges for all inmates
Update the JSON files in the counts folder daily with the counts for each category
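
The scraping step can be illustrated with a minimal sketch using BeautifulSoup and Pandas. The HTML layout, the sample rows, the column names (name, url, booking_date), and the date format are all assumptions made for illustration; the real page structure is not described in this README, and this is not the code in the repo.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up sample markup; the real site's table layout is an assumption.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Booking Date</th></tr>
  <tr><td><a href="/detail/1">DOE, JANE</a></td><td>01/02/2024</td></tr>
  <tr><td><a href="/detail/2">ROE, RICHARD</a></td><td>03/15/2024</td></tr>
</table>
"""


def parse_bookings(html):
    """Return a DataFrame of inmates, newest booking first."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue  # skip the header row, which has <th> cells only
        link = cells[0].find("a")
        rows.append({
            "name": cells[0].get_text(strip=True),
            "url": link["href"] if link else None,
            "booking_date": cells[1].get_text(strip=True),
        })
    df = pd.DataFrame(rows, columns=["name", "url", "booking_date"])
    # Assumed date format: month/day/year.
    df["booking_date"] = pd.to_datetime(df["booking_date"], format="%m/%d/%Y")
    return df.sort_values("booking_date", ascending=False).reset_index(drop=True)


if __name__ == "__main__":
    print(parse_bookings(SAMPLE_HTML))
```

The resulting DataFrame could then be written out with df.to_csv("csvs/inmate_bookings.csv", index=False), and the url column gives the per-inmate links to follow in the next step.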

Scraper Details

Located at inmates_spider/inmates_spider/spiders/inmates.py
Generate a DataFrame of inmates and booking dates and update csvs/inmate_bookings.csv, sorted by booking date in descending order
Follow each inmate's URL, generate metadata for each inmate, and update the inmates_charges MongoDB database with charge totals data
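
As a rough sketch of the MongoDB update, the helper below upserts one inmate's charge totals. It reads "charge totals" as a count per charge type, which is an assumption, as are the document fields (name, charge_totals) and the function names. The collection argument only needs an update_one method like the one on a pymongo Collection, so the logic can be tried without a running database.

```python
from collections import Counter


def charge_totals(charge_types):
    """Return {charge_type: count} for one inmate's list of charge types."""
    return dict(Counter(charge_types))


def upsert_charge_totals(collection, inmate_name, charge_types):
    """Insert or refresh one inmate's charge totals.

    `collection` is expected to behave like a pymongo Collection; only
    its update_one(filter, update, upsert=...) method is used.
    """
    doc_filter = {"name": inmate_name}  # hypothetical key field
    update = {"$set": {"charge_totals": charge_totals(charge_types)}}
    # upsert=True creates the document if no match exists yet.
    return collection.update_one(doc_filter, update, upsert=True)
```

With pymongo installed, a real collection could be obtained along the lines of MongoClient(uri)["inmates_charges"][collection_name], where the connection URI and the collection name are not given in this README and would need to be supplied.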

Using

BeautifulSoup
Pandas
GitHub Actions (for cron job running scraper)
MongoDB (using pymongo Python package)

Enhancements

Storing data in a database
Optimizing crawling
Using a Scrapy Spider instead of BeautifulSoup
Creating a UI for viewing data
Sending a notification when a "red flag" is released

Running It Yourself

Prerequisite: Python 3 must be installed

Clone the repo
Activate the virtual environment

source venv/bin/activate

Install the dependencies in the virtual environment

pip install -r requirements.txt

The best way to experiment is with Jupyter Notebook:

jupyter notebook

Then run experimental code in Sandbox Notebook.ipynb


NguyenDa18/Portland-Jail-Data-Crawler
