Team project for Georgia Tech Masters in Analytics, Spring 2022 - CSE 6242: Data & Visual Analytics
Citi Bike is a bike-sharing service in New York City with over 24,500 bikes and 1,500 bike stations. The project's goal is to provide holistic insights and visualizations of Citi Bike trends and the factors impacting ridership behavior, to support city and transit planning.
This GitHub repository contains the code for the data pipeline, visualizations, and website generation.
Team Members:
- Kevin Schneider
- Roshni Mahtani
- Stephanie Chueh
- Brent Brewington
Data is ingested into BigQuery via GitHub Actions automation connected to this repo. Details are in the workflow file citibike_trip_history.yml, and workflow runs can be viewed in this repo's "Actions" tab.
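For orientation, here is a minimal sketch of what a workflow like citibike_trip_history.yml might look like. The schedule, Python version, and step layout below are assumptions for illustration only; see the actual file in this repo for the real configuration:

```yaml
name: citibike_trip_history
on:
  schedule:
    - cron: "0 6 * * 1"   # hypothetical weekly schedule
  workflow_dispatch: {}    # allow manual runs from the Actions tab
jobs:
  ingest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - run: pip install -r requirements.txt
      - name: Copy raw trip data from AWS to GCS
        run: python src/copy_aws_to_gcs.py
        env:
          AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          GCP_CREDENTIALS: ${{ secrets.GCP_CREDENTIALS }}
```

The secrets referenced here are the ones listed in the setup section below.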
This project takes an "ELT" (Extract, Load, Transform) approach to the BigQuery data:
- Extract & Load
  - Weather Data: manually sourced from Weather Underground and loaded into BigQuery
  - Neighborhood Attributes (a.k.a. "GEO"): manually sourced from the raw Citibike trip data, cleaned, aggregated to the neighborhood level, and loaded into BigQuery
  - Citibike Trip Data: raw data extracted from the AWS bucket tripdata via /src/copy_aws_to_gcs.py (executed in a GitHub Action with command-line arguments; see citibike_trip_history.yml) and landed in a staging dataset in BigQuery
- Transform
  - Once the raw data is staged in BigQuery, dbt (in the folder /citibike_dbt) orchestrates a sequence of queries that take the raw data through intermediate tables and output final clean tables at defined granularities. Data-cleaning rules and assumptions are also enforced via dbt tests in this step.
  - dbt project documentation is published here: https://bbrewington.github.io/gatech-cse6242-citibike/dbt_docs.html
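To illustrate the Extract & Load step for the trip data: the script presumably enumerates one object per month in the tripdata bucket and copies each to GCS. Below is a minimal, hypothetical helper for the enumeration part only; the `YYYYMM-citibike-tripdata.zip` naming and the helper itself are assumptions (see /src/copy_aws_to_gcs.py for the real logic):

```python
from datetime import date

def monthly_tripdata_keys(start: date, end: date) -> list[str]:
    """Generate one object name per month between start and end (inclusive).

    The YYYYMM-citibike-tripdata.zip pattern is an assumption for
    illustration; the real bucket mixes a few naming conventions.
    """
    keys, year, month = [], start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append(f"{year}{month:02d}-citibike-tripdata.zip")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys
```

The real script would then stream each object out of S3 (e.g. via boto3) and into the GCS staging bucket.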
Weather Analysis Notebooks
- src/WeatherAnalysis_LinearRegression_R_Final.ipynb: Analysis document walking through steps to predict weather impact on ridership via Linear Regression
- src/WeatherAnalysis_RandomForest_Python_Final.ipynb: Analysis document walking through steps to predict weather impact on ridership via Random Forest
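To show the regression idea in miniature: a single-predictor ordinary-least-squares fit of daily rides against temperature. This is a stdlib-only toy sketch, not the notebooks' actual models, which use multiple weather factors (and, in one notebook, a Random Forest):

```python
def ols_fit(x: list[float], y: list[float]) -> tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares.

    Toy single-predictor version of the weather-vs-ridership regression,
    e.g. x = daily mean temperature, y = daily ride count.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)          # variance term
    sxy = sum((xi - mean_x) * (yi - mean_y)            # covariance term
              for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A positive slope would indicate that ridership rises with temperature; the notebooks quantify this properly with multi-factor models and held-out evaluation.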
Analysis files used in the website
- src/visualizations/choropleth_by_year.py: creates map of ridership by zip by year (year drop-down selection)
- src/visualizations/choropleth_timeofday.py: creates map of ridership by zip by time of day (time of day drop-down selection)
- src/visualizations/stations_and_total_rides_scatterplot.py: creates animated scatterplot showing rides by neighborhood by year
- src/visualizations/weather_factors_impact.py: creates bar plots exploring how weather factors impact ridership (weather factor selection via drop-down)
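These scripts all start from an aggregate such as "rides per ZIP per year". A hypothetical sketch of that aggregation step (the tuple layout of `trips` is an assumption; the real scripts query the final dbt tables in BigQuery):

```python
from collections import defaultdict

def rides_by_zip_and_year(trips) -> dict[tuple[str, int], int]:
    """Count rides per (zip_code, year).

    `trips` is assumed to be an iterable of (zip_code, year) tuples,
    one per ride -- the shape of the cleaned trip-level data.
    """
    counts: dict[tuple[str, int], int] = defaultdict(int)
    for zip_code, year in trips:
        counts[(zip_code, year)] += 1
    return dict(counts)
```

A table like this, joined to ZIP-code geometries, is what a choropleth library (e.g. Plotly) renders with the year drop-down.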
Other Python files
- src/copy_aws_to_gcs.py: orchestrates copying the raw trip data from AWS and staging it in GCS
- src/dbt_utility.py: Python code to be run manually, as described in the appendix below
- src/gcs_to_gbq.py: Python code to load staged GCS data into BigQuery staging dataset (which is referenced by dbt)
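As a sketch of the kind of mapping gcs_to_gbq.py needs, here is a hypothetical helper that derives a staging-table name from a staged GCS object. The function, the object-naming pattern, and the `staging` dataset name are all assumptions for illustration:

```python
import re

def staging_table_for_uri(gcs_uri: str, dataset: str = "staging") -> str:
    """Map a staged GCS object to a BigQuery staging-table name.

    Assumes objects are named YYYYMM-citibike-tripdata.* -- purely
    illustrative; see src/gcs_to_gbq.py for the real logic.
    """
    match = re.search(r"(\d{6})-citibike-tripdata", gcs_uri)
    if not match:
        raise ValueError(f"unexpected object name: {gcs_uri}")
    return f"{dataset}.citibike_trips_{match.group(1)}"
```

The real script presumably uses the google-cloud-bigquery client (e.g. a load job from a GCS URI) to load each object into the staging dataset that dbt reads from.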
Setup
Add these credentials to {repo_url}/settings/secrets/actions:
- AWS API credentials
  - AWS_ACCESS_KEY
  - AWS_SECRET_ACCESS_KEY
- Google Cloud service account JSON key
  - GCP_CREDENTIALS
Appendix: updating the dbt models & docs
- cd into /citibike_dbt
- Update & test the dbt models, and regenerate the docs:
  dbt run
  dbt test
  dbt docs generate
- This outputs the file /citibike_dbt/target/index.html, which is intended to be viewed interactively after running dbt docs serve. That doesn't work well with GitHub Pages, so the next step generates a compatible static page.
- cd into the root dir and run: python3 src/dbt_utility.py
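One common way such a utility produces a Pages-friendly page is to inline dbt's manifest.json and catalog.json directly into index.html, so the page no longer needs a local server to fetch them. A hypothetical sketch of that approach (the function name and injection strategy are assumptions; see src/dbt_utility.py for the real implementation):

```python
import json

def inline_dbt_docs(index_html: str, manifest: dict, catalog: dict) -> str:
    """Embed the dbt docs data files into index.html as a <script> block,
    producing a single self-contained static page.

    Illustrative only -- the actual dbt_utility.py may work differently.
    """
    payload = (
        "<script>"
        f"const manifest = {json.dumps(manifest)};"
        f"const catalog = {json.dumps(catalog)};"
        "</script>"
    )
    # Inject the data just before the closing </head> tag.
    return index_html.replace("</head>", payload + "</head>", 1)
```

The resulting HTML can then be copied to the GitHub Pages directory and served as a plain static file.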