Data Engineering Project using NYC Citi Bike data for years 2019, 2020, and 2023

Biking in the Big Apple: Trends and Insights in New York City's Citi Bike Program

Project Overview 🚀

This project analyzes data from New York City's Citi Bike share program for the years 2019, 2020, and 2023. The primary goal is to investigate ridership trends before the 2020 pandemic, during 2020, and in the year 2023. The project aims to answer the following questions:

  1. How did the 2020 pandemic impact ridership, and did it rebound significantly by 2023?
  2. Are there any differences or similarities in monthly usage trends for each year, particularly during the 2020 pandemic?
  3. How does total usage by bike program members compare with that of non-members across all three years combined?
  4. Which bike stations are the most popular, and what are the total trips and average trip times for each of the top 5 stations?

By analyzing the data, we can gain insights into the popularity of certain areas in New York City among Citi Bike users and determine if the most popular stations change over time, even as Citi Bike relocates bike stations. The project also examines whether the average trip time for users changes significantly when starting and completing trips at these popular stations.

Through this analysis, we aim to better understand the impact of the pandemic on bike-sharing in New York City and identify any lasting changes in ridership patterns and user behavior.

Tech Stack 🖥

  • Terraform: Used for Infrastructure as Code (IaC) to create a Google Cloud Storage (GCS) bucket and BigQuery dataset

  • Google Cloud Storage (GCS): Used as a data lake to store the raw Citi Bike data in parquet format

  • Google BigQuery: Used as the data warehouse for storing and analyzing the Citi Bike data

  • PyArrow: Used in conjunction with the Data Load Tool (dlt) for the initial data pipeline

  • Data Load Tool (dlt): Used for the ETL process: building data pipelines, running dbt transformations, and setting up GitHub Actions deployments

  • dbt (Data Build Tool): Employed for performing data transformations and running tests on the data stored in BigQuery

  • GitHub Actions: Integrated with dlt to deploy a data pipeline that runs on a quarterly schedule

  • Google Looker Studio: Used for creating a dashboard to visualize the analyzed Citi Bike data

Data Structure 🧱

Raw Tables

| 2019 | 2020 | 2023 |
| --- | --- | --- |
| tripduration | tripduration | - |
| starttime | starttime | started_at |
| stoptime | stoptime | ended_at |
| start station id | start station id | start_station_id |
| start station name | start station name | start_station_name |
| start station latitude | start station latitude | start_lat |
| start station longitude | start station longitude | start_lng |
| end station id | end station id | end_station_id |
| end station name | end station name | end_station_name |
| end station latitude | end station latitude | end_lat |
| end station longitude | end station longitude | end_lng |
| bikeid | bikeid | ride_id |
| usertype | usertype | member_casual |
| birth year | birth year | - |
| gender | gender | - |
| - | - | rideable_type |

Number of rows

  • 2019 - 20,551,697
  • 2020 - 19,506,857
  • 2023 - 36,226,600

Final Materialized Table

2019 - 2023 (Combined)
ride_id
start_time_date
stop_time_date
trip_duration_seconds
start_station_name
stop_station_name
user_category

Number of rows

  • 76,285,697

Analysis 📊

The link to the dashboard is here. (If the dashboard doesn't display, the Google Cloud credits have run out.)

Instructions to Replicate the Project 🧭

Step 1: Set up Google Cloud

  1. Create a Google Cloud account if you don't have one already. Google Cloud offers a $300 credit that's valid for 90 days, which can be used for this project.

  2. Create a new project and project ID within Google Cloud. You can do this either through the Google Cloud Console or using the Google Cloud SDK CLI.
    If you prefer the CLI approach, follow the steps in this video.

  3. Create a Service Account through the IAM (Identity and Access Management) section in Google Cloud Console. Assign the following permission roles to the service account:

    • Viewer
    • Storage Admin
    • Storage Object Admin
    • BigQuery Admin
    • Owner (if needed)
  4. Generate and download a JSON key for the newly created Service Account. The key is used by Terraform, the dbt and dlt scripts, and the Google Cloud SDK CLI (if you use it); a gcloud sketch of steps 3 and 4 follows this list.

    Note: Save the JSON key to a local folder. If the folder is located within the project directory, consider adding it to your .gitignore file to prevent the key from being accidentally committed to version control.
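
    If you prefer the CLI for steps 3 and 4, a minimal gcloud sketch looks like this (the service account name, project ID, and key path are placeholders):

    gcloud iam service-accounts create citi-bike-sa --project=YOUR_PROJECT_ID

    # Grant each role listed above (repeat for roles/viewer, roles/storage.objectAdmin, roles/bigquery.admin)
    gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
      --member="serviceAccount:citi-bike-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/storage.admin"

    # Generate and download the JSON key
    gcloud iam service-accounts keys create ~/keys/citi-bike-key.json \
      --iam-account=citi-bike-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com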

Step 2: Clone this repo

  1. Clone me:
    git clone https://github.com/agutiernc/nyc-citi-bike-insights.git
    
  2. Switch to the project directory:
    cd nyc-citi-bike-insights
    
  3. In a terminal, run the setup bash script, which renames the secret directories and files (already listed in .gitignore):
    chmod +x setup.sh # Make it executable
    
    ./setup.sh # To run it  

Step 3: Set up Terraform

Detailed instructions for setting up Terraform can be found in the Terraform directory.
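
As a quick reference, the typical Terraform flow looks roughly like this (a sketch assuming the directory is named terraform and exposes a project variable; see that directory for the actual variable names):

cd terraform
terraform init                                    # download the Google provider and initialize state
terraform plan -var="project=YOUR_PROJECT_ID"     # preview the GCS bucket and BigQuery dataset
terraform apply -var="project=YOUR_PROJECT_ID"    # create the resources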

Step 4: Create a Python virtual environment in the project directory, activate it, and install the requirements:

python -m venv venv

source venv/bin/activate # activate the virtual environment

pip install -r requirements.txt

Step 5: Perform ETL using Data Load Tool (dlt)

Detailed instructions for performing the ETL process using the Data Load Tool (dlt) can be found in the dlt_data_load directory.
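
For orientation, running the pipeline locally looks roughly like this (a sketch assuming the load_bike_data_to_gcs.py script referenced in Step 8 is executed directly; see the dlt_data_load directory for the exact steps and credential setup):

cd dlt_data_load
python load_bike_data_to_gcs.py   # extract the Citi Bike data and load parquet files into the GCS bucket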

Step 6: Create External Tables in BigQuery

In this step, 3 external tables are created in BigQuery using the parquet files stored in the GCS bucket. You have two options to accomplish this:

Option 1: Using SQL

  1. Open parquet_to_external_tables.sql, in the sql_external_tables folder, and copy the SQL commands.

  2. Open the BigQuery dataset citi_bike_data in the Google Cloud Console and open a new Query tab.

  3. Paste the SQL commands from the parquet_to_external_tables.sql file into the Query editor.

  4. Edit the fields in the SQL commands with your own values, such as the project ID and the gsutil URIs of the .parquet files in the bucket folders (see the sketch after this list).

  5. Run the query, one command at a time, to create the external tables.
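
If you prefer the command line instead of the Query editor, each statement takes roughly this shape when run with the bq tool (a sketch; the project ID, bucket, and folder path are placeholders, not the values in parquet_to_external_tables.sql):

  bq query --use_legacy_sql=false '
  CREATE OR REPLACE EXTERNAL TABLE `YOUR_PROJECT_ID.citi_bike_data.citi_bike_2019`
  OPTIONS (
    format = "PARQUET",
    uris = ["gs://YOUR_BUCKET/citi_bike_2019/*.parquet"]
  )'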

Option 2: Using Google Cloud Console GUI

Follow the step-by-step guide in this PDF document (skip to step 3) to create the external tables using the Google Cloud Console GUI within BigQuery.

  • The dataset is citi_bike_data; name the external tables as follows: citi_bike_2019, citi_bike_2020, citi_bike_2023

Step 7: Data Transformations using dlt and dbt

In this step, data transformations are performed on the raw data stored in the external tables to create a single materialized table. The transformations will be done using the Data Load Tool (dlt) to create the pipelines for BigQuery and the dbt (Data Build Tool) runner to execute the transformations.

Detailed instructions and necessary files for this step can be found in the dbt_dlt_transformations directory.
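
If you want to run or debug the dbt project on its own (outside the dlt dbt runner), the standard dbt commands apply (a generic sketch, assuming the dbt project lives under dbt_dlt_transformations and a BigQuery profile is configured):

cd dbt_dlt_transformations
dbt run    # build the transformation models into the materialized table in BigQuery
dbt test   # run the data tests against the built models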

Step 8: Deploy Pipelines with GitHub Actions using dlt

In this step, we'll use dlt to deploy our pipelines with GitHub Actions, enabling automated execution every quarter of the year.

Prerequisites:

  • The project should be hosted in a GitHub repository.

Steps:

  1. Deploy the ETL pipeline:

    • Navigate to the dlt_data_load directory in the terminal
    • Run:
      dlt deploy load_bike_data_to_gcs.py github-action --schedule "0 0 1 */3 *"
      

    This command creates a GitHub Actions workflow .yml file in the .github/workflows directory with the necessary environment variables. The cron schedule 0 0 1 */3 * triggers the workflow at midnight on the first day of every third month.

  2. Add secret values to GitHub:

    • After running the deploy command, dlt prints out the Name and Secret pairs.
    • Copy and paste each Name and Secret pair into the GitHub UI, which can be accessed at the link printed by the dlt deploy command (e.g., github.com/.../settings/secrets/actions).
  3. Commit and push the changes to GitHub:

    • Add and commit the changes to your local repository:
      git add . && git commit -m 'pipeline deployed with github action'
      
    • Push the changes to the GitHub repository:
      git push origin
      

That's it! With this setup, the data pipelines will be automatically triggered and executed every three months, ensuring that your data remains up to date and ready for analysis.
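
Optionally, you can confirm that the workflow was registered and keep an eye on its runs with the GitHub CLI (a sketch assuming gh is installed and authenticated):

gh workflow list        # the dlt-generated workflow should appear here
gh run list --limit 5   # shows recent runs once the schedule (or a manual trigger) fires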