NYC Bike project

Objective

The aim of this project was to build a complete ETL pipeline covering data extraction, transformation, and ingestion. The data comes from Citi Bike, the company operating the public bike-sharing system in New York City, USA. Each month of trip data is downloaded from the Citi Bike website (data), uploaded to Google Cloud Storage (GCS), transformed with dbt, and finally turned into useful visualizations in Google Looker Studio.
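As an illustration of the extract-and-load step described above, here is a minimal Python sketch that downloads one month of trip data, converts it to Parquet, and uploads it to the GCS bucket. The URL pattern, function name, and local paths are assumptions for illustration; the actual flow lives in the \airflow folder.

# Minimal sketch of the extract-and-load step (illustrative; the real flow is in \airflow).
# Downloads one month of Citi Bike trip data, converts it to Parquet, uploads it to GCS.
# Requires: requests, pandas, pyarrow, google-cloud-storage.
import io
import zipfile

import pandas as pd
import requests
from google.cloud import storage


def extract_and_load(year_month: str, bucket_name: str = "citibike_nyc") -> None:
    # Citi Bike publishes monthly trip files; the exact file naming can vary by month.
    url = f"https://s3.amazonaws.com/tripdata/{year_month}-citibike-tripdata.csv.zip"
    resp = requests.get(url, timeout=120)
    resp.raise_for_status()

    # Read the first CSV inside the zip archive into a DataFrame.
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        csv_name = next(n for n in zf.namelist() if n.endswith(".csv"))
        df = pd.read_csv(zf.open(csv_name), parse_dates=["started_at", "ended_at"])

    # Write Parquet locally, then upload it to the bucket created by Terraform.
    local_path = f"{year_month}-citibike-tripdata.parquet"
    df.to_parquet(local_path, index=False)
    storage.Client().bucket(bucket_name).blob(f"raw/{local_path}").upload_from_filename(local_path)


if __name__ == "__main__":
    extract_and_load("202301")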

Architecture

Problem statement

The project set out to answer the following questions: Where do Citi Bike riders ride? When do they ride? How far do they go? Which stations are most popular? What days of the week are most popular?

Dataset description

The dataset has the following columns:

  • ride_id: unique identifier for each ride
  • rideable_type: type of bike used for the ride
  • started_at: start time of the ride
  • ended_at: end time of the ride
  • start_station_name: name of the start station
  • start_station_id: id of the start station
  • end_station_name: name of the end station
  • end_station_id: id of the end station
  • start_lat: latitude of the trip starting position
  • start_lng: longitude of the trip starting position
  • end_lat: latitude of the trip ending position
  • end_lng: longitude of the trip ending position
  • member_casual: whether the rider is a member or a casual user
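
Once a month of trip data is available locally, two of the questions above (most popular stations, busiest days of the week) can be checked quickly with pandas. The file path below is illustrative; in the pipeline the data lives in GCS and BigQuery.

# Quick exploration of one month of trip data using the columns listed above.
import pandas as pd

df = pd.read_parquet("202301-citibike-tripdata.parquet")
df["started_at"] = pd.to_datetime(df["started_at"])

# Which start stations are most popular?
print(df["start_station_name"].value_counts().head(10))

# Which days of the week see the most rides?
print(df["started_at"].dt.day_name().value_counts())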

Proposal

Technologies

The following technologies are used:

  • Google Cloud Platform: Cloud Storage for the Parquet files and BigQuery for the datasets
  • Terraform: infrastructure as code
  • Apache Airflow (running in Docker): orchestration
  • dbt Cloud: transformations
  • Google Looker Studio: dashboard

Repository organization

  • \airflow: Airflow flow files.
    • \images: pictures.
  • \dbt: dbt files (dbt_project.yml, models, etc.).
  • \terraform: Terraform files defining the infrastructure to deploy.
  • \GCP_setup.md: instructions to configure your GCP account.
  • \README.md: this document.

Infrastructure as code:

Use Terraform to create a GCS bucket and BigQuery datasets:

  • citibike_nyc bucket to store the Parquet files.
  • raw dataset for ingestion into BigQuery.
  • development dataset for the dbt Cloud development environment.
  • production dataset for the dbt Cloud production environment.
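
For reference, the same resources can also be created with the Google Cloud Python client libraries; this is only an illustrative equivalent of what the Terraform configuration in \terraform provisions (the project id and location below are placeholders).

# Illustrative Python-client equivalent of the Terraform resources above:
# one GCS bucket and three BigQuery datasets.
from google.cloud import bigquery, storage

PROJECT = "your-gcp-project-id"  # placeholder
LOCATION = "US"                  # assumed location

# Bucket names must be globally unique, so "citibike_nyc" may need a suffix.
storage.Client(project=PROJECT).create_bucket("citibike_nyc", location=LOCATION)

bq = bigquery.Client(project=PROJECT)
for name in ("raw", "development", "production"):
    dataset = bigquery.Dataset(f"{PROJECT}.{name}")
    dataset.location = LOCATION
    bq.create_dataset(dataset, exists_ok=True)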

Orchestration:
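
The flow files live in the \airflow folder and Airflow runs in Docker (see Setup and running). Below is a minimal sketch of what a monthly ingestion DAG could look like; the task ids, schedule, module path, and table names are illustrative assumptions, not the repo's actual DAG.

# Illustrative monthly ingestion DAG: download the month's trips to GCS, then load
# the Parquet file into the raw BigQuery dataset. Not the repo's actual DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Hypothetical module wrapping the extract_and_load sketch from the Objective section.
from pipelines.download import extract_and_load

with DAG(
    dag_id="citibike_ingestion",
    schedule_interval="@monthly",
    start_date=datetime(2023, 1, 1),
    catchup=True,
) as dag:
    download_to_gcs = PythonOperator(
        task_id="download_to_gcs",
        python_callable=extract_and_load,
        op_kwargs={"year_month": "{{ ds_nodash[:6] }}"},
    )

    load_to_bigquery = GCSToBigQueryOperator(
        task_id="load_to_bigquery",
        bucket="citibike_nyc",
        source_objects=["raw/{{ ds_nodash[:6] }}-citibike-tripdata.parquet"],
        destination_project_dataset_table="raw.citibike_tripdata",
        source_format="PARQUET",
        write_disposition="WRITE_APPEND",
    )

    download_to_gcs >> load_to_bigquery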

Transformations using dbt:
Use dbt Cloud to perform joins and aggregations in BigQuery.

  • Staging (materialized=view):

    • New York City rides information: create a staged model from the citibike_tripdata table in BigQuery.
    • The output is the stg_tripdata model, with a distance-travelled column added (see the sketch after this list) and the latitude and longitude columns concatenated.
  • Core (materialized=table):

    • fact_trips model materialized from the stg_tripdata model.
  • Job:

    • For convenient creation of the production dataset, a dbt build job is created.
    • This job can be run manually (or scheduled) from dbt cloud.
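
The distance column added by the staging model is derived from the start and end coordinates; one common way to compute such a distance is the haversine formula, sketched below in Python for illustration (the actual transformation is dbt SQL in the \dbt folder).

# Illustrative haversine calculation for a trip-distance column derived from
# the start/end coordinates (the staging model itself is dbt SQL).
import math


def haversine_km(start_lat, start_lng, end_lat, end_lng):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    radius_km = 6371.0
    dlat = math.radians(end_lat - start_lat)
    dlng = math.radians(end_lng - start_lng)
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(math.radians(start_lat))
        * math.cos(math.radians(end_lat))
        * math.sin(dlng / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


# Example usage with illustrative NYC coordinates.
print(haversine_km(40.7812, -73.9665, 40.7061, -73.9969))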

Dashboard:

Connect Google Looker Studio (formerly Data Studio) to the BigQuery dataset and design the dashboard.

Results

Dashboard

You can check my dashboard here: https://lookerstudio.google.com/s/iCKaAhBrFg0

Setup and running

Airflow runs as a Docker container. For data transformation, dbt Cloud is used to run the transformation pipeline.

Your GCP account will be used and, unless you still have Google's welcome credit, this will incur some cost. Your dbt Cloud account will also be used; the developer account is free.

If you wish to install the required tools on your own machine, the instructions in GCP_setup.md are a good starting point.

Run pipelines

  1. Set up your Google Cloud environment:
     export GOOGLE_APPLICATION_CREDENTIALS=<path_to_your_credentials>.json
     gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
     gcloud auth application-default login
  2. Install all required dependencies into your environment:
     pip install -r requirements.txt
  3. Create the infrastructure with Terraform:
     cd terraform
     terraform init
     terraform plan -var="project=<your-gcp-project-id>"
     terraform apply -var="project=<your-gcp-project-id>"

