Applying SWE best practices to my own data pipeline using a modern analytics stack

jairus-m/StravaELTPipeline

Strava ELT

Demo Project

The goal of this project is to implement the skeleton of a robust ELT pipeline, covering:

  • version control
  • development flow
  • project and file structure
  • unit testing
  • logging
  • documentation
  • virtual environments/dependency management
  • orchestration
  • general best practices for data engineering
  • containerization
  • supporting downstream analytics/ML

Overall ELT pipeline

Strava API --> Python --> BigQuery + dbt --> Tableau/ML in Jupyter Notebook

  • light data transformation with Pandas
  • orchestration through Google Cloud services
  • data storage in BigQuery
  • final data transformations (dimensional modeling + OBT) for downstream analytics through dbt
  • containerization via Docker
  • ELT job notifications sent through Slack
  • downstream analytics supported by this pipeline
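The "light data transformation with Pandas" step can be sketched as follows. This is an illustrative example, not the repo's actual code: the helper name `transform_activities` is hypothetical, and the field names (`distance` in meters, `moving_time` in seconds) follow the public Strava API activity schema.

```python
# Hypothetical sketch: flatten raw Strava activity JSON into a tidy
# DataFrame before loading to BigQuery. Column choices are illustrative.
import pandas as pd

def transform_activities(raw_activities: list[dict]) -> pd.DataFrame:
    """Keep a few analysis-ready columns and convert units."""
    df = pd.DataFrame(raw_activities)
    df["start_date"] = pd.to_datetime(df["start_date"])
    df["distance_km"] = df["distance"] / 1000        # Strava reports meters
    df["moving_time_min"] = df["moving_time"] / 60   # Strava reports seconds
    return df[["id", "name", "type", "start_date", "distance_km", "moving_time_min"]]

sample = [
    {"id": 1, "name": "Morning Run", "type": "Run",
     "start_date": "2024-01-01T07:00:00Z", "distance": 5000, "moving_time": 1500},
]
print(transform_activities(sample))
```

Keeping this step "light" (unit conversion, column selection) and deferring modeling to dbt is what makes the pipeline ELT rather than ETL.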

Deployment

  • Python application is containerized and pushed to Google Cloud Artifact Registry
  • Container is then deployed on Cloud Run Jobs at a set schedule
  • At midnight each day, the ELT pipeline runs, checking for new data to upload to BigQuery
  • On job completion, a Slack notification with job metadata and success status is sent
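The job-completion Slack notification can be sketched with a standard incoming webhook, which accepts a JSON body with a `text` field. This is an assumption-laden example: `build_job_message` is a hypothetical helper, and the webhook URL is a placeholder, not the repo's actual configuration.

```python
# Sketch of a job-completion Slack notification via an incoming webhook.
# The helper names and message format are illustrative assumptions.
import json
import urllib.request

def build_job_message(job_name: str, rows_loaded: int, success: bool) -> dict:
    """Assemble the Slack payload with job metadata and success status."""
    status = "SUCCESS" if success else "FAILED"
    return {"text": f"ELT job `{job_name}` finished: {status} ({rows_loaded} rows loaded)"}

def notify_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

payload = build_job_message("strava_elt", rows_loaded=42, success=True)
# notify_slack("https://hooks.slack.com/services/<placeholder>", payload)
```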

Folder structure

  • configs : .yml file containing API tokens, database credentials, and ELT parameters
  • src : source code
  • tests : unit tests
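Reading the configs .yml file might look like the following, using PyYAML's `safe_load`. The key names (`strava`, `bigquery`, `elt`) and values shown are assumptions about the file's layout, not the repo's actual schema.

```python
# Sketch of loading the configs/*.yml file. The structure below is an
# assumed example layout; real secrets should never be committed.
import yaml  # PyYAML

example_yaml = """
strava:
  client_id: "12345"
  refresh_token: "REDACTED"
bigquery:
  project: my-gcp-project
  dataset: strava
elt:
  lookback_days: 7
"""

config = yaml.safe_load(example_yaml)
print(config["elt"]["lookback_days"])  # -> 7
```

Keeping credentials in an untracked config file (or a secret manager) separates deployment-specific settings from the versioned source code.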
