CryptoScout

A data pipeline deployed on Google Cloud that extracts cryptocurrency data for analytics. Integrates tools such as Airflow, Spark, dbt, Docker, Terraform, and various GCP services!

Description

Objective

The main objective is to deliver a pipeline which automates the daily extraction of cryptocurrency data and serves it up for analytical workloads (OLAP). At a high level, the batch data pipeline consumes source data, stores it in a data lake, transforms it, and materializes dimensionally modelled tables in a data warehouse suited for analytical reporting. Lastly, a dashboard is connected to the data warehouse for visualization and analysis of the modelled data.
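
As a rough illustration of this flow, the sketch below lays out the daily batch as an Airflow DAG. It is illustrative only: the task names, schedule, and helper logic are assumptions rather than the DAG actually defined in this repository.

# Illustrative only: task names and bodies are placeholders, not the
# repository's actual DAG. Assumes Airflow 2.4+ (the `schedule` argument).
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2022, 1, 1), catchup=False)
def cryptoscout_batch():
    @task
    def extract_to_lake():
        # Call the ingestion service / CoinCap API and land raw JSON in GCS.
        ...

    @task
    def spark_transform():
        # Submit a Spark job to clean and conform the raw data into staging.
        ...

    @task
    def dbt_build():
        # Run dbt to materialize the dimensional model in BigQuery.
        ...

    extract_to_lake() >> spark_transform() >> dbt_build()


cryptoscout_batch()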

The following secondary objectives, both technical and personal, were kept in mind when designing the pipeline:

  1. Ease of scaling as source data volumes grow in the future
  2. Minimize GCP cloud service costs
  3. Learn as much as possible! Gain exposure to different tools and technologies while integrating them

Source Data

All source data used in this project is extracted from different CoinCap API endpoints. Some of the endpoints used offer historical data, while others return only a snapshot taken at the time of the request. CoinCap itself is a service that aggregates real-time cryptocurrency exchange data from multiple markets. A minimal example of calling the API is sketched after the notes below.

  • The pipeline was run for approximately a month (May 2023), with historical data backfilled from Jan 2022.
  • Due to API rate limits, as well as an effort to keep GCP costs low, only a small subset of cryptocurrency assets and exchanges is considered in this project.
  • Unfortunately, CoinCap stopped supporting the /candles endpoint in early 2023.
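
For reference, here is a minimal standalone example of pulling daily history for one asset from the public CoinCap v2 API. The project's actual ingestion runs through a containerized FastAPI service with Pydantic validation; this sketch only illustrates the kind of request involved, and the asset id and interval are arbitrary.

# Minimal illustration of a CoinCap v2 request; the project's actual
# ingestion path goes through a FastAPI service, not this script.
import requests

BASE_URL = "https://api.coincap.io/v2"


def fetch_asset_history(asset_id: str, interval: str = "d1") -> list[dict]:
    """Return the price history for a single asset, e.g. 'bitcoin'."""
    resp = requests.get(
        f"{BASE_URL}/assets/{asset_id}/history",
        params={"interval": interval},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]


if __name__ == "__main__":
    history = fetch_asset_history("bitcoin")
    print(f"Fetched {len(history)} daily price points for bitcoin")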

Tools / Technologies

Type / Purpose                   | Tooling
Cloud Infrastructure             | Google Cloud Platform (GCP)
Infrastructure as Code (IaC)     | Terraform
Orchestration                    | Apache Airflow
Containerization                 | Docker, Docker Compose
REST API data ingestion service  | FastAPI
Data Quality Validation          | Pydantic, dbt
Data Transformation / Modelling  | Apache Spark, dbt
Data Lake                        | Google Cloud Storage (GCS)
Data Warehouse                   | BigQuery
Data Visualization               | Looker Studio

Data Pipeline Architecture

Airflow DAG

dbt DAG

Dimensional Model

The Kimball methodology was applied to dimensionally model the data in the data warehouse. An ERD depicting the relationships between the fact and dimension tables is presented below:
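
To give a feel for how such a model is queried, here is a hypothetical star-schema aggregation run against BigQuery. The project, dataset, table, and column names are assumptions in typical Kimball style, not the actual schema shown in the ERD.

# Hypothetical star-schema query; project, dataset, table, and column
# names are illustrative assumptions, not the repository's real schema.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

QUERY = """
SELECT
    d.date_day,
    a.asset_name,
    SUM(f.volume_usd) AS total_volume_usd
FROM `my-gcp-project.crypto_dwh.fct_asset_history` AS f
JOIN `my-gcp-project.crypto_dwh.dim_asset` AS a ON f.asset_key = a.asset_key
JOIN `my-gcp-project.crypto_dwh.dim_date` AS d ON f.date_key = d.date_key
GROUP BY d.date_day, a.asset_name
ORDER BY d.date_day, a.asset_name
"""

for row in client.query(QUERY).result():
    print(row.date_day, row.asset_name, row.total_volume_usd)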

Analytics Dashboard

Click here to interact with the dashboard!

Getting Started

The steps below assume you are developing in a Linux environment.

Setup & Prerequisites

  1. Create a new GCP project
  2. Local system installation of:

Installing Dev Tools (Optional)

This project uses pipx to install dev tools from PyPI in isolated Python environments. The following dev tools are installed with pipx:

  • black - Python code formatter
  • flake8 - Python linter
  • isort - Python import sorter
  • mypy - Python static type checker
  • sqlfluff - SQL linter and formatter

Install pipx by running the following command from the repository root:

make install-pipx

Once pipx is installed, the dev tools listed above can be installed as follows:

make pipx-devtools

Project-wide code formatting, type checking, and linting can then be applied:

make ci

Deploying to GCP

Data pipeline deployment to Google Cloud is fully defined and managed by Terraform. For details on how to set up the GCP infrastructure with Terraform, click here.

Potential Improvements

  • Deploying a Cloud Run service is overkill for data ingestion. While developing, containerizing, and deploying an API service was a good learning experience, it would have been simpler to use Cloud Run Jobs or Cloud Functions to ingest the data (a sketch of this approach follows this list).
  • Dataproc Serverless compute startup and shutdown adds overhead time to each DAG run. If Spark jobs still needed to be run, deploying a persistent Dataproc cluster would remove this overhead at a higher cost.
  • Replace the Spark jobs with plain Python processes; Spark is overkill for the small volume of daily data handled in this project.
  • Build fact tables incrementally instead of doing a full refresh to reduce dbt build time as tables accumulate more data over time.
  • Deploy a production version of Airflow via Helm Charts (GKE deployment) or use a managed version of Airflow (Cloud Composer, Astronomer, etc.)
  • Add more integration tests and end-to-end pipeline tests
  • Add more data validation and quality checks
  • Improve data pipeline monitoring and alerting
  • Implement CI/CD
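
As a sketch of the first improvement above, a small HTTP-triggered Cloud Function (using functions-framework) could replace the Cloud Run ingestion service. The function name, bucket, and object layout here are hypothetical assumptions, not part of the current deployment.

# Hypothetical sketch of the simpler ingestion approach: an HTTP-triggered
# Cloud Function that lands a raw CoinCap payload in the GCS data lake.
# The bucket name and object path are illustrative assumptions.
import json

import functions_framework
import requests
from google.cloud import storage


@functions_framework.http
def ingest_coincap(request):
    # Which CoinCap endpoint to pull, e.g. ?endpoint=assets or ?endpoint=exchanges
    endpoint = request.args.get("endpoint", "assets")
    resp = requests.get(f"https://api.coincap.io/v2/{endpoint}", timeout=30)
    resp.raise_for_status()

    # Write the raw JSON payload to the data lake bucket.
    bucket = storage.Client().bucket("example-datalake-bucket")
    blob = bucket.blob(f"raw/coincap/{endpoint}.json")
    blob.upload_from_string(
        json.dumps(resp.json()), content_type="application/json"
    )
    return f"Wrote CoinCap '{endpoint}' payload to GCS", 200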