CryptoScout

A data pipeline deployed on Google Cloud that extracts cryptocurrency data for analytics. Integrates tools such as Airflow, Spark, dbt, Docker, Terraform, and various GCP services!

Description

Objective

The main objective is to deliver a pipeline which automates the daily extraction of cryptocurrency data and serves it up for analytical workloads (OLAP). At a high level, the batch data pipeline consumes source data, stores it in a data lake, transforms it, and materializes dimensionally modelled tables in a data warehouse suited for analytical reporting. Lastly, a dashboard is connected to the data warehouse for visualization and analysis of the modelled data.
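
As a rough illustration of this flow, the sketch below lays out the daily batch as an Airflow DAG. It is illustrative only: the task names, schedule, and helper logic are assumptions rather than the DAG actually defined in this repository.

# Illustrative only: task names and bodies are placeholders, not the
# repository's actual DAG. Assumes Airflow 2.4+ (the `schedule` argument).
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2022, 1, 1), catchup=False)
def cryptoscout_batch():
    @task
    def extract_to_lake():
        # Call the ingestion service / CoinCap API and land raw JSON in GCS.
        ...

    @task
    def spark_transform():
        # Submit a Spark job to clean and conform the raw data into staging.
        ...

    @task
    def dbt_build():
        # Run dbt to materialize the dimensional model in BigQuery.
        ...

    extract_to_lake() >> spark_transform() >> dbt_build()


cryptoscout_batch()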

The following secondary objectives, both technical and personal, were kept in mind when designing the pipeline:

  1. Ease of scaling as source data volumes grow in the future
  2. Minimize GCP cloud service costs
  3. Learn as much as possible! Gain exposure to different tools and technologies while integrating them

Source Data

All source data used in this project is extracted from different CoinCap API endpoints. Some of the endpoints used offer historical data, while others return only a snapshot taken at the time of the request. CoinCap itself is a service that aggregates real-time cryptocurrency exchange data from multiple markets. A minimal example of calling the API is sketched after the notes below.

  • The pipeline was run for approximately a month (May 2023), with historical data backfilled from Jan 2022.
  • Due to API rate limits, as well as an effort to keep GCP costs low, only a small subset of cryptocurrency assets and exchanges is considered in this project.
  • Unfortunately, CoinCap stopped supporting the /candles endpoint in early 2023.
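
For reference, here is a minimal standalone example of pulling daily history for one asset from the public CoinCap v2 API. The project's actual ingestion runs through a containerized FastAPI service with Pydantic validation; this sketch only illustrates the kind of request involved, and the asset id and interval are arbitrary.

# Minimal illustration of a CoinCap v2 request; the project's actual
# ingestion path goes through a FastAPI service, not this script.
import requests

BASE_URL = "https://api.coincap.io/v2"


def fetch_asset_history(asset_id: str, interval: str = "d1") -> list[dict]:
    """Return the price history for a single asset, e.g. 'bitcoin'."""
    resp = requests.get(
        f"{BASE_URL}/assets/{asset_id}/history",
        params={"interval": interval},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]


if __name__ == "__main__":
    history = fetch_asset_history("bitcoin")
    print(f"Fetched {len(history)} daily price points for bitcoin")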

Tools / Technologies

Type / Purpose                   | Tooling
Cloud Infrastructure             | Google Cloud Platform (GCP)
Infrastructure as Code (IaC)     | Terraform
Orchestration                    | Apache Airflow
Containerization                 | Docker, Docker Compose
REST API data ingestion service  | FastAPI
Data Quality Validation          | Pydantic, dbt
Data Transformation / Modelling  | Apache Spark, dbt
Data Lake                        | Google Cloud Storage (GCS)
Data Warehouse                   | BigQuery
Data Visualization               | Looker Studio

Data Pipeline Architecture

Airflow DAG

dbt DAG

Dimensional Model

The Kimball methodology was applied to dimensionally model the data in the data warehouse. An ERD depicting the relationships between the fact and dimension tables is presented below:
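
To give a feel for how such a model is queried, here is a hypothetical star-schema aggregation run against BigQuery. The project, dataset, table, and column names are assumptions in typical Kimball style, not the actual schema shown in the ERD.

# Hypothetical star-schema query; project, dataset, table, and column
# names are illustrative assumptions, not the repository's real schema.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

QUERY = """
SELECT
    d.date_day,
    a.asset_name,
    SUM(f.volume_usd) AS total_volume_usd
FROM `my-gcp-project.crypto_dwh.fct_asset_history` AS f
JOIN `my-gcp-project.crypto_dwh.dim_asset` AS a ON f.asset_key = a.asset_key
JOIN `my-gcp-project.crypto_dwh.dim_date` AS d ON f.date_key = d.date_key
GROUP BY d.date_day, a.asset_name
ORDER BY d.date_day, a.asset_name
"""

for row in client.query(QUERY).result():
    print(row.date_day, row.asset_name, row.total_volume_usd)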

Analytics Dashboard

Click here to interact with the dashboard!

Getting Started

The steps below assume you are developing in a Linux environment.

Setup & Prerequisites

  1. Create a new GCP project
  2. Local system installation of:

Installing Dev Tools (Optional)

This project uses pipx to install dev tools from PyPI in isolated Python environments. The following dev tools are installed with pipx:

  • black - Python code formatter
  • flake8 - Python linter
  • isort - Python import sorter
  • mypy - Python static type checker
  • sqlfluff - SQL linter and formatter

Install pipx by running the following command from the repository root:

make install-pipx

Once pipx is installed, the dev tools listed above can be installed as follows:

make pipx-devtools

Project-wide code formatting, type checking, and linting can then be applied:

make ci

Deploying to GCP

Data pipeline deployment to Google Cloud is fully defined and managed by Terraform. For details on how to set up the GCP infrastructure with Terraform, click here.

Potential Improvements

  • Deploying a Cloud Run service is overkill for data ingestion. While developing, containerizing, and deploying an API service was a good learning experience, it would have been simpler to use Cloud Run Jobs or Cloud Functions to ingest the data (a sketch of this approach follows this list).
  • Dataproc Serverless compute startup and shutdown adds overhead time to each DAG run. If Spark jobs still needed to be run, deploying a persistent Dataproc cluster would remove this overhead at a higher cost.
  • Replace the Spark jobs with plain Python processes; Spark is overkill for the small volume of daily data handled in this project.
  • Build fact tables incrementally instead of doing a full refresh to reduce dbt build time as tables accumulate more data over time.
  • Deploy a production version of Airflow via Helm Charts (GKE deployment) or use a managed version of Airflow (Cloud Composer, Astronomer, etc.)
  • Add more integration tests and end-to-end pipeline tests
  • Add more data validation and quality checks
  • Improve data pipeline monitoring and alerting
  • Implement CI/CD
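
As a sketch of the first improvement above, a small HTTP-triggered Cloud Function (using functions-framework) could replace the Cloud Run ingestion service. The function name, bucket, and object layout here are hypothetical assumptions, not part of the current deployment.

# Hypothetical sketch of the simpler ingestion approach: an HTTP-triggered
# Cloud Function that lands a raw CoinCap payload in the GCS data lake.
# The bucket name and object path are illustrative assumptions.
import json

import functions_framework
import requests
from google.cloud import storage


@functions_framework.http
def ingest_coincap(request):
    # Which CoinCap endpoint to pull, e.g. ?endpoint=assets or ?endpoint=exchanges
    endpoint = request.args.get("endpoint", "assets")
    resp = requests.get(f"https://api.coincap.io/v2/{endpoint}", timeout=30)
    resp.raise_for_status()

    # Write the raw JSON payload to the data lake bucket.
    bucket = storage.Client().bucket("example-datalake-bucket")
    blob = bucket.blob(f"raw/coincap/{endpoint}.json")
    blob.upload_from_string(
        json.dumps(resp.json()), content_type="application/json"
    )
    return f"Wrote CoinCap '{endpoint}' payload to GCS", 200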