A data pipeline deployed on Google Cloud that extracts cryptocurrency data for analytics, integrating tools such as Airflow, Spark, dbt, Docker, Terraform, and various GCP services.
The main objective is to deliver a pipeline that automates the daily extraction of cryptocurrency data and serves it for analytical workloads (OLAP). At a high level, the batch pipeline consumes source data, stores it in a data lake, transforms it, and materializes dimensionally modelled tables in a data warehouse suited to analytical reporting. Lastly, a dashboard is connected to the data warehouse for visualization and analysis of the modelled data.
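The batch flow described above can be sketched with plain Python stubs. This is only an illustration of the stage ordering; all function names, payloads, and paths are hypothetical, not the project's actual task code.

```python
# Stdlib sketch of the daily batch flow: extract -> data lake -> transform
# -> warehouse. Every function here is a stub standing in for a real stage.

def extract() -> list[dict]:
    """Pull raw records from the source API (stubbed)."""
    return [{"asset": "bitcoin", "price_usd": "26500.12"}]

def load_to_lake(records: list[dict]) -> str:
    """Persist raw records to the data lake; returns the object path (hypothetical)."""
    return f"gs://raw-bucket/assets/{len(records)}-records.json"

def transform(path: str) -> list[dict]:
    """Clean and type the raw records (Spark's role in the real pipeline)."""
    return [{"asset": "bitcoin", "price_usd": 26500.12}]

def build_warehouse_tables(rows: list[dict]) -> int:
    """Materialize modelled tables in the warehouse (dbt's role); returns row count."""
    return len(rows)

def run_daily_pipeline() -> int:
    """One sequential daily run through all four stages."""
    raw = extract()
    path = load_to_lake(raw)
    clean = transform(path)
    return build_warehouse_tables(clean)
```

In the deployed pipeline each stage is a separate orchestrated task rather than a function call, so failures can be retried per stage.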
The following secondary design and personal objectives were kept in mind when designing the pipeline:
- Ease of scalability with future source data increases
- Minimize GCP cloud service costs
- Learn as much as possible! Gain exposure to different tools and technologies while integrating them together
All source data used in this project is extracted from different CoinCap API endpoints. While some endpoints offer historical data, others offer only a snapshot taken at request time. CoinCap is a service that collects real-time cryptocurrency exchange data from multiple markets.
- The pipeline was run for approximately a month (May 2023), with historical data backfilled from Jan 2022.
- Due to API rate limits, and in an effort to keep GCP costs low, only a limited subset of cryptocurrency assets and exchanges is considered for the purposes of this project.
- Unfortunately, CoinCap stopped supporting the `/candles` endpoint in early 2023.
| Type / Purpose | Tooling |
|---|---|
| Cloud Infrastructure | Google Cloud Platform (GCP) |
| Infrastructure as Code (IaC) | Terraform |
| Orchestration | Apache Airflow |
| Containerization | Docker, Docker Compose |
| REST API data ingestion service | FastAPI |
| Data Quality Validation | Pydantic, dbt |
| Data Transformation / Modelling | Apache Spark, dbt |
| Data Lake | Google Cloud Storage (GCS) |
| Data Warehouse | BigQuery |
| Data Visualization | Looker Studio |
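The table above pairs Pydantic with dbt for data quality validation. As a rough stdlib-only illustration of the kind of record-level checks a Pydantic model enforces declaratively (the field names here are hypothetical, and a real Pydantic model would replace the manual `__post_init__` logic):

```python
from dataclasses import dataclass

@dataclass
class AssetRecord:
    """One asset snapshot; fields and checks are illustrative only."""
    asset_id: str
    price_usd: float

    def __post_init__(self) -> None:
        # Reject blank IDs and coerce/validate the price, roughly what a
        # Pydantic model does from its type annotations alone.
        if not self.asset_id:
            raise ValueError("asset_id must be non-empty")
        self.price_usd = float(self.price_usd)
        if self.price_usd < 0:
            raise ValueError("price_usd must be non-negative")
```

Validating at ingestion time keeps malformed records out of the data lake, while dbt tests catch issues further downstream in the warehouse.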
A Kimball methodology was applied to dimensionally model the data in the data warehouse. An ERD depicting the relationships between fact and dimension tables is presented below:
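As a minimal sketch of what a Kimball-style star join over such tables looks like, here is an in-memory example using `sqlite3` (the table and column names are hypothetical, not the project's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per asset.
    CREATE TABLE dim_asset (
        asset_key INTEGER PRIMARY KEY,
        asset_id  TEXT,
        name      TEXT
    );
    -- Fact table: one row per asset per day, keyed to the dimension.
    CREATE TABLE fct_asset_price (
        date_key  TEXT,
        asset_key INTEGER REFERENCES dim_asset (asset_key),
        price_usd REAL
    );
    INSERT INTO dim_asset VALUES (1, 'bitcoin', 'Bitcoin');
    INSERT INTO fct_asset_price VALUES ('2023-05-01', 1, 26500.12);
""")

# A typical analytical query: join the fact to its dimension.
row = conn.execute("""
    SELECT d.name, f.date_key, f.price_usd
    FROM fct_asset_price AS f
    JOIN dim_asset AS d USING (asset_key)
""").fetchone()
```

Keeping descriptive attributes in dimensions and measurements in facts is what lets reporting queries stay simple joins like the one above.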
Click here to interact with the dashboard!
The steps below assume you are developing in a Linux environment.
- Create a new GCP project
- Local system installation of:
- Terraform available on your PATH
- gcloud CLI
- Docker and Docker Compose
- Python 3
This project uses pipx to install dev tools from PyPI in isolated Python environments. The following dev tools are installed with pipx:
- black - Python code formatter
- flake8 - Python linter
- isort - Python import sorter
- mypy - Python static type checker
- sqlfluff - SQL linter and formatter
Install pipx by running the following command in the current directory:
```shell
make install-pipx
```
Once pipx is installed, the dev tools listed above can be installed as follows:
```shell
make pipx-devtools
```
Project-wide code formatting, typing, and linting can then be applied:
```shell
make ci
```
Data pipeline deployment to Google Cloud is fully defined and managed by Terraform. For details on how to set up GCP infrastructure with Terraform click here.
- Deploying a Cloud Run service is overkill for data ingestion. While developing, containerizing, and deploying an API service was a good learning experience, it would have been simpler to develop Cloud Run Jobs or Cloud Functions to ingest the data.
- Dataproc Serverless adds compute startup and shutdown overhead to every DAG run. If Spark jobs were still needed, deploying a persistent Dataproc cluster would remove this overhead, at a higher cost.
- Replace the Spark jobs with plain Python processes; Spark is overkill for the small volume of daily data handled in this project.
- Build fact tables incrementally instead of doing a full refresh to reduce dbt build time as tables accumulate more data over time.
- Deploy a production version of Airflow via Helm Charts (GKE deployment) or use a managed version of Airflow (Cloud Composer, Astronomer, etc.)
- Add more integration tests and end-to-end pipeline tests
- Add more data validation and quality checks
- Improve data pipeline monitoring and alerting
- Implement CI/CD
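One of the simplifications noted above — replacing the Cloud Run ingestion service with a Cloud Function — might look like the following sketch. The entry-point signature follows the HTTP-triggered functions convention (a Flask request in, a response tuple out), while the parameter handling, endpoint default, and object path are hypothetical; the API call and GCS write are stubbed.

```python
import json
from datetime import datetime, timezone

def ingest(request) -> tuple[str, int]:
    """HTTP-triggered entry point: pull one API snapshot and land it in the lake.

    `request` is the Flask request object Cloud Functions passes in; only
    `get_json()` is used here, so a simple stub works for local testing.
    """
    params = request.get_json(silent=True) or {}
    endpoint = params.get("endpoint", "assets")
    # In a real function this would call the source API and write the payload
    # to a GCS bucket via google-cloud-storage; both are stubbed here.
    object_path = f"raw/{endpoint}/{datetime.now(timezone.utc):%Y-%m-%d}.json"
    return json.dumps({"written": object_path}), 200
```

Compared to a always-on service, a function like this only incurs cost per invocation, which fits a once-daily batch extract well.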