
OpenPowerlifting Data Pipeline

A data pipeline that extracts, transforms, and loads data from Open Powerlifting records.

[Screenshot: OpenPowerlifting dashboard]

Table of Contents

  • Project Overview
  • Extracting API data
  • Transforming CSV data
  • Loading Parquet data
  • Analysing Parquet data with Amazon Athena
  • Running Locally
  • Shutting Down
  • Troubleshooting

Project Overview

This project contains a data pipeline that extracts, transforms, and loads powerlifting data into a data lake hosted on Amazon S3.

Features:

  • ETL from API to curated data lake
  • Containerised Airflow environment for orchestration
  • Deployable data platform using Infrastructure as Code with Terraform
  • Glue Data Catalog enabling ad-hoc SQL queries on curated Parquet data with Amazon Athena
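
At a high level, these pieces come together as a single daily Airflow DAG. The TaskFlow-style skeleton below is only a sketch: the DAG id, schedule, and task names are assumptions and the task bodies are elided, so it illustrates the wiring rather than the exact code in this repository.

from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def openpowerlifting_pipeline():
    @task
    def extract() -> str:
        # Download the latest CSV export and land it in the raw S3 zone (body elided).
        ...

    @task
    def transform(raw_key: str) -> str:
        # Clean the raw CSV with pandas and stage the curated output (body elided).
        ...

    @task
    def load(curated_key: str) -> None:
        # Write the curated data to S3 as compressed Parquet (body elided).
        ...

    load(transform(extract()))


openpowerlifting_pipeline()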

Extracting API data

Data is extracted daily from the Open Powerlifting API and loaded into a raw S3 landing zone in CSV format. To reduce storage costs, raw data is retained for 10 days before automatic deletion.
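
A minimal sketch of the extract step is shown below; the export URL, bucket name, and key layout are placeholders and assumptions, not necessarily what the DAG in this repository uses.

from datetime import date

import boto3
import requests

RAW_BUCKET = "openpowerlifting-raw"                      # hypothetical raw landing-zone bucket
EXPORT_URL = "https://example.com/openpowerlifting.csv"  # placeholder for the Open Powerlifting export URL


def extract_to_raw_zone() -> str:
    # Download the latest CSV export and land it in the raw zone under a dated key.
    response = requests.get(EXPORT_URL, timeout=300)
    response.raise_for_status()

    key = f"raw/openpowerlifting_{date.today():%Y%m%d}.csv"
    boto3.client("s3").put_object(Bucket=RAW_BUCKET, Key=key, Body=response.content)
    return key

The 10-day retention itself would typically be an S3 lifecycle rule on the raw bucket (managed in Terraform) rather than pipeline code.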

Transforming CSV data

CSV data from the raw S3 bucket is transformed with pandas by modifying columns and values to improve clarity in downstream reporting.
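
For illustration, the transform might look something like the following; the source column names come from the public Open Powerlifting data dictionary, and the specific renames and filters here are assumptions rather than the repository's actual logic.

import pandas as pd


def transform(raw_csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(raw_csv_path)

    # Rename lift columns to clearer names for downstream reporting (hypothetical mapping).
    df = df.rename(
        columns={
            "Best3SquatKg": "best_squat_kg",
            "Best3BenchKg": "best_bench_kg",
            "Best3DeadliftKg": "best_deadlift_kg",
            "TotalKg": "total_kg",
        }
    )

    # Drop entries without a recorded total (e.g. disqualified lifters).
    return df[df["total_kg"].notna()]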

Loading Parquet data

Transformed data is converted to compressed Parquet to reduce storage costs and improve read performance.
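
A sketch of the load step, assuming pandas with a pyarrow/s3fs backend and a hypothetical curated bucket name:

import pandas as pd

CURATED_BUCKET = "openpowerlifting-curated"  # hypothetical curated bucket


def load_curated(df: pd.DataFrame, run_date: str) -> None:
    # Snappy-compressed Parquet is compact and column-oriented, so Athena scans
    # only the columns a query touches. Writing to an s3:// path requires s3fs.
    df.to_parquet(
        f"s3://{CURATED_BUCKET}/curated/openpowerlifting_{run_date}.parquet",
        compression="snappy",
        index=False,
    )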

Analysing Parquet data with Amazon Athena

Terraform provisions a Glue Crawler that scans the curated S3 bucket and populates a Glue Database, enabling ad-hoc SQL queries with Amazon Athena.
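
Once the crawler has populated the catalog, the curated table can be queried ad hoc. The example below uses the AWS SDK for pandas (awswrangler), which is not necessarily part of this project; the database, table, and column names are assumptions.

import awswrangler as wr

top_totals = wr.athena.read_sql_query(
    sql="""
        SELECT name, MAX(total_kg) AS best_total_kg
        FROM openpowerlifting_curated   -- hypothetical table created by the crawler
        GROUP BY name
        ORDER BY best_total_kg DESC
        LIMIT 10
    """,
    database="openpowerlifting",  # hypothetical Glue database name
)
print(top_totals.head())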

Running Locally

This repository provides a Docker Compose setup for running Apache Airflow locally. It allows you to quickly set up an Airflow environment for development, testing, or demonstration purposes.

Requirements

  • Docker
  • Docker Compose (V2)

Getting Started

  1. Clone this repository:
git clone git@github.com:jack-white9/openpowerlifting-data-pipeline.git
cd openpowerlifting-data-pipeline
  2. Create the required AWS infrastructure:
cd terraform
terraform init
terraform apply
cd ..
  3. Add AWS credentials to the Docker environment:
touch airflow/.env
echo "AWS_ACCESS_KEY_ID=<your aws access key id>" >> airflow/.env
echo "AWS_SECRET_ACCESS_KEY=<your aws secret access key>" >> airflow/.env
  4. Start the local Airflow services with Docker Compose:
cd airflow
docker compose up --build -d
  5. Access the Airflow web interface:

Open a web browser and go to http://localhost:8080 to view the Airflow web UI.

Shutting Down

To remove the AWS infrastructure, run the following from the terraform directory:

terraform destroy

To stop and remove the Airflow containers, run the following from the airflow directory:

docker compose down

Troubleshooting

  • If you encounter any issues, refer to the Apache Airflow documentation for troubleshooting tips and guidance.
  • Check the logs of Airflow services for error messages:
docker compose logs <service_name>
