A data pipeline built on Open Powerlifting records.
This project contains a data pipeline that extracts, transforms, and loads powerlifting data into a data lake hosted on Amazon S3.
Features:
- ETL from API to curated data lake
- Containerised Airflow environment for orchestration
- Deployable data platform using Infrastructure as Code with Terraform
- Data Catalog using Glue to enable ad-hoc SQL queries on curated parquet data with Amazon Athena
Data is extracted daily from the Open Powerlifting API and loaded into a raw S3 landing zone in CSV format. To reduce storage costs, raw data is retained for 10 days before automatic deletion.
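The extraction step can be sketched as below. The bucket name and source URL are placeholders, not the values used by this repo, and boto3 is imported inside the task function (an Airflow convention that keeps DAG files fast to parse):

```python
from datetime import date
from urllib.request import urlopen

# Hypothetical names -- adjust to match your deployed infrastructure.
RAW_BUCKET = "openpowerlifting-raw"
SOURCE_URL = "https://example.org/openpowerlifting-latest.csv"  # placeholder URL


def raw_key(run_date: date) -> str:
    """Partition raw landings by extraction date so the 10-day
    lifecycle rule can expire whole day-prefixes at once."""
    return f"raw/{run_date.isoformat()}/openpowerlifting.csv"


def extract(run_date: date) -> None:
    """Download the latest dump and land it in the raw S3 zone."""
    import boto3  # imported lazily so the module loads without AWS deps

    body = urlopen(SOURCE_URL).read()
    boto3.client("s3").put_object(
        Bucket=RAW_BUCKET, Key=raw_key(run_date), Body=body
    )
```

Keying objects by run date also makes reruns idempotent: re-extracting the same day simply overwrites that day's object.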
CSV data from the raw S3 bucket is transformed with pandas, renaming columns and cleaning values to improve clarity in downstream reporting.
Transformed data is written as compressed Parquet to reduce storage costs and improve query performance.
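The transform-and-load steps can be sketched as follows; the column renames and value mappings are illustrative, not the exact ones this repo applies:

```python
import pandas as pd

# Illustrative mappings -- the real pipeline's columns may differ.
COLUMN_RENAMES = {"Name": "lifter_name", "TotalKg": "total_kg"}
SEX_LABELS = {"M": "Male", "F": "Female"}


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename columns and expand coded values for clearer reporting."""
    df = raw.rename(columns=COLUMN_RENAMES)
    if "Sex" in df.columns:
        df["sex"] = df.pop("Sex").map(SEX_LABELS)
    return df


def load(df: pd.DataFrame, path: str) -> None:
    """Write curated output as snappy-compressed Parquet (requires pyarrow)."""
    df.to_parquet(path, compression="snappy", index=False)
```

Snappy is the pandas default codec for Parquet; it trades a slightly larger file than gzip for much faster decompression on read.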
Terraform provisions a Glue Crawler that scans the curated S3 bucket and populates a Glue Database, enabling ad-hoc SQL queries with Amazon Athena.
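Once the crawler has populated the catalog, the curated table can be queried from Athena. A minimal sketch using boto3, where the database, table, and results-bucket names are hypothetical:

```python
# Hypothetical example query against the crawled table.
EXAMPLE_SQL = """
SELECT lifter_name, MAX(total_kg) AS best_total_kg
FROM openpowerlifting_curated
GROUP BY lifter_name
ORDER BY best_total_kg DESC
LIMIT 10;
"""


def run_query(sql: str) -> str:
    """Submit a query to Athena and return its execution id."""
    import boto3  # imported lazily so the module loads without AWS deps

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "openpowerlifting"},  # hypothetical
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical
    )
    return response["QueryExecutionId"]
```

Athena queries are asynchronous: the returned execution id can be polled with `get_query_execution` until the query succeeds, after which results land in the configured output location.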
This repository provides a Docker Compose setup for running Apache Airflow locally, letting you quickly spin up an Airflow environment for development, testing, or demonstration purposes.
- Docker
- Docker Compose (V2)
- Clone this repository:
git clone git@github.com:jack-white9/openpowerlifting-data-pipeline.git
cd openpowerlifting-data-pipeline
- Create the required AWS infrastructure:
cd terraform
terraform apply
- Add AWS credentials to the Docker environment:
touch airflow/.env
echo "AWS_ACCESS_KEY_ID=<your aws access key id>" >> airflow/.env
echo "AWS_SECRET_ACCESS_KEY=<your aws secret access key>" >> airflow/.env
- Start local Airflow services using Docker Compose:
cd airflow
docker compose up --build -d
- Access the Airflow web interface:
Open a web browser and go to http://localhost:8080 to view the Airflow web UI.
To remove AWS infrastructure, use:
terraform destroy
To stop and remove Airflow containers, use:
docker compose down
- If you encounter any issues, refer to the Apache Airflow documentation for troubleshooting tips and guidance.
- Check the logs of Airflow services for error messages:
docker compose logs <service_name>