waqarg2001/Formula1-Insights-DE: A Formula 1 race data engineering project that uses Azure services and Databricks to ingest and analyse the data.

Uses Azure cloud services to architect and orchestrate a data pipeline that performs ETL on the Formula 1 racing dataset extracted from the Ergast Developer API.

Overview • Tools • Architecture • ERD • Support • License

Overview

The Ergast Developer API is an experimental web service that provides a historical record of motor racing data for non-commercial purposes. The API provides data for the Formula One series, from the beginning of the world championships in 1950 until now.
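As a rough illustration of how data might be requested from the API, the sketch below builds request URLs following the layout described in the Ergast documentation (`/api/f1/<season>/<round>/<resource>.json` with `limit` and `offset` paging parameters). The helper names `ergast_url` and `page_offsets` are hypothetical and not taken from this repository, and the sketch only constructs URLs; it does not call the service, whose availability should be checked before relying on it.

```python
from urllib.parse import urlencode

# Base URL as given in the Ergast documentation (assumption: still current).
BASE_URL = "http://ergast.com/api/f1"


def ergast_url(resource, season=None, round_no=None, limit=30, offset=0):
    """Build a JSON request URL for an Ergast resource such as 'results'."""
    parts = [BASE_URL]
    if season is not None:
        parts.append(str(season))
        # A round only makes sense within a season.
        if round_no is not None:
            parts.append(str(round_no))
    parts.append(f"{resource}.json")
    query = urlencode({"limit": limit, "offset": offset})
    return "/".join(parts) + "?" + query


def page_offsets(total, limit):
    """Return the offsets needed to page through `total` records."""
    return list(range(0, total, limit))


if __name__ == "__main__":
    # Prints http://ergast.com/api/f1/2021/1/results.json?limit=30&offset=0
    print(ergast_url("results", season=2021, round_no=1))
```

The total record count needed by `page_offsets` would normally come from the first response, so a real ingestion script would fetch one page before computing the remaining offsets.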

This project takes Formula 1 data through an end-to-end pipeline built on Azure services. Data is extracted from the Ergast Developer API, and the pipeline relies on Azure Active Directory, a service principal, Azure Databricks, Key Vault, Azure Data Factory, and Azure Data Lake Storage Gen2. Within Azure Databricks, which runs on Apache Spark, the data goes through the ETL (Extract, Transform, Load) process in three stages. It is first received in the 'ingestion' folder, then refined and enriched in the 'transformations' folder, and finally organized and prepared for analysis in the 'analysis' folder. Azure Data Factory orchestrates the whole sequence.
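To make the role of the service principal and Key Vault more concrete, here is a minimal sketch of the OAuth settings commonly used to mount an ADLS Gen2 container in Databricks. The configuration keys are the standard ABFS driver settings; the function names, container name, and storage account name are hypothetical, and the repository's own set-up script may be organised differently.

```python
def build_oauth_configs(client_id, client_secret, tenant_id):
    """Return the Spark/Hadoop settings for service principal (OAuth) access."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_source(container, storage_account):
    """Return the abfss:// URI of a container in an ADLS Gen2 account."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"


# Inside a Databricks notebook (dbutils exists only there), the values would
# typically be read from a Key Vault-backed secret scope and then used to mount.
# The scope, key, and mount point below are placeholders, not repository values:
#
#   client_secret = dbutils.secrets.get(scope="<scope>", key="<secret-name>")
#   dbutils.fs.mount(
#       source=abfss_source("<container>", "<storage-account>"),
#       mount_point="/mnt/<storage-account>/<container>",
#       extra_configs=build_oauth_configs(client_id, client_secret, tenant_id),
#   )
```

Keeping the credentials in a secret scope rather than in the notebook means the client secret never appears in source control.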

The repository directory structure is as follows:

├── README.md                    <- The top-level README for developers using this project.
├── Raw                          <- Contains script to define table schemas.
├── Transformations              <- Scripts to aggregate and transform data.
├── analysis                     <- Basic analysis of data from the Transformations folder.
├── include                      <- Configuration folder.
│   ├── common_functions.py      <- Common functions used throughout the ETL process.
│   └── configuration.py         <- Houses configuration settings such as variables.
├── ingestion                    <- Ingestion scripts for data files from ADLS Gen2.
├── resources                    <- Resources for the README file.
├── set-up                       <- Script for mounting ADLS Gen2 to Databricks.
└── utils                        <- SQL scripts for incremental load.
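The incremental load handled by the SQL scripts in utils is not described further here. One common approach on Databricks is a Delta Lake `MERGE INTO` upsert, and the sketch below generates such a statement. This is an assumption about the technique, not a description of the repository's scripts, and the function, table, and column names are hypothetical.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, keys: Sequence[str]) -> str:
    """Build a Delta Lake MERGE statement that upserts `source` into `target`.

    Rows matching on every key column are updated; all others are inserted.
    """
    if not keys:
        raise ValueError("at least one key column is required")
    on_clause = " AND ".join(f"tgt.{key} = src.{key}" for key in keys)
    return (
        f"MERGE INTO {target} AS tgt\n"
        f"USING {source} AS src\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    # Hypothetical table and key names, for illustration only.
    print(build_merge_sql("f1_processed.results", "results_updates", ["result_id"]))
```

In a notebook the generated string could be passed to `spark.sql(...)`. Matching on a stable key means that re-running a load for the same race updates existing rows instead of duplicating them.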

Tools

To build this project, the following tools were used:

Azure Databricks
Azure Key Vault
Azure Active Directory
Azure Data Lake Storage Gen2
Azure Data Factory
PySpark
SQL
Git

Architecture

The architecture of this project is inspired by the following reference architecture, taken from the Azure Architecture Center.

ERD

The database structure is shown in the following ER Diagram and explained in the Database User Guide.

Support

If you have any doubts, queries, or suggestions, please connect with me on any of the following platforms:

License

This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator. If you remix, adapt, or build upon the material, you must license the modified material under identical terms.
