Tokyo Olympics Data Analytics | Azure End-To-End Data Engineering Project

💻 Description

The Tokyo Olympic Data Analytics Project is a comprehensive solution for analyzing and visualizing the 2021 Tokyo Olympics data. It demonstrates how to build an end-to-end data analytics pipeline on the Azure cloud platform by integrating Azure Databricks, Azure Data Factory, and other Azure resources.

It involves ingesting raw Olympic data, transforming it into a suitable format, performing analysis, and creating insightful visualizations. Starting with a CSV on GitHub, the data is ingested into the Azure ecosystem via Azure Data Factory. It is initially stored in Azure Data Lake Storage Gen2 and then transformed in Azure Databricks. Once again housed in ADLS Gen2, the enriched data undergoes advanced analytics in Azure Synapse. The insights are finally visualized in Azure Synapse or Power BI, offering a comprehensive dataset view.

🌐 Architecture

📊 Dataset Used

The dataset for the Tokyo Olympics 2020 contains information about 11,000 athletes, 47 disciplines, and 743 teams, offering a comprehensive view of the event.

Source(Kaggle): 2021 Olympics in Tokyo

🔨 Azure Services Used

Azure Data Factory: For data ingestion from GitHub.
Azure Data Lake Storage Gen2: As the primary data storage solution.
Azure Databricks: For data transformation tasks.
Azure Synapse Analytics: To perform in-depth data analytics.

🔁 Workflow

1) Initial Setup

Create Azure Free Subscription acoount
Create a Resource Group 'tokyo-olympic-data' to house and manage all the Azure resources associated with this project.
Within the created resource group,set up a storage account. This is specifically configured to leverage Azure Data Lake Storage(ADLS) Gen2 capabilities.
Create a Container inside this storage account to hold the project's data. Two directories 'raw-data' and 'transformed-data' are created to store raw data and transformed data.

2) Data Ingestion using Azure Data Factory

Begin by creating an Azure Data Factory workspace within the previously established resource group.
After setting up the workspace, launch the Azure Data Factory Studio.
Upload the Tokyo Olympics dataset from kaggle to GitHub.
Within the studio, initialize a new data integration pipeline. Now use the task Copy Data to move data efficiently between various supported sources and destinations.
Configuring the Data Source with HTTP template as we are using http request to get the data from GitHub repo.
Establishing the Linked Service for source.
Configuring the File Format for and setting up the Linked Service Sink.
Repeat above steps to load all the datasets.
You can connect all the copy data activity together and run them all at once.

10. After the pipeline completes its execution, navigate to your Azure Data Lake Storage Gen2. Dive into the "raw-data" folder and validate that the files, like "athletes.csv", "medals.csv", etc., are present and populated with the expected data.

3) Data Transformation using Azure Databricks

Navigate to Azure Databricks within the Azure portal and create a workspace within the previously established resource group and launch it.
Configuring Compute in Databricks
Create a new notebook within Databricks and rename it appropriately, reflecting its purpose or the dataset it pertains to.
Establishing a Connection to Azure Data Lake Storage (ADLS)
Using the credentials (Client ID, Tenant ID, Secret), write the appropriate code in the Databricks notebook to mount ADLS.
Writing Data Transformations mount ADLS Gen2 to Databricks.
Writing Transformed Data to ADLS Gen2.

• Refer to the notebook below for the transformations and code used to mount ADLS Gen2 to Databricks.

Tokyo Olympics Transformation.ipynb

4) Setting Up and Using Azure Synapse Analytics

Creating a Synapse Analytics Workspace.
Within Workspace navigate to the "Data" section , choose "Lake Database" and create a Database "Tokyo_DB"
Creating Table from Data Lake from the Transformed Data folder within your ADLS Gen2 storage.

5) Performing Data Analysis on the Data

Create SQL script to Perform Exploratory data analysis using SQL. You can also use PowerBI to generate your analysis reports.

Query 1

Query 2

Query 3

Query 4

Query 5

Query 6

Query 7

• Refer to the SQL scripts used for data analysis. Tokyo Olympics SQL script.sql

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
Images		Images
Output		Output
data		data
.gitattributes		.gitattributes
README.md		README.md
Tokyo Olympic Transformation.ipynb		Tokyo Olympic Transformation.ipynb
Tokyo Olympics SQL script.sql		Tokyo Olympics SQL script.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Images

Images

Output

Output

data

data

.gitattributes

.gitattributes

README.md

README.md

Tokyo Olympic Transformation.ipynb

Tokyo Olympic Transformation.ipynb

Tokyo Olympics SQL script.sql

Tokyo Olympics SQL script.sql

Repository files navigation

Tokyo Olympics Data Analytics | Azure End-To-End Data Engineering Project

💻 Description

🌐 Architecture

📊 Dataset Used

🔨 Azure Services Used

🔁 Workflow

1) Initial Setup

2) Data Ingestion using Azure Data Factory

3) Data Transformation using Azure Databricks

4) Setting Up and Using Azure Synapse Analytics

5) Performing Data Analysis on the Data

About

Releases

Packages

Languages

sanithps98/Tokyo-Olympics-Data-Analytics

Folders and files

Latest commit

History

Repository files navigation

Tokyo Olympics Data Analytics | Azure End-To-End Data Engineering Project

💻 Description

🌐 Architecture

📊 Dataset Used

🔨 Azure Services Used

🔁 Workflow

1) Initial Setup

2) Data Ingestion using Azure Data Factory

3) Data Transformation using Azure Databricks

4) Setting Up and Using Azure Synapse Analytics

5) Performing Data Analysis on the Data

About

Resources

Stars

Watchers

Forks

Languages