Databricks - Perth City Properties

Table of Contents

  • Introduction
  • Prerequisites
  • Azure-Storage-Solutions
  • Data-Orchestration
  • Databricks
  • DeltaLake
  • MLflow
  • MLflow Model Registry
  • DataSources
  • Technology
  • Contributors

Introduction

Project Outline:

A re-do of the Perth City Properties project using Azure Data Engineering technologies such as Azure Data Factory (ADF), Azure Data Lake Storage Gen2, Azure Blob Storage, and Azure Databricks.

In this project I'd like to:

  • Add data orchestration using Azure Data Factory
  • Perform data ingestion and transformation on the dataset using Databricks
  • Implement ML models in Databricks Machine Learning, track changes to ML notebooks and models with MLflow, and register the best model using the MLflow Model Registry

Prerequisites

An Azure subscription

Azure-Storage-Solutions

Creating an Azure Data Lake Gen2 storage account and containers

Azure Data Lake Gen2
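The screenshots show this done through the Azure portal. For reference, here is a minimal programmatic sketch using the azure-storage-file-datalake Python SDK that creates the raw container and also covers the upload step shown below; the account key and file name are placeholder assumptions, while perthpropdl is the storage account used in this project.

```python
# Hedged sketch: create the "raw" container in the perthpropdl ADLS Gen2
# account and upload a source CSV. Credential and file name are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://perthpropdl.dfs.core.windows.net",
    credential="<storage-account-key>",  # placeholder; use a real key or AAD credential
)

# Create the container (file system) that will hold the raw data
raw = service.create_file_system(file_system="raw")

# Upload a local CSV into the raw container
with open("perth_properties.csv", "rb") as f:
    raw.get_file_client("perth_properties.csv").upload_data(f, overwrite=True)
```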

Using Azure Storage Explorer to interact with the storage account

Storage explorer

Uploading data into raw folder

Access Control (IAM) role assignment

IAM - Storage

Data-Orchestration

Integrating data from Azure Data Lake Gen2 using Azure Data Factory.

Factory Resource Dataset

Creating dependencies between pipelines to orchestrate the data flow

  • I've created three pipelines:

    1. One runs the ingestion Databricks notebook, which reads the raw data, creates the bronze table, and then ingests it into the silver table.
    2. One creates the gold table.
    3. One orchestrates the previous two pipelines, ensuring the ingestion runs first and the transformation second.

    Pipeline dependencies

Branching and Chaining activities in Azure Data Factory (ADF) Pipelines using control flow activities such as Get Metadata, If Condition, ForEach, Delete, Validation, etc.

Branching and Chaining

Using Parameters and Variables in Pipelines, Datasets and Linked Services to create metadata-driven pipelines in Azure Data Factory (ADF)

Linked service

parameters

Debugging the data pipelines and resolving issues.

Debug

Scheduling pipelines using triggers - a Tumbling Window Trigger (for a past-time dataset) in Azure Data Factory (ADF)

Trigger

Creating ADF pipelines to execute Databricks Notebook activities to carry out transformations.

ADF - transformation
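For context, here is a minimal sketch of how such a notebook can pick up parameters passed from the ADF Databricks Notebook activity via widgets; the parameter name file_name and the mount path are assumptions, not values from the repo.

```python
# Hypothetical notebook cell: read a parameter passed by the ADF Databricks
# Notebook activity ("file_name" is an assumed parameter name).
dbutils.widgets.text("file_name", "")
file_name = dbutils.widgets.get("file_name")

# Assumed mount path for the raw container
df = spark.read.csv(f"/mnt/perthpropdl/raw/{file_name}", header=True, inferSchema=True)

# The exit value is returned to ADF as the activity's output
dbutils.notebook.exit(f"ingested {df.count()} rows from {file_name}")
```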

Enabling ADF Git integration

ADF - git integration

Databricks

Creating Azure Databricks Workspace

Databricks Workspace

Creating Databricks cluster

Databricks cluster

Mounting storage accounts using Azure Key Vault and Databricks Secret scopes

Mounting
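A minimal mounting sketch follows, assuming a Key Vault-backed secret scope named perth-scope that holds service principal credentials; the scope, secret names, and mount point are assumptions.

```python
# Hedged sketch: mount the raw container with OAuth credentials pulled from a
# Key Vault-backed Databricks secret scope. Scope/secret names are assumed.
tenant_id = dbutils.secrets.get(scope="perth-scope", key="tenant-id")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get(scope="perth-scope", key="client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="perth-scope", key="client-secret"),
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://raw@perthpropdl.dfs.core.windows.net/",
    mount_point="/mnt/perthpropdl/raw",
    extra_configs=configs,
)
```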

Creating Databricks notebooks

notebooks

Performing transformations using Databricks notebooks

ingestion

Enabling Databricks Git integration

git - databricks

DeltaLake

I've built a pipeline that runs Databricks notebooks to read data into Delta tables. Using the function below, it checks whether the data needs to be merged or inserted.

deltaTable - merge
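The screenshot above shows the project's actual function; here is a hedged reconstruction of the same merge-or-insert pattern using the Delta Lake Python API, with function and parameter names as illustrative assumptions.

```python
# Hedged reconstruction of the merge-or-insert helper: upsert into an existing
# Delta table, or create it on the first run. Names are illustrative.
from delta.tables import DeltaTable

def merge_delta(df, table_name, path, merge_key="id"):
    if DeltaTable.isDeltaTable(spark, path):
        # Table already exists: merge (upsert) the incoming rows
        target = DeltaTable.forPath(spark, path)
        (target.alias("t")
               .merge(df.alias("s"), f"t.{merge_key} = s.{merge_key}")
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())
    else:
        # First run: write the data and register the table
        df.write.format("delta").save(path)
        spark.sql(f"CREATE TABLE IF NOT EXISTS {table_name} USING DELTA LOCATION '{path}'")
```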

In this project, the pipeline reads the data from the raw folder in perthpropdl (Azure Data Lake Gen2 storage) and creates a perth_bronze Delta table. The bronze table is then used in the ingestion notebook to create the perth_silver table, and finally the gold tables are created from the perth_silver table. I didn't focus on the gold tables or the data visualisation beyond that; I just wanted to show how they can be created in a pipeline.
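A compact sketch of that flow is below; perth_bronze and perth_silver are the project's table names, while the paths, column names, and the gold aggregation are illustrative assumptions.

```python
from pyspark.sql import functions as F

# Source -> Bronze: land the raw CSV as-is in a Delta table
raw_df = spark.read.csv("/mnt/perthpropdl/raw/", header=True, inferSchema=True)
raw_df.write.format("delta").mode("overwrite").saveAsTable("perth_bronze")

# Bronze -> Silver: deduplicate and fix types (columns are assumed)
silver_df = (spark.table("perth_bronze")
                  .dropDuplicates()
                  .withColumn("price", F.col("price").cast("double")))
silver_df.write.format("delta").mode("overwrite").saveAsTable("perth_silver")

# Silver -> Gold: an example reporting aggregate
gold_df = (spark.table("perth_silver")
                .groupBy("suburb")
                .agg(F.avg("price").alias("avg_price")))
gold_df.write.format("delta").mode("overwrite").saveAsTable("perth_gold")
```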

Source -> Bronze

bronze

Bronze -> Silver

silver

Silver -> Gold

delta- gold

gold

MLflow

MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. In this project I've used these models to predict Perth property price ranges:

  • Linear Regression
  • Lasso
  • Ridge
  • Elasticnet

Again, the main purpose of this project was to show how to use MLflow, so there is room to improve the models.
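Before the individual runs below, here is a minimal sketch of the tracking pattern, using the Lasso alpha sweep as the example; the feature and target column names on the silver table are assumptions.

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Assumed feature/target columns on the silver table
pdf = spark.table("perth_silver").select("bedrooms", "bathrooms", "land_area", "price").toPandas()
X_train, X_test, y_train, y_test = train_test_split(
    pdf.drop(columns="price"), pdf["price"], random_state=42
)

# One MLflow run per alpha, logging the parameter, the metric, and the model
for alpha in [0.001, 0.01, 0.1, 1.0]:
    with mlflow.start_run(run_name=f"lasso_alpha_{alpha}"):
        model = Lasso(alpha=alpha).fit(X_train, y_train)
        r2 = r2_score(y_test, model.predict(X_test))
        mlflow.log_param("alpha", alpha)
        mlflow.log_metric("r2", r2)
        mlflow.sklearn.log_model(model, "model")
```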

Linear Regression

Lasso

Lasso - best alpha

Ridge

Ridge - best alpha

Elasticnet

Elasticnet - best alpha

MLflow Model Registry

It is a centralized model repository, and I used it to register the best model based on the current R² in the experiment UI:

experiment UI

Model Registry
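In the project this was done from the experiment UI; the equivalent can also be scripted. A hedged sketch, where the experiment path and registered model name are assumptions:

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Assumed experiment path: find the run with the best R2
experiment = client.get_experiment_by_name("/Shared/perth_properties")
best_run = client.search_runs(
    [experiment.experiment_id], order_by=["metrics.r2 DESC"], max_results=1
)[0]

# Register the best run's logged model under an assumed model name
mlflow.register_model(f"runs:/{best_run.info.run_id}/model", "perth-price-model")
```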

DataSources

http://house.speakingsame.com/

https://www.onthehouse.com.au/

https://www.propertyvalue.com.au/

Technology

Python

Contributors
