- This project involves the acquisition of the USA Presc Medicare datasets, which contain almost 1.3 million (13 lakh) records. The datasets are hosted on a local SFTP server. The primary objective is to load them from the SFTP server into the raw container in ADLS using Azure Data Factory pipelines and then transform them using Databricks.
- Azure Subscription
- Azure Data Factory (ADF Pipelines)
- Azure Data Lake Storage Gen2
- Azure Databricks Cluster
- Azure Key Vault
- Local SFTP Server
Set up an SFTP server on newer versions of Windows
- Press the Windows key and search for "Manage optional features"
- Click "Add a feature", search for OpenSSH Server, and install it
- Open "Services" and start the "OpenSSH SSH Server" and "OpenSSH Authentication Agent" services
- Create an appropriate firewall rule to expose SFTP port 22 to local or remote systems, if required
Using public key authentication with SFTP
- Generate a key pair using PuTTY Key Generator (PuTTYgen):
- It creates two keys: a private key (.ppk) and a public key
- Copy the public key to C:\Users\<username>\.ssh\authorized_keys. If the file is not present, create it at this location.
Edit the sshd_config file of the SFTP server to configure public key authentication
- The sshd_config file is located in the C:\ProgramData\ssh folder on the SFTP server
- Open it with a text editor such as Notepad or VS Code
- Modify sshd_config to make sure the following lines are present:
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
Subsystem sftp internal-sftp
ChrootDirectory "C:\<folder_name>"
- If the connecting user is in the Administrators group, also comment out the "Match Group administrators" block at the end of the default sshd_config, since it redirects key lookup to administrators_authorized_keys
- Restart the "OpenSSH SSH Server" service for the changes to take effect
Two different datasets were ingested from the local SFTP server into Data Lake Gen2:
- USA Presc Medicare Data (SFTP Server)
- US Cities Dimension Data (SFTP Server)
- Create a Linked Service To SFTP
- For Authentication type, select SSH public key authentication
- You can upload the private key directly. First convert the private .ppk file to OpenSSH format, because the ADF SFTP connector supports RSA/DSA keys in OpenSSH format. Use PuTTYgen for the conversion:
- In PuTTYgen, click "Load" and select your .ppk file
- Then go to "Conversions" in the menu and select "Export OpenSSH key"
- Upload the exported OpenSSH key to the linked service
- Alternatively, store the private key in Azure Key Vault as a secret. For this, first Base64-encode the OpenSSH key, then add the encoded string as a secret
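The Base64 encoding step can be done with a few lines of Python. This is a minimal sketch: the key body below is a stand-in, and in practice you would read the file exported from PuTTYgen instead.

```python
import base64

def encode_key_for_key_vault(key_bytes: bytes) -> str:
    """Base64-encode raw OpenSSH private key bytes into a single-line
    string suitable for storing as an Azure Key Vault secret value."""
    return base64.b64encode(key_bytes).decode("ascii")

# Stand-in key body; a real key would come from the exported file,
# e.g. open("my_openssh_key", "rb").read()
sample = b"-----BEGIN OPENSSH PRIVATE KEY-----\n...\n-----END OPENSSH PRIVATE KEY-----\n"
encoded = encode_key_for_key_vault(sample)
print(encoded)
```

Paste the printed string into the Key Vault secret; ADF decodes it when the linked service references the secret.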
- Create a Linked Service to Azure Key Vault
- Create a Linked Service to Azure Data Lake Storage Gen2
- Create an SFTP dataset pointing to the source folder location
- Create an ADLS Gen2 dataset pointing to the sink location
- Create a pipeline and use the Copy activity
- Calculate the number of ZIP codes in the usa_cities_dimension file
- Count the number of distinct prescribers and sum the total_claim_count for each city
- Do not report a city if no prescribers are assigned to it
- Filter to prescribers with 20 to 50 years of experience, and rank prescribers within each state by their total claim count
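The per-city reporting rules above can be sketched in plain Python over a few sample rows to make the logic concrete; on Databricks the same logic is a groupBy/agg in PySpark, and the column names used here (prescriber_id, city, total_claim_count) are assumptions about the dataset's schema.

```python
from collections import defaultdict

# Sample rows standing in for the prescriber dataset (schema assumed).
rows = [
    {"prescriber_id": "P1", "city": "Columbus",  "total_claim_count": 120},
    {"prescriber_id": "P2", "city": "Columbus",  "total_claim_count": 80},
    {"prescriber_id": "P1", "city": "Columbus",  "total_claim_count": 40},
    {"prescriber_id": "P3", "city": "Cleveland", "total_claim_count": 55},
]

# Per city: set of distinct prescribers and summed claim counts.
prescribers = defaultdict(set)
claims = defaultdict(int)
for r in rows:
    prescribers[r["city"]].add(r["prescriber_id"])
    claims[r["city"]] += r["total_claim_count"]

# Cities with no prescribers never enter the dicts, which satisfies the
# "do not report a city with no prescribers" rule.
report = {
    city: {"distinct_prescribers": len(ids), "total_claims": claims[city]}
    for city, ids in prescribers.items()
}
print(report)
# Columbus -> 2 distinct prescribers, 240 total claims
```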
- For the Bronze-to-Silver transformation:
- Load the files into DataFrames
- Select the required columns, check for null values, and clean and rename columns
- Write the cleaned files to the silver layer in Delta format
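A minimal plain-Python sketch of these cleaning steps; in the project they run as PySpark select/dropna/withColumnRenamed calls on Databricks, and the raw field names used below are assumptions.

```python
# Raw rows standing in for the bronze-layer file (field names assumed).
raw = [
    {"npi": "123", "presc_city": "Columbus", "trx_cnt": "10"},
    {"npi": "456", "presc_city": None,       "trx_cnt": "7"},  # null city -> dropped
]

KEEP = ["npi", "presc_city", "trx_cnt"]                 # required columns
RENAME = {"npi": "prescriber_id", "presc_city": "city",
          "trx_cnt": "total_claim_count"}               # silver-layer names

silver = []
for row in raw:
    picked = {k: row.get(k) for k in KEEP}              # select required columns
    if any(v is None for v in picked.values()):         # drop rows with nulls
        continue
    silver.append({RENAME[k]: v for k, v in picked.items()})  # rename columns

print(silver)
```

In PySpark the final write is `df.write.format("delta").save(path)` against the silver container.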
- For the Silver-to-Gold transformation:
- Load the Delta files into DataFrames
- Perform the transformations and save the results into the gold layer container
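The experience filter and per-state ranking from the requirements can be sketched as follows; on Databricks this maps to a filter plus a Window.partitionBy("state") rank in PySpark, and the column names here are assumptions about the schema.

```python
from itertools import groupby

# Sample silver-layer rows (schema assumed).
prescribers = [
    {"id": "P1", "state": "OH", "years_of_exp": 25, "total_claim_count": 240},
    {"id": "P2", "state": "OH", "years_of_exp": 30, "total_claim_count": 300},
    {"id": "P3", "state": "OH", "years_of_exp": 10, "total_claim_count": 999},  # filtered out
    {"id": "P4", "state": "NY", "years_of_exp": 45, "total_claim_count": 120},
]

# Keep only prescribers with 20-50 years of experience.
eligible = [p for p in prescribers if 20 <= p["years_of_exp"] <= 50]

# Rank by total_claim_count (descending) within each state.
eligible.sort(key=lambda p: (p["state"], -p["total_claim_count"]))
ranked = []
for state, grp in groupby(eligible, key=lambda p: p["state"]):
    for rank, p in enumerate(grp, start=1):
        ranked.append({**p, "rank_in_state": rank})

print(ranked)
# P4 is rank 1 in NY; P2 is rank 1 and P1 rank 2 in OH; P3 is excluded.
```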
- Azure Data Factory
- Azure Databricks (PySpark)
- Azure Storage Account
- Azure Data Lake Gen2
- Azure Key Vault
- PuTTY Key Generator