Skip to content

antoprince001/skill_etl

Repository files navigation

skill_etl

ETL Data pipeline project developed to process online job posts using Airflow, Spark, Postgres and Tableau.

Project Architecture

elt

Components of the ETL Pipeline

Extract

Job Posts is fetched from Google Jobs API based on job role and the data for each job role is stored as individual JSON file in raw folder.

Transform

JSON files for each job role is read using PySpark and the data is processed to fetch required data columns in required format. The cleaned data is then stored in parquet and CSV format in the processed folder.

Load

The CSV files are loaded into PostgreSQL database using Airflow copy_expert operation using Postgres Hook.

Pipelines

The project uses the following two DAGs (Directed Acyclic Graph)

  • Skill ETL setup pipeline

Setup pipelines runs the SQL script to create destination table in postgres db if it does not exist.

elt

  • Skill ETL pipeline

This pipeline extract the Extract, Transform and Load tasks to process the data from API ingestion layer to data storage layer.

elt

Reporting

A tableau dashboard is created in order to visualise the insights from the data stored in PostgreSQL. A PL/SQL function is created to convert the postgres table to csv which is then converted to excel file. This excel file is used as the source for Tableau visualisation. It is stored in reporting folder.

Link to Dashboard - https://public.tableau.com/app/profile/antony.prince.j/viz/skill_etl/Dashboard1?publish=yes

Dashboard 1

Configuration

The Config file in the project under config/ is used to configure the job roles to be queried and the path in which each stage of the pipeline should store the data during the Airflow DAG execution.

{
    "job_roles" : 
        [ 
            "data%20engineer%20india",
            "backend%developer%20india",
            "blockchain%developer%20india",
            "data%scientist%20india",
            "fullstack%developer%20india"
        ],
    "project_path": "/Users/antonyprincej/airflow/dags/skill_graph/",
    "raw" : {
        "folder" : "raw",
        "type" : "json"
    },
    "processed" : {
        "folder" : "processed",
        "type" : ["parquet","csv"]
    },
    "destination" : {
        "type" : "sql"
    }

}

About

ETL Data pipeline project developed to process online job posts using Airflow, Spark, Postgres and Tableau.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published