
ETL pipeline to analyze domestic flight departures in the U.S. in 2022. Questions about flight delays, weather influence, and airline reliability.


juansevargasc/2022-Departures


US Flight Departures - 2022

Newark Airport

Newark, New Jersey @nicolasjehly

👨🏻‍💻

Execution

  • Environment
conda create --name <env> --file requirements.txt
# or
pip install -r requirements.txt
  • Main Python file
python src/main.py

Description

This project explores features of US flight departures in 2022 through the analysis of weather conditions, cancellations, dates, locations and carriers, among others. First, however, it builds an ETL pipeline that preprocesses the different data sources and loads them into an OLAP database for BI consumption.

Table of contents

Data Engineering Stage

Objectives

  • Extract data from different sources. In this case the data comes from 5 CSV files, but two of them are moved into a relational database and another is converted to a JSON file to simulate different types of sources. See Prework.
  • Design a data schema that allows querying the data for BI purposes.
  • Create an ETL pipeline.
  • Clean the data by choosing which NaN (empty) values should be dropped.
  • Standardize names and establish naming conventions.
  • Test and enforce data types and schemas.
  • Build a star schema.
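
The cleaning, naming and star-schema objectives above can be illustrated with a minimal sketch. It uses only the Python standard library and is not the project's actual pipeline: the column names (`Carrier Code`, `Flight Date`, `Dep Delay`), the sample rows and the helper functions are all hypothetical.

```python
import re


def standardize_name(name):
    """Convert a column name to snake_case, e.g. 'Carrier Code' -> 'carrier_code'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def clean(rows, required):
    """Rename columns to snake_case and drop rows missing a required value."""
    cleaned = []
    for row in rows:
        row = {standardize_name(k): v for k, v in row.items()}
        if all(row.get(col) not in (None, "", "NaN") for col in required):
            cleaned.append(row)
    return cleaned


def build_star(rows):
    """Split flat flight rows into a carrier dimension and a flight fact table."""
    dim_carrier = {}  # natural key (carrier code) -> surrogate key
    fact_flights = []
    for row in rows:
        code = row["carrier_code"]
        # Reuse the surrogate key if the carrier was already seen.
        carrier_id = dim_carrier.setdefault(code, len(dim_carrier) + 1)
        fact_flights.append({
            "carrier_id": carrier_id,
            "flight_date": row["flight_date"],
            "dep_delay_min": int(row["dep_delay"]),  # enforce an integer type
        })
    return dim_carrier, fact_flights


# Hypothetical raw rows; the second one has an empty delay and is dropped.
raw = [
    {"Carrier Code": "AA", "Flight Date": "2022-01-01", "Dep Delay": "12"},
    {"Carrier Code": "DL", "Flight Date": "2022-01-01", "Dep Delay": ""},
    {"Carrier Code": "AA", "Flight Date": "2022-01-02", "Dep Delay": "-3"},
]
rows = clean(raw, required=["carrier_code", "flight_date", "dep_delay"])
dim_carrier, fact_flights = build_star(rows)
```

The fact table keeps only measures and foreign keys, while descriptive attributes would live in the dimension; a real schema would likely add further dimensions such as date, airport and weather.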

Data Analysis Stage

Objectives

  • Ask interesting questions such as:
    • Is there a correlation between delays and weather?
    • How many flights did a given airline operate during the year?
    • What is the most common route? Does weather have an impact on a route?
  • Explore the data and characterize some columns.
  • Compute some statistics:
    • What is the average number of flights per day?
    • How many flights are delayed per day?
    • Do the weather events follow a normal distribution, or another type of distribution?
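
As a rough sketch of how some of these statistics could be computed, the snippet below counts flights and delayed flights per day and computes a Pearson correlation between delay and a weather measure. The field names (`dep_delay_min`, `precip_in`), the sample records and the 15-minute delay threshold are assumptions for illustration, not values taken from the dataset.

```python
from collections import Counter
from math import sqrt
from statistics import mean


def flights_per_day(flights):
    """Count flights for each departure date."""
    return Counter(f["flight_date"] for f in flights)


def delayed_per_day(flights, threshold_min=15):
    """Count flights per date whose departure delay meets the threshold."""
    return Counter(
        f["flight_date"] for f in flights if f["dep_delay_min"] >= threshold_min
    )


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical records: delay in minutes, precipitation in inches.
flights = [
    {"flight_date": "2022-01-01", "dep_delay_min": 20, "precip_in": 0.5},
    {"flight_date": "2022-01-01", "dep_delay_min": 0, "precip_in": 0.0},
    {"flight_date": "2022-01-02", "dep_delay_min": 40, "precip_in": 1.0},
    {"flight_date": "2022-01-02", "dep_delay_min": 10, "precip_in": 0.25},
]

per_day = flights_per_day(flights)
avg_flights_per_day = mean(list(per_day.values()))
delayed = delayed_per_day(flights)
r = pearson(
    [f["dep_delay_min"] for f in flights],
    [f["precip_in"] for f in flights],
)
```

A correlation near 1 or -1 would only indicate a linear association in the sample; it says nothing about causation or about the distribution of the weather events themselves.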

1. Data Engineering Stage

Introduction

The project aims to analyze the files that are given in this dataset: 2022 U.S. Domestic Flights Departures

Kaggle Dataset Flight Dep.

Author: Jacky Luo

Prework

The prework takes some of the original files and exports them to a SQL database and a JSON file, to simulate having different data sources in the project. See more in Prework.
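
A minimal sketch of this kind of prework, assuming SQLite as the SQL database and a small made-up carriers file, could look like the following. The repository's actual prework scripts, database engine, table names and columns may differ.

```python
import csv
import io
import json
import sqlite3


def csv_to_sqlite(csv_text, conn, table):
    """Load CSV text into a new SQLite table (all columns stored as TEXT)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames
    rows = list(reader)
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in rows],
    )
    conn.commit()
    return len(rows)


def csv_to_json(csv_text):
    """Convert CSV text into a JSON array of records."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


# Hypothetical sample standing in for one of the original CSV files.
SAMPLE_CSV = "carrier_code,carrier_name\nAA,American Airlines\nDL,Delta Air Lines\n"

conn = sqlite3.connect(":memory:")  # in-memory database for the example
loaded = csv_to_sqlite(SAMPLE_CSV, conn, "carriers")
as_json = csv_to_json(SAMPLE_CSV)
```

With the sources split this way, the extract step of the pipeline has to read from a database, a JSON file and plain CSV files, which is the variety the prework is meant to simulate.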


Documentation of Stages

Star Schema for project.

Final Dim - Fact Schema
