GitHub - dtmlinh/Traffic-Fatalities-HDFS: Looking at the fatality rates of traffic accidents in the US and which factors might impact these rates, leveraging several big data tools: AWS EMR cluster, HDFS, Hive, Spark, Hbase.

Project Description:

A project to look at the fatality rates of traffic accidents in the US and which factors might impact these rates. This project utilizes several big data tools: AWS EMR cluster, HDFS, Hive, Spark, HBase.

Contributors: Linh Dinh

Data Sources and Project Objectives

Bureau of transportation data:

- Actual fatal accidents data for 2016-2018
- Sampled non-fatal accidents data for 2016-2018 (NOT COMPLETE DATA COVERAGE)

I used these two data sources to train a Random Forest model that predicts "fatal cases". I then identified a few factors that the model (see 4. ML_spark.scala) suggests are "important":
- Weather
- Light condition: day vs. night
- Whether the accident occurred at a junction
- Day of the week
- Hour of day
- Etc.
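
As a rough illustration of how the factors listed above could be derived from a raw accident record before model training, here is a minimal plain-Python sketch. It is not the code in 4. ML_spark.scala (which uses Spark); the field names (`timestamp`, `weather`, `at_junction`) and the 6:00-18:00 day/night cutoff are assumptions made for this example only.

```python
from datetime import datetime


def derive_features(record):
    """Turn one raw accident record into model-ready features.

    All field names are hypothetical, not the repository's schema.
    """
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "weather": record["weather"].strip().lower(),
        # Crude day/night split by clock hour (assumed cutoff, illustration only).
        "is_night": ts.hour < 6 or ts.hour >= 18,
        "at_junction": bool(record["at_junction"]),
        "weekday": ts.weekday(),  # Monday is 0, Sunday is 6
        "hour": ts.hour,
    }


example = derive_features({
    "timestamp": "2017-03-04 22:15:00",
    "weather": " Rain ",
    "at_junction": 1,
})
```

Categorical values such as `weather` would still need to be indexed or one-hot encoded before being passed to a tree-based model.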

Kaggle data for total US self-reported accidents for 2016-2020: https://www.kaggle.com/sobhanmoosavi/us-accidents

I used this Kaggle dataset as the denominator in the fatality rate calculation (number of fatal accidents / number of total accidents), because the sampled non-fatal accidents data described above does not have complete coverage (i.e., it is randomly sampled from a selected number of locations).
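
The fatality rate calculation can be sketched in plain Python as follows. This is an illustration under assumptions, not the repository's Hive/Spark implementation: the record layout (dicts with `state` and `year` keys) and the function name `fatality_rates` are hypothetical.

```python
from collections import Counter


def fatality_rates(fatal_records, all_records):
    """Return {(state, year): fatal accidents / total accidents}.

    Numerator and denominator come from different datasets, so groups
    that have no total count are skipped rather than divided by zero.
    """
    fatal = Counter((r["state"], r["year"]) for r in fatal_records)
    total = Counter((r["state"], r["year"]) for r in all_records)
    return {
        key: fatal.get(key, 0) / n_total
        for key, n_total in total.items()
        if n_total > 0
    }


# Toy data: 2 fatal of 8 total in IL, 1 fatal of 5 total in TX (both 2017).
fatal_records = [{"state": "IL", "year": 2017}] * 2 + [{"state": "TX", "year": 2017}]
all_records = [{"state": "IL", "year": 2017}] * 8 + [{"state": "TX", "year": 2017}] * 5
rates = fatality_rates(fatal_records, all_records)
```

The same grouping applies to conditional rates (for example, night-time accidents only) by filtering both inputs on the condition before counting.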

Usage

The final output shows by State and Year:

- the fatality rate for several interesting conditions that might influence whether an accident is fatal or not: day vs. night time, at a junction, weather, etc.
- average number of minutes it takes injured persons to arrive at the hospital
- average number of hospitals within a 10 mile radius of the accident
- share of state spending on highway investments and health investments
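
One way the "hospitals within a 10 mile radius" count could be computed is with the haversine great-circle distance. The sketch below is an assumption about the approach, not the repository's batch-layer code; the function names and the `(latitude, longitude)` tuple layout are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def hospitals_within(accident, hospitals, radius_miles=10.0):
    """Count hospitals no farther than radius_miles from the accident."""
    return sum(
        1 for h in hospitals
        if haversine_miles(accident[0], accident[1], h[0], h[1]) <= radius_miles
    )


# Made-up coordinates: two hospitals nearby, one roughly 43 miles north.
accident = (41.88, -87.63)
hospitals = [(41.90, -87.62), (41.79, -87.60), (42.50, -87.63)]
nearby = hospitals_within(accident, hospitals)
```

Averaging this count over all accidents in a state and year gives the per-group figure. On a large dataset, a spatial index or a coarse bounding-box filter would avoid comparing every accident with every hospital.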

The application is packaged and deployed on a single AWS server using CodeDeploy.

Structure of the software

0. ingest_data.sh: Code to ingest the required data
1. create_truth_tables.hql: HQL queries to create ground truth tables in Hive
2. batch_layer.scala: Spark code to create batch layer tables in Hive
3. create_hbase_tables.hql: Code to create HBase tables for the serving layer
4. ML_spark.scala: ML code to train a random forest model
folder app: Java and HTML code to deploy the app on an AWS instance
