Traffic_Accidents_Predictive_Modeling

Description

The dataset contains 3.5 million U.S. traffic accident records (2016 to June 2020). I focused on the 816,000 records from California to build and optimize a multi-class classification model that predicts accident severity.

Severity is rated 1-4, based on the impact on traffic (1 = low impact, 4 = high impact).

The severity distribution was highly imbalanced: about 70% Severity 2, 28% Severity 3, and 1% each for Severity 1 and 4.
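A minimal sketch of how a class distribution like this can be checked with pandas. The toy DataFrame and the `Severity` column name are assumptions for illustration, not the actual data.

```python
import pandas as pd

# Toy stand-in for the California subset (assumed column name: "Severity").
df = pd.DataFrame({"Severity": [2] * 70 + [3] * 28 + [1] + [4]})

# Share of each severity class as a percentage of all rows.
severity_share = df["Severity"].value_counts(normalize=True).sort_index() * 100
print(severity_share.round(1))
```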

To avoid multicollinearity, I removed columns whose values are only known as a result of an accident: 'end time', 'end lat/long', and 'distance'.
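Dropping such columns could look roughly like the sketch below. The exact column names (`End_Time`, `End_Lat`, `End_Lng`, `Distance(mi)`) and the toy rows are assumptions.

```python
import pandas as pd

# Toy rows; column names are assumed, not taken from the notebooks.
df = pd.DataFrame({
    "Severity": [2, 3],
    "Start_Lat": [37.77, 34.05],
    "End_Time": ["2019-01-01 09:00:00", "2019-06-01 18:30:00"],
    "End_Lat": [37.78, 34.06],
    "End_Lng": [-122.41, -118.24],
    "Distance(mi)": [0.5, 1.2],
})

# Columns that are only filled in once an accident has already happened.
post_accident_cols = ["End_Time", "End_Lat", "End_Lng", "Distance(mi)"]

# errors="ignore" skips any listed name that is absent from the frame.
features = df.drop(columns=post_accident_cols, errors="ignore")
print(features.columns.tolist())
```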

Also, 94% of Severity 1 accidents occurred in 2020, so I dropped all data from 2020. Otherwise, oversampling would amplify this disproportion and make 'Year' an unreasonably strong deciding factor in the prediction. Doing so reduced the number of Severity 1 accidents even further, leaving only 235 rows.
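The check and the filter can be sketched as follows. The `Start_Time` column name, the derived `Year` column, and the toy rows are assumptions, so the numbers printed here are not the ones reported above.

```python
import pandas as pd

# Toy rows; "Start_Time" is an assumed column name.
df = pd.DataFrame({
    "Severity": [1, 1, 2, 3, 1],
    "Start_Time": pd.to_datetime([
        "2020-03-01 08:00", "2020-05-10 17:30", "2018-07-04 12:00",
        "2019-11-23 22:15", "2017-02-14 06:45",
    ]),
})
df["Year"] = df["Start_Time"].dt.year

# Fraction of Severity 1 rows that fall in 2020.
sev1 = df[df["Severity"] == 1]
share_2020 = (sev1["Year"] == 2020).mean()

# Keep only rows before 2020, then count the Severity 1 rows that remain.
pre_2020 = df[df["Year"] < 2020]
sev1_remaining = int((pre_2020["Severity"] == 1).sum())
print(round(share_2020, 2), len(pre_2020), sev1_remaining)
```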

Because of this, training the model on Severity 1 became less effective, but in most cases it is more useful to predict Severity 4 accidents than Severity 1.

The Weather_Condition column has 72 unique categorical values, which I simplified into 3 categories: good (0), mild (1), and bad (2). After this, 54.32% of all records fell under 'good', 36.28% under 'mild', and 9.4% under 'bad'.
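One way to collapse free-text weather labels into three codes is a keyword lookup. The keyword lists below are hypothetical, since the actual grouping is not listed here, and the sample labels are only illustrative.

```python
import pandas as pd

# Hypothetical keyword lists; the real good/mild/bad grouping may differ.
BAD_KEYWORDS = ("snow", "thunder", "heavy", "hail", "ice", "tornado")
MILD_KEYWORDS = ("rain", "drizzle", "fog", "haze", "mist", "cloud", "overcast", "smoke")

def weather_category(condition):
    """Return 2 (bad), 1 (mild), or 0 (good) for a weather label."""
    text = str(condition).lower()
    if any(keyword in text for keyword in BAD_KEYWORDS):
        return 2
    if any(keyword in text for keyword in MILD_KEYWORDS):
        return 1
    return 0  # anything unmatched (including missing values) is treated as good

conditions = pd.Series(["Clear", "Light Rain", "Heavy Snow", "Mostly Cloudy", "Fair"])
weather_code = conditions.map(weather_category)

# Percentage of rows in each simplified category.
print((weather_code.value_counts(normalize=True).sort_index() * 100).round(2))
```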

I used a Decision Tree Classifier and a Random Forest Classifier as my initial models and calculated the overall accuracy score as well as per-class accuracy scores. After applying imblearn's SMOTE oversampling, the accuracy scores went down, as expected.

Lastly, I used GridSearchCV to find the best parameters for the Random Forest model, which resulted in a final accuracy score of 0.76.
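A small sketch of a grid search over Random Forest parameters. The parameter grid, the 3-fold setting, and the synthetic data are assumptions chosen to keep the example fast; they are not the values behind the 0.76 score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for the accident features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)

# Hypothetical grid; a real search would likely cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# Best parameter combination and its mean cross-validated accuracy.
print(search.best_params_, round(search.best_score_, 3))
```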

And here are the final feature importances:
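A ranking of this kind can be read from a fitted Random Forest through its `feature_importances_` attribute. The sketch below uses synthetic data and placeholder feature names, so its output does not reflect the real model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names; the real model would use the accident dataset's columns.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_classes=4, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```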

Next Steps:

- Try different variations of feature engineering:
  - What if we grouped the 72 Weather Conditions differently instead of 'good', 'mild', and 'bad'?
  - What if we grouped the severity levels into binary: 'low severity' vs. 'high severity'?
- Zoom in on a focus area, such as the Bay Area, a single county, or a city.
- Compare accuracy scores with different models, such as XGBoost or a neural network.
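As an illustration of the binary-severity idea, the target could be recoded as sketched below. The cut point (1-2 as low, 3-4 as high) is an assumption; other splits are possible.

```python
import pandas as pd

severity = pd.Series([1, 2, 2, 3, 4, 2, 3], name="Severity")

# Assumed cut: Severity 1-2 -> 0 (low), Severity 3-4 -> 1 (high).
high_severity = (severity >= 3).astype(int)
print(high_severity.value_counts().sort_index())
```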

Technologies

Python packages: pandas, NumPy, Matplotlib, Seaborn, Basemap, scikit-learn (sklearn), and imbalanced-learn (imblearn)

Data Source

https://arxiv.org/abs/1906.05409
https://arxiv.org/abs/1909.09638

Author Info

Email - edward.kim9280@gmail.com
LinkedIn - https://www.linkedin.com/in/edwardkim11/
