bryan-md/Data-Science-Projects

Data Science Projects

This repository houses all code, data, and files related to personal projects and my work in the Springboard Data Science Program. The following acts as a table of contents for the whole repository.

Predicting Nitrogen Pollution in the Chesapeake Bay


Key Skills

  • Correlation Plots
  • Geospatial Analysis (Geopandas, QGIS, geopy)
  • Feature Engineering
  • Data Wrangling
  • SHAP & Permutation Feature Importance
  • Ensemble Modeling

WINNING ENTRY: A social good hackathon hosted by Booz Allen Hamilton and the Chesapeake Monitoring Cooperative (CMC) to explore data monitoring the health of the Chesapeake Bay watershed. The challenge was to build a predictive model or correlation analysis that explains patterns in water quality indicator assessments for a section of the Chesapeake Bay, using CMC monitoring data. Entries were judged on robustness, scalability, and creativity by 16 expert judges representing leadership from 7 organizations, with expertise across environmental science and modeling, data science, machine learning, and human-centered design.

See the Devpost submission here

Watch the hackthebay submission

Seattle Terry Stop Data


Key Skills

  • Bokeh Plot Visuals / Dashboard Creation
  • Imbalanced Dataset Handling
  • Bootstrap Statistical Analysis
  • Multi Class Classification

Proof-of-concept modeling to determine whether the race of a stopped subject can be predicted from the race of the stopping officer, and whether a frisk can be predicted from the demographics of the officer and subject.

Clustering Methods


Key Skills

  • K-Means
  • PCA - Principal Component Analysis
  • Silhouette Method
  • Elbow Sum of Squares Method

Mini project on customer segmentation and identifying unknown relationships among customers. The more you know about your customers, the more you can personalize your service! The dataset contains information on marketing newsletters/e-mail campaigns (e-mail offers sent) and transaction-level data from customers (which offers customers responded to and what they bought).
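
The Elbow Sum of Squares Method listed above can be sketched roughly as follows. This is a minimal, standard-library illustration of Lloyd's k-means and its within-cluster sum of squares, not the notebook's actual code; the function name `kmeans_sse` and the toy points are assumptions made for the example.

```python
import random


def kmeans_sse(points, k, n_iter=50, seed=0):
    """Run Lloyd's k-means once and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random points as initial centers

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers

    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
for k in range(1, 5):
    print(k, kmeans_sse(toy_points, k))
```

Plotting the printed sum of squares against k and looking for the bend gives the "elbow". A single random initialization can settle in a local minimum, so in practice several restarts (or a library implementation) would be the safer choice.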

Exploratory Data Analysis (EDA)


Several EDAs performed on varying data categories.

Hospital Readmittance statistically reviews a previously completed analysis of hospital readmission data to critique its validity. The context is that the Centers for Medicare & Medicaid Services (CMS) began reducing Medicare payments to Inpatient Prospective Payment System hospitals with excess readmissions.

Human Temperature EDA uses bootstrap statistics to estimate the true average human body temperature for both males and females.
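
As a rough illustration of the bootstrap idea, the sketch below computes a percentile confidence interval for a mean using only the standard library. The function name `bootstrap_mean_ci` and the sample readings are hypothetical and are not taken from the project's dataset or notebook.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical temperature readings in degrees Fahrenheit (illustration only).
readings = [98.2, 98.6, 97.9, 98.4, 99.0, 98.1, 98.7, 98.3]
low, high = bootstrap_mean_ci(readings)
print(f"95% bootstrap CI for the mean: [{low:.2f}, {high:.2f}]")
```

Running the same function separately on male and female readings would give one interval per group, which is one plausible way to compare them.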

Racial Discrimination performs a statistical analysis of whether race has a meaningful impact on the callback rate for resumes.
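
One way a difference in callback rates could be tested is a two-proportion z-test, sketched below with the standard library. The function name `two_proportion_z_test` and the counts are hypothetical and chosen only for illustration; they are not the study's figures, and the notebook may use a different test.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: callbacks and resumes sent for two groups.
z, p = two_proportion_z_test(60, 100, 40, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A small p-value would suggest the two callback rates differ by more than chance alone would explain, under the usual large-sample assumptions.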

Key Skills

  • Central Limit Theorem
  • Statistical Analysis
  • Data Visualization
  • z-test
  • t-test
  • Margin of Error (MOE)
  • Chi-Squared Test
  • Bootstrap Statistics
  • Hypothesis Testing

Machine Learning Algorithms

Key Skills

  • Logistic Regression
  • Hyperparameter Tuning
  • K-Fold CV
  • Linear Regression
  • Metric Evaluation
  • Residual Plot
  • Influence Plot
  • Naive Bayes
  • NLP
  • Tokenization
  • TF-IDF
  • n-grams

Several machine learning algorithms applied in mini projects: labeling an observation as male or female based on height and weight data (Logistic Regression), estimating prices on the Boston Housing data (Linear Regression), and predicting rotten/fresh from critic reviews (Naive Bayes).

PySpark

Several exercises utilizing MapReduce with PySpark (RDD), with a touch of MLlib.

Key Skills

  • PySpark
  • RDD
  • Spark Dataframes

SQL

Key Skills

  • SQL
  • Time Series Analysis
  • SQLAlchemy

This is a SQL project that uses SQLAlchemy and PyMySQL to connect to a MySQL server and import data using SQL in Python.

JSON

Key Skills

  • JSON Manipulation and Extraction
  • Applied Plotting and Charting

An exercise in data extraction and exploration using a JSON data source.

Take Home Data Challenges

Defining an "adopted user" as a user who has logged into a product on three separate days in at least one seven-day period, identify which factors predict future user adoption.

  • Exploratory Data Analysis
  • Experiment and metrics design
  • Predictive modeling and recommendations
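
The "adopted user" definition above can be checked with a short sliding-window test. This is a minimal sketch, assuming each user's logins are available as a list of `datetime.date` values; the function name `is_adopted` and its parameters are hypothetical and not taken from the challenge code.

```python
from datetime import date, timedelta


def is_adopted(login_dates, min_days=3, window_days=7):
    """True if logins fall on `min_days` distinct days within some `window_days`-day period."""
    days = sorted(set(login_dates))  # distinct calendar days, in order
    for i in range(len(days) - min_days + 1):
        # Compare the first and last day of each run of `min_days` distinct days.
        if days[i + min_days - 1] - days[i] < timedelta(days=window_days):
            return True
    return False


# Hypothetical login history for a single user.
logins = [date(2024, 1, 1), date(2024, 1, 3), date(2024, 1, 7)]
print(is_adopted(logins))
```

Because the days are sorted and de-duplicated, only consecutive runs of three distinct days need to be checked. The resulting boolean can then serve as the target label for the predictive modeling step.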

Key Skills

  • Full Stack Data Scientist