jenniferliangc/DataSciencePortfolio

DataSciencePortfolio

A collection of Data Science and Data Analysis projects to demonstrate my skill set.

Contents

  • Machine Learning

    Using Python and Pandas:

    • Bank Marketing Prediction: Trained several machine learning models to predict whether a client will subscribe to a term deposit based on marketing campaign data. Models include logistic regression, k-nearest neighbors, decision tree, random forest (optimized using cross-validation), and neural networks
    • Predicting Housing Prices: Built a linear model to predict house sale prices from a dataset with over 500,000 data points and 61 variables

    Using R:

    • Cities: Used a public dataset with aggregated data for each of Brazil's cities, such as population, GDP, and number of cars, among others, to create a linear model that predicts the population of each city
    • Baseball Analysis: Used libraries such as infer, ggplot2, and tidyverse to perform exploratory data analysis and create a linear model that predicts baseball wins using different independent variables
  • Data Analysis

    Using Python, Pandas, Matplotlib, Seaborn:

    • Bike Sharing: Analyzed bike sharing data from Washington, D.C. to gain insight into users' behavior
    • Text Analysis Using Twitter: Computed sentiment scores for tweets using the VADER lexicon to measure how positive or negative each tweet is
    • Tuscan RFM Marketing Analysis: Took a merchant's dataset and split customers into deciles to identify the most profitable customers based on their recency, frequency, and monetary values. Calculated gross profit and ROI across all customer segments

    Using SQL:

    • IMDb Analysis: Used SQL to analyze an IMDb dataset containing 4 tables with over 121k rows and 23 variables. Leveraged advanced SQL constructs such as JOIN, WITH, and CASE to combine information from multiple tables and answer questions such as "Who are the top 10 most prolific movie actors?" and "How does film length relate to ratings?"

    Using R:

    • Flights: Took a data set with 113k rows and 19 variables describing departing flights in the US, analyzed the impact that COVID-19 had on flights, and computed some general statistics
    • People's Park: Summarized anonymized responses from 1,250 students to a survey on the ongoing controversy over the People's Park project. The data set was provided by the Chancellor's Office at the University of California, Berkeley, and the survey was designed to capture students' perspectives
    • Hypothesis Testing Using P-Values: Permuted datasets and recomputed the statistic of interest to obtain p-values, then rejected or failed to reject the null hypothesis for different datasets using the infer library
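To give a sense of the permutation approach behind the Hypothesis Testing project, here is a minimal sketch in plain Python. It is not the project's code, which uses R's infer library. The function name `permutation_p_value`, the choice of a difference in means as the statistic, and the sample numbers are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Hypothetical helper for illustration only; the statistic and the
    defaults are assumptions, not taken from the project.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values, then re-split them into two groups
        # of the original sizes, as if the labels were assigned at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up measurements, purely for demonstration.
    control = [12, 15, 11, 14, 13, 16, 12, 15]
    treated = [18, 21, 17, 20, 19, 22, 18, 21]
    p = permutation_p_value(control, treated)
    alpha = 0.05
    verdict = "reject" if p < alpha else "fail to reject"
    print(f"p-value = {p:.4f}: {verdict} the null hypothesis")
```

The p-value is the share of shuffled splits whose statistic is at least as extreme as the observed one, so a small value suggests the observed difference would be unlikely if the group labels did not matter.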
