DataSciencePortfolio

A collection of Data Science and Data Analysis projects to demonstrate my skill set.

Machine Learning

Using Python and Pandas:
- Bank Marketing Prediction: Trained several machine learning models to predict whether a client will subscribe to a term deposit based on marketing campaign data. Models include logistic regression, k-nearest neighbors, decision tree, random forest (optimized using cross-validation), and neural networks
- Predicting Housing Prices in Cook County: Built a linear model to predict house sale prices from a dataset with over 500,000 data points and 61 variables
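To illustrate the cross-validation tuning mentioned for the Bank Marketing Prediction project, here is a minimal sketch in plain Python. It is not the project's code: it swaps in a tiny 1-D k-nearest-neighbors classifier instead of a random forest, and the data, the helper names (`k_fold_splits`, `knn_predict`, `cv_accuracy`), and the candidate values of `k` are all hypothetical.

```python
from collections import Counter


def k_fold_splits(n, folds):
    """Yield (train_indices, test_indices) for contiguous folds over range(n)."""
    start = 0
    for i in range(folds):
        size = n // folds + (1 if i < n % folds else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n) if j < start or j >= start + size]
        yield train, test
        start += size


def knn_predict(train_x, train_y, x, k):
    """Majority label among the k training points closest to x (1-D distance)."""
    nearest = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))[:k]
    return Counter(train_y[j] for j in nearest).most_common(1)[0][0]


def cv_accuracy(xs, ys, k, folds=5):
    """Fraction of held-out points classified correctly, pooled over all folds."""
    correct = 0
    for train, test in k_fold_splits(len(xs), folds):
        tx = [xs[j] for j in train]
        ty = [ys[j] for j in train]
        correct += sum(knn_predict(tx, ty, xs[j], k) == ys[j] for j in test)
    return correct / len(xs)


# Hypothetical toy data: one feature, binary label.
xs = [1, 9, 2, 8, 3, 7, 4, 6, 1.5, 8.5]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Score each candidate hyperparameter on held-out folds and keep the best one.
candidates = [1, 3, 5]
scores = {k: cv_accuracy(xs, ys, k) for k in candidates}
best_k = max(candidates, key=scores.get)
```

The same pattern applies to any model: every candidate setting is scored only on data it was not fitted to, and the setting with the best held-out score is kept.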
Using R:
- Cities: Used a public dataset with aggregated data for each of Brazil's cities, such as population, GDP, and number of cars, to create a linear model that predicts each city's population
- Baseball Analysis: Used libraries such as infer, ggplot2, and tidyverse to perform exploratory data analysis and create a linear model that predicts baseball wins using different independent variables

Data Analysis

Using Python, Pandas, Matplotlib, Seaborn:
- Bike Sharing: Analyzed bike sharing data from Washington, D.C. to gain insight into users' behavior
- Text Analysis Using Twitter: Scored tweet sentiment with the VADER lexicon to estimate how positive or negative each tweet is
- Tuscan RFM Marketing Analysis: Took a merchant's dataset and split customers into deciles to identify the most profitable customers based on their recency, frequency, and monetary values. Calculated gross profit and ROI across all customer segments
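The decile split described for the Tuscan RFM Marketing Analysis can be sketched roughly as follows. This is an illustration in plain Python (no Pandas, so it runs without dependencies), not the project's code; the customer records and the helpers `decile_scores` and `rfm_table` are hypothetical, and ties are broken by input order.

```python
def decile_scores(values, higher_is_better=True):
    """Return a 1-10 score per value; 10 marks the best decile."""
    n = len(values)
    # Worst value first, so rank 0 maps to score 1.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    scores = [0] * n
    for rank, idx in enumerate(order):
        scores[idx] = rank * 10 // n + 1
    return scores


def rfm_table(customers):
    """Map customer id -> (recency, frequency, monetary) decile scores."""
    ids = [c[0] for c in customers]
    # Fewer days since the last purchase is better, so recency is inverted.
    r = decile_scores([c[1] for c in customers], higher_is_better=False)
    f = decile_scores([c[2] for c in customers])
    m = decile_scores([c[3] for c in customers])
    return {cid: (ri, fi, mi) for cid, ri, fi, mi in zip(ids, r, f, m)}


# Hypothetical records: (id, days since last purchase, purchases, total spend).
customers = [
    ("c01", 3, 20, 900.0),
    ("c02", 40, 4, 150.0),
    ("c03", 15, 9, 420.0),
    ("c04", 90, 2, 80.0),
    ("c05", 7, 15, 700.0),
    ("c06", 60, 3, 110.0),
    ("c07", 25, 6, 300.0),
    ("c08", 120, 1, 40.0),
    ("c09", 10, 12, 560.0),
    ("c10", 200, 5, 60.0),
]

rfm = rfm_table(customers)
# Rank customers by combined score, best first.
ranked = sorted(rfm, key=lambda cid: sum(rfm[cid]), reverse=True)
```

Summing the three scores is only one way to combine them; weighting or concatenating the scores are common alternatives.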
Using SQL:
- IMDb Analysis: Used SQL to analyze an IMDb dataset containing 4 tables with over 121k rows and 23 variables. Leveraged advanced SQL features such as JOIN, WITH, and CASE to combine information from multiple tables and answer questions such as "Who are the top 10 most prolific movie actors?" and "How does film length relate to ratings?"
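As a rough illustration of how JOIN, WITH, and CASE can answer the two questions quoted for the IMDb Analysis, the sketch below runs SQL through Python's built-in `sqlite3` module. The schema (`movies`, `actors`, `roles`), the sample rows, and the length cutoffs are assumptions made for the example and do not reflect the actual IMDb tables used in the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, minutes INTEGER, rating REAL);
CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE roles  (movie_id INTEGER, actor_id INTEGER);
INSERT INTO movies VALUES (1, 'Film A', 85, 6.0), (2, 'Film B', 130, 8.0), (3, 'Film C', 95, 7.0);
INSERT INTO actors VALUES (1, 'Actor X'), (2, 'Actor Y');
INSERT INTO roles VALUES (1, 1), (2, 1), (3, 1), (2, 2);
""")

# WITH + JOIN: count roles per actor, then attach names and keep the top 10.
prolific = conn.execute("""
WITH role_counts AS (
    SELECT actor_id, COUNT(*) AS n_movies
    FROM roles
    GROUP BY actor_id
)
SELECT a.name, rc.n_movies
FROM role_counts AS rc
JOIN actors AS a ON a.id = rc.actor_id
ORDER BY rc.n_movies DESC, a.name
LIMIT 10
""").fetchall()

# CASE: bucket films by running time and compare average ratings per bucket.
by_length = conn.execute("""
SELECT CASE WHEN minutes < 90 THEN 'short'
            WHEN minutes < 120 THEN 'medium'
            ELSE 'long' END AS length_bucket,
       AVG(rating) AS avg_rating
FROM movies
GROUP BY length_bucket
ORDER BY avg_rating
""").fetchall()

conn.close()
```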
Using R:
- Flights: Took a data set with 113k rows and 19 variables describing departing flights in the US, analyzed the impact that COVID-19 had on flights, and computed some general statistics
- People's Park: Used a data set provided by the Chancellor's Office at the University of California, Berkeley, with anonymized survey responses from 1,250 students about their perspectives on the ongoing controversy over the People's Park project, and summarized the responses
- Hypothesis Testing Using P-Values: Permuted datasets and recomputed the statistic of interest to obtain p-values, then decided whether to reject or fail to reject the null hypothesis for several datasets, using the infer library
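The permutation approach behind the Hypothesis Testing project can be sketched as below. The project itself uses R's infer library; this is a plain Python stand-in for illustration only, with a hypothetical helper `permutation_p_value`, made-up sample values, and the absolute difference in group means chosen as the test statistic.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, reps=5000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(reps):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and re-split them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / reps


# Hypothetical samples with clearly different means.
p_separated = permutation_p_value([1, 2, 3, 4, 5], [101, 102, 103, 104, 105])
# Identical samples: the observed difference is zero.
p_identical = permutation_p_value([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
```

A small p-value means that a difference as large as the observed one is rare when labels are shuffled at random, which is evidence against the null hypothesis. A large p-value means the data give no grounds to reject it, which is not the same as showing the null hypothesis is true.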

jenniferliangc/DataSciencePortfolio

Folders and files

Latest commit

History

Repository files navigation

DataSciencePortfolio

Contents

Machine Learning

Data Analysis

About

Topics

Resources

Stars

Watchers

Forks

Languages