by Kevin Smith, 7/26/2022
The goal of this project is to develop a home price estimation model that performs better than the baseline prediction, and to deliver recommendations for how the model can be improved and deployed.
This goal will be accomplished utilizing the following steps:
- Planning
- Acquisition
- Prep
- Exploration
- Feature Engineering
- Modeling
- Delivery
- You will need an env.py file that contains the hostname, username, and password for the MySQL server that hosts the zillow database. Store that env.py file locally in the repository (and keep it out of version control).
- Clone my repo (including the acquire.py, prepare.py, and wrangle.py files).
- The libraries used are pandas, numpy, scipy, matplotlib, seaborn, and sklearn.
- You should now be able to run the zillow_final_report.ipynb file.
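The env.py file described above might look like the following minimal sketch. The variable names are assumptions based on the description, and the values are placeholders:

```python
# env.py — a minimal sketch; the values below are placeholders.
# Keep this file out of version control so credentials are never committed.
hostname = "your-db-hostname"
username = "your-username"
password = "your-password"
```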
There are two essential parts to any good plan: identifying your goals, and the steps necessary to reach them.
- Identify variables driving housing prices.
- Develop a model to make value predictions based on those variables.
- Deliver actionable takeaways
- Form an initial hypothesis
- Acquire and cache the dataset
- Clean, prep, and split the data to prevent data leakage
- Do some preliminary exploration of the data (including visualizations and statistical analyses)*
- Trim dataset of variables that are not statistically significant
- Determine which machine learning model performs the best
- Utilize the best model on the test dataset
- Create a final report notebook with streamlined code optimized for a technical audience
*at least 4 visualizations and 2 statistical analyses
Variable Name | Explanation | Values |
---|---|---|
bedrooms | The number of bedrooms in the house | Numeric value |
bathrooms | The number of bathrooms in the house | Numeric value |
quality | A numeric score based on quality of construction | Numeric value |
sq_feet | The total area inside the home | Numeric value |
pool | Whether or not the house has a pool | Yes=1/No=0 |
tax_value | The taxable value of the home in $USD | Numeric |
yearbuilt | The year in which the home was originally built | Year |
fips | A unique code specific to the county in which the home is located | Numeric |
The initial hypothesis can be based on a gut instinct or the first question that comes to mind when encountering a dataset.
Initial hypothesis number | hypothesis |
---|---|
Initial hypothesis 1 | Square footage drives up home value |
Initial hypothesis 2 | Age drives down home value |
Utilize the functions imported from acquire.py to create a DataFrame with pandas.
These functions will also cache the data to reduce execution time in the future should we need to create the DataFrame again.
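The caching behavior described above can be sketched as a small wrapper. The exact query and helper names live in acquire.py, so a generic `fetch` callable stands in here as an assumption:

```python
import os
import pandas as pd

def get_data_cached(fetch, cache_path="zillow.csv"):
    """Return a DataFrame, reading from a local CSV cache when present;
    otherwise call fetch() (e.g. a SQL pull) and write the cache so
    future runs skip the expensive database query."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    df = fetch()
    df.to_csv(cache_path, index=False)
    return df
```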
In this step we will utilize the functions in the wrangle.py file to get our data ready for exploration.
This means that we will be looking for columns that may be dropped because they are duplicates, and either dropping or filling rows that contain blanks, depending on how many there are.
This also means that we will be splitting the data into three separate DataFrames (train, validate, and test) in order to prevent data leakage from corrupting our exploration and modeling phases.
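One common way to implement the three-way split, assuming sklearn's `train_test_split` (the exact proportions in wrangle.py may differ):

```python
from sklearn.model_selection import train_test_split

def split_data(df, seed=42):
    """Split into train (~60%), validate (~20%), and test (~20%) sets
    so exploration and model tuning never touch the test data."""
    # First carve off 20% for test, then 25% of the remainder (20% overall)
    # for validate, leaving 60% for train.
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, validate = train_test_split(train_validate, test_size=0.25, random_state=seed)
    return train, validate, test
```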
This is the fun part! This is where we get to ask questions, form hypotheses based on the answers to those questions, and use our skills as data scientists to evaluate those hypotheses!
For example, in the Telco dataset I asked "Do people who pay more, churn more?" and unsurprisingly the answer was generally yes. This led me to the hypothesis that churn would have a dependent relationship with monthly charges, which hypothesis testing confirmed. However, I was able to find three other variables that did a better job of predicting churn.
In an effort to minimize the stress on our machine learning models, I created a function that performed a statistical analysis on each column, chosen based on the data type of the values in that column, to determine which columns, if any, were not statistically significant and therefore could be dropped.
I found that phone service had no statistical impact on churn and could be dropped.
I also found that total charges were drastically lower for customers who churned, even though their monthly charges were slightly higher on average. This is because most churn happened early in a customer's tenure. So, I dropped this column as well.
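A sketch of the dtype-driven significance testing described above, assuming a chi-square test for categorical columns and a two-sample t-test for numeric ones (the project's actual function may differ in its choice of tests):

```python
import pandas as pd
from scipy import stats

def drop_insignificant(df, target, alpha=0.05):
    """For each feature column, test its relationship with the (categorical)
    target: a Welch t-test for numeric columns, a chi-square test of
    independence for categorical ones. Drop columns whose p-value
    exceeds alpha."""
    keep = []
    for col in df.columns:
        if col == target:
            keep.append(col)
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # Compare the column's values across the two target groups.
            groups = [g[col].dropna() for _, g in df.groupby(target)]
            _, p = stats.ttest_ind(groups[0], groups[1], equal_var=False)
        else:
            # Test independence of the categorical column and the target.
            observed = pd.crosstab(df[col], df[target])
            _, p, _, _ = stats.chi2_contingency(observed)
        if p < alpha:
            keep.append(col)
    return df[keep]
```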
Here we determine the best model to use for predicting churn. I optimized the models for accuracy because the project specifically called for the most accurate model.
The Random Forest model performed the best, with an accuracy of 80% on the train data, 78% on the validate data, and 79% on the test data. This means it can be expected to perform with accuracy in the high 70s on future data.
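A minimal sketch of fitting and scoring a random forest as described above; the hyperparameters here are illustrative, not the tuned values used in the project:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def evaluate_rf(X_train, y_train, X_validate, y_validate, seed=42):
    """Fit a random forest on the train split and report accuracy on
    both train and validate, so overfitting shows up as a large gap."""
    model = RandomForestClassifier(max_depth=6, random_state=seed)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    validate_acc = accuracy_score(y_validate, model.predict(X_validate))
    return train_acc, validate_acc
```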
Here we will complete the goal of the project by delivering actionable suggestions to reduce monthly churn based on our identification of contributing factors.
Since the project stipulates that the month-to-month contract type is not going anywhere, my first suggestion is to offer a slight discount to customers who use one of the autopay options (bank transfer/credit card). This addresses both the fact that people who churn pay more per month on average, and the fact that people who pay by electronic check are more likely to churn than those using all of the other options combined.
My second suggestion is to automatically send an offer of one additional month's discount in exchange for filling out a survey, targeted at any user the model predicts will churn. This will likely entice people to stay at least one month longer, and will ultimately generate more data that can be used to build more accurate models.