Skip to content

SteveRNewman/Predicting-Sales-Prices-of-Homes

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Final Project Submission

Please fill out:

Project Objectives

The objective of this project is to find the best combination of variables to predict the highest price a house in King County, WA can be sold for.

Questions to Answer

  1. Which processes can be automated by functions?
  2. How to prepare the variables for EDA/modeling?
  3. How to approach modeling a category with over 70 options (zipcodes)?
  4. Which variables should be eliminated due to correlation?
  5. What is the best method to remove outliers?
  6. Which variables should be selected for the model?

OBTAIN

!pip install -U fsds_100719
from fsds_100719.imports import *
import scipy.stats as stats
import seaborn as sns
from sklearn.preprocessing import StandardScaler
Requirement already up-to-date: fsds_100719 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (0.4.45)
Requirement already satisfied, skipping upgrade: pprint in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (0.1)
Requirement already satisfied, skipping upgrade: numpy in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (1.16.5)
Requirement already satisfied, skipping upgrade: pyperclip in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (1.7.0)
Requirement already satisfied, skipping upgrade: ipywidgets in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (7.5.1)
Requirement already satisfied, skipping upgrade: pandas-profiling in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (2.3.0)
Requirement already satisfied, skipping upgrade: wordcloud in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (1.6.0)
Requirement already satisfied, skipping upgrade: tzlocal in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (2.0.0)
Requirement already satisfied, skipping upgrade: seaborn in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (0.9.0)
Requirement already satisfied, skipping upgrade: missingno in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (0.4.2)
Requirement already satisfied, skipping upgrade: scikit-learn in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (0.21.2)
Requirement already satisfied, skipping upgrade: IPython in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (7.8.0)
Requirement already satisfied, skipping upgrade: scipy in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (1.3.1)
Requirement already satisfied, skipping upgrade: matplotlib in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (3.1.1)
Requirement already satisfied, skipping upgrade: pandas in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from fsds_100719) (0.25.1)
Requirement already satisfied, skipping upgrade: nbformat>=4.2.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from ipywidgets->fsds_100719) (4.4.0)
Requirement already satisfied, skipping upgrade: ipykernel>=4.5.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from ipywidgets->fsds_100719) (5.1.2)
Requirement already satisfied, skipping upgrade: traitlets>=4.3.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from ipywidgets->fsds_100719) (4.3.2)
Requirement already satisfied, skipping upgrade: widgetsnbextension~=3.5.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from ipywidgets->fsds_100719) (3.5.1)
Requirement already satisfied, skipping upgrade: confuse>=1.0.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pandas-profiling->fsds_100719) (1.0.0)
Requirement already satisfied, skipping upgrade: phik>=0.9.8 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pandas-profiling->fsds_100719) (0.9.8)
Requirement already satisfied, skipping upgrade: astropy in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pandas-profiling->fsds_100719) (3.2.3)
Requirement already satisfied, skipping upgrade: jinja2>=2.8 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pandas-profiling->fsds_100719) (2.10.1)
Requirement already satisfied, skipping upgrade: htmlmin>=0.1.12 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pandas-profiling->fsds_100719) (0.1.12)
Requirement already satisfied, skipping upgrade: pillow in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from wordcloud->fsds_100719) (6.1.0)
Requirement already satisfied, skipping upgrade: pytz in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from tzlocal->fsds_100719) (2019.2)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from scikit-learn->fsds_100719) (0.13.2)
Requirement already satisfied, skipping upgrade: pexpect; sys_platform != "win32" in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from IPython->fsds_100719) (4.7.0)
Requirement already satisfied, skipping upgrade: setuptools>=18.5 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from IPython->fsds_100719) (41.2.0)
Requirement already satisfied, skipping upgrade: backcall in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from IPython->fsds_100719) (0.1.0)
Requirement already satisfied, skipping upgrade: pickleshare in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from IPython->fsds_100719) (0.7.5)
Requirement already satisfied, skipping upgrade: jedi>=0.10 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from IPython->fsds_100719) (0.15.1)
Requirement already satisfied, skipping upgrade: decorator in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from IPython->fsds_100719) (4.4.0)
Requirement already satisfied, skipping upgrade: prompt-toolkit<2.1.0,>=2.0.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from IPython->fsds_100719) (2.0.9)
Requirement already satisfied, skipping upgrade: appnope; sys_platform == "darwin" in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from IPython->fsds_100719) (0.1.0)
Requirement already satisfied, skipping upgrade: pygments in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from IPython->fsds_100719) (2.4.2)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from matplotlib->fsds_100719) (0.10.0)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from matplotlib->fsds_100719) (1.1.0)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from matplotlib->fsds_100719) (2.4.2)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from matplotlib->fsds_100719) (2.8.0)
Requirement already satisfied, skipping upgrade: ipython_genutils in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from nbformat>=4.2.0->ipywidgets->fsds_100719) (0.2.0)
Requirement already satisfied, skipping upgrade: jsonschema!=2.5.0,>=2.4 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from nbformat>=4.2.0->ipywidgets->fsds_100719) (3.0.2)
Requirement already satisfied, skipping upgrade: jupyter_core in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from nbformat>=4.2.0->ipywidgets->fsds_100719) (4.5.0)
Requirement already satisfied, skipping upgrade: tornado>=4.2 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from ipykernel>=4.5.1->ipywidgets->fsds_100719) (6.0.3)
Requirement already satisfied, skipping upgrade: jupyter-client in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from ipykernel>=4.5.1->ipywidgets->fsds_100719) (5.3.3)
Requirement already satisfied, skipping upgrade: six in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from traitlets>=4.3.1->ipywidgets->fsds_100719) (1.12.0)
Requirement already satisfied, skipping upgrade: notebook>=4.4.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from widgetsnbextension~=3.5.0->ipywidgets->fsds_100719) (5.7.8)
Requirement already satisfied, skipping upgrade: pyyaml in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from confuse>=1.0.0->pandas-profiling->fsds_100719) (5.1.2)
Requirement already satisfied, skipping upgrade: pytest>=4.0.2 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from phik>=0.9.8->pandas-profiling->fsds_100719) (5.3.0)
Requirement already satisfied, skipping upgrade: nbconvert>=5.3.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from phik>=0.9.8->pandas-profiling->fsds_100719) (5.5.0)
Requirement already satisfied, skipping upgrade: pytest-pylint>=0.13.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from phik>=0.9.8->pandas-profiling->fsds_100719) (0.14.1)
Requirement already satisfied, skipping upgrade: numba>=0.38.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from phik>=0.9.8->pandas-profiling->fsds_100719) (0.46.0)
Requirement already satisfied, skipping upgrade: MarkupSafe>=0.23 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from jinja2>=2.8->pandas-profiling->fsds_100719) (1.1.1)
Requirement already satisfied, skipping upgrade: ptyprocess>=0.5 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pexpect; sys_platform != "win32"->IPython->fsds_100719) (0.6.0)
Requirement already satisfied, skipping upgrade: parso>=0.5.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from jedi>=0.10->IPython->fsds_100719) (0.5.1)
Requirement already satisfied, skipping upgrade: wcwidth in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from prompt-toolkit<2.1.0,>=2.0.0->IPython->fsds_100719) (0.1.7)
Requirement already satisfied, skipping upgrade: pyrsistent>=0.14.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->fsds_100719) (0.14.11)
Requirement already satisfied, skipping upgrade: attrs>=17.4.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.2.0->ipywidgets->fsds_100719) (19.1.0)
Requirement already satisfied, skipping upgrade: pyzmq>=13 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets->fsds_100719) (18.1.0)
Requirement already satisfied, skipping upgrade: terminado>=0.8.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->fsds_100719) (0.8.2)
Requirement already satisfied, skipping upgrade: Send2Trash in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->fsds_100719) (1.5.0)
Requirement already satisfied, skipping upgrade: prometheus-client in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets->fsds_100719) (0.7.1)
Requirement already satisfied, skipping upgrade: pluggy<1.0,>=0.12 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pytest>=4.0.2->phik>=0.9.8->pandas-profiling->fsds_100719) (0.12.0)
Requirement already satisfied, skipping upgrade: more-itertools>=4.0.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pytest>=4.0.2->phik>=0.9.8->pandas-profiling->fsds_100719) (7.0.0)
Requirement already satisfied, skipping upgrade: py>=1.5.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pytest>=4.0.2->phik>=0.9.8->pandas-profiling->fsds_100719) (1.8.0)
Requirement already satisfied, skipping upgrade: packaging in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pytest>=4.0.2->phik>=0.9.8->pandas-profiling->fsds_100719) (19.2)
Requirement already satisfied, skipping upgrade: importlib-metadata>=0.12; python_version < "3.8" in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pytest>=4.0.2->phik>=0.9.8->pandas-profiling->fsds_100719) (0.17)
Requirement already satisfied, skipping upgrade: mistune>=0.8.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from nbconvert>=5.3.1->phik>=0.9.8->pandas-profiling->fsds_100719) (0.8.4)
Requirement already satisfied, skipping upgrade: pandocfilters>=1.4.1 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from nbconvert>=5.3.1->phik>=0.9.8->pandas-profiling->fsds_100719) (1.4.2)
Requirement already satisfied, skipping upgrade: entrypoints>=0.2.2 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from nbconvert>=5.3.1->phik>=0.9.8->pandas-profiling->fsds_100719) (0.3)
Requirement already satisfied, skipping upgrade: testpath in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from nbconvert>=5.3.1->phik>=0.9.8->pandas-profiling->fsds_100719) (0.4.2)
Requirement already satisfied, skipping upgrade: bleach in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from nbconvert>=5.3.1->phik>=0.9.8->pandas-profiling->fsds_100719) (1.5.0)
Requirement already satisfied, skipping upgrade: defusedxml in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from nbconvert>=5.3.1->phik>=0.9.8->pandas-profiling->fsds_100719) (0.6.0)
Requirement already satisfied, skipping upgrade: pylint>=1.4.5 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pytest-pylint>=0.13.0->phik>=0.9.8->pandas-profiling->fsds_100719) (2.4.4)
Requirement already satisfied, skipping upgrade: llvmlite>=0.30.0dev0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from numba>=0.38.1->phik>=0.9.8->pandas-profiling->fsds_100719) (0.30.0)
Requirement already satisfied, skipping upgrade: zipp>=0.5 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from importlib-metadata>=0.12; python_version < "3.8"->pytest>=4.0.2->phik>=0.9.8->pandas-profiling->fsds_100719) (0.5.1)
Requirement already satisfied, skipping upgrade: html5lib!=0.9999,!=0.99999,<0.99999999,>=0.999 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from bleach->nbconvert>=5.3.1->phik>=0.9.8->pandas-profiling->fsds_100719) (0.9999999)
Requirement already satisfied, skipping upgrade: mccabe<0.7,>=0.6 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pylint>=1.4.5->pytest-pylint>=0.13.0->phik>=0.9.8->pandas-profiling->fsds_100719) (0.6.1)
Requirement already satisfied, skipping upgrade: isort<5,>=4.2.5 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pylint>=1.4.5->pytest-pylint>=0.13.0->phik>=0.9.8->pandas-profiling->fsds_100719) (4.3.21)
Requirement already satisfied, skipping upgrade: astroid<2.4,>=2.3.0 in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from pylint>=1.4.5->pytest-pylint>=0.13.0->phik>=0.9.8->pandas-profiling->fsds_100719) (2.3.3)
Requirement already satisfied, skipping upgrade: lazy-object-proxy==1.4.* in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from astroid<2.4,>=2.3.0->pylint>=1.4.5->pytest-pylint>=0.13.0->phik>=0.9.8->pandas-profiling->fsds_100719) (1.4.3)
Requirement already satisfied, skipping upgrade: wrapt==1.11.* in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from astroid<2.4,>=2.3.0->pylint>=1.4.5->pytest-pylint>=0.13.0->phik>=0.9.8->pandas-profiling->fsds_100719) (1.11.2)
Requirement already satisfied, skipping upgrade: typed-ast<1.5,>=1.4.0; implementation_name == "cpython" and python_version < "3.8" in /Users/srn/anaconda3/envs/learn-env/lib/python3.6/site-packages (from astroid<2.4,>=2.3.0->pylint>=1.4.5->pytest-pylint>=0.13.0->phik>=0.9.8->pandas-profiling->fsds_100719) (1.4.0)
pd.set_option('display.max_columns',0)
csv="https://raw.githubusercontent.com/learn-co-students/dsc-v2-mod1-final-project-online-ds-pt-100719/master/kc_house_data.csv"
df = pd.read_csv(csv)
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 10/13/2014 221900.0 3 1.00 1180 5650 1.0 NaN 0.0 3 7 1180 0.0 1955 0.0 98178 47.5112 -122.257 1340 5650
1 6414100192 12/9/2014 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 2170 400.0 1951 1991.0 98125 47.7210 -122.319 1690 7639
2 5631500400 2/25/2015 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 770 0.0 1933 NaN 98028 47.7379 -122.233 2720 8062
3 2487200875 12/9/2014 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 1050 910.0 1965 0.0 98136 47.5208 -122.393 1360 5000
4 1954400510 2/18/2015 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 1680 0.0 1987 0.0 98074 47.6168 -122.045 1800 7503
df.shape
(21597, 21)
df.describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 2.159700e+04 2.159700e+04 21597.000000 21597.000000 21597.000000 2.159700e+04 21597.000000 19221.000000 21534.000000 21597.000000 21597.000000 21597.000000 21597.000000 17755.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000
mean 4.580474e+09 5.402966e+05 3.373200 2.115826 2080.321850 1.509941e+04 1.494096 0.007596 0.233863 3.409825 7.657915 1788.596842 1970.999676 83.636778 98077.951845 47.560093 -122.213982 1986.620318 12758.283512
std 2.876736e+09 3.673681e+05 0.926299 0.768984 918.106125 4.141264e+04 0.539683 0.086825 0.765686 0.650546 1.173200 827.759761 29.375234 399.946414 53.513072 0.138552 0.140724 685.230472 27274.441950
min 1.000102e+06 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 3.000000 370.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 399.000000 651.000000
25% 2.123049e+09 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 1190.000000 1951.000000 0.000000 98033.000000 47.471100 -122.328000 1490.000000 5100.000000
50% 3.904930e+09 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 1975.000000 0.000000 98065.000000 47.571800 -122.231000 1840.000000 7620.000000
75% 7.308900e+09 6.450000e+05 4.000000 2.500000 2550.000000 1.068500e+04 2.000000 0.000000 0.000000 4.000000 8.000000 2210.000000 1997.000000 0.000000 98118.000000 47.678000 -122.125000 2360.000000 10083.000000
max 9.900000e+09 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000
df.distplot[]
  File "<ipython-input-21-dbb0f7b4373d>", line 1
    df.distplot[]
                ^
SyntaxError: invalid syntax

SCRUB

Functions

def check_column(df, col_name, n_unique=10, target='price'):
    
    print('DataType:')
    print('\t',df[col_name].dtypes)
    
    num_nulls = df[col_name].isna().sum()
    print(f'Null Values Present = {num_nulls}')
    
    display(df[col_name].describe().round(3))
    
    print('\nValue Counts:')
    display(df[col_name].value_counts(n_unique))
    
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16,4))
    df.plot(kind='scatter', x=col_name, y=target, ax=ax[0])
    
    sns.boxplot(df[col_name], ax=ax[1])
        
        
        
def check_column_object(df, col_name, n_unique=10, target='price'):
    
    print('DataType:')
    print('\t',df[col_name].dtypes)
    
    num_nulls = df[col_name].isna().sum()
    print(f'Null Values Present = {num_nulls}')
    
    stats = df[col_name].agg(['min','median','max'])
    print(f'Stats = {stats}')
    
    print('\nValue Counts:')
    display(df[col_name].value_counts(n_unique))

    

Variables

id

Drop ID as it does not have any statistical relevance to house values.

check_column(df,'id')
DataType:
	 int64
Null Values Present = 0



count    2.159700e+04
mean     4.580474e+09
std      2.876736e+09
min      1.000102e+06
25%      2.123049e+09
50%      3.904930e+09
75%      7.308900e+09
max      9.900000e+09
Name: id, dtype: float64



Value Counts:



795000620     0.000139
1825069031    0.000093
2019200220    0.000093
7129304540    0.000093
1781500435    0.000093
                ...   
7812801125    0.000046
4364700875    0.000046
3021059276    0.000046
880000205     0.000046
1777500160    0.000046
Name: id, Length: 21420, dtype: float64

png

date

Drop Date

df['date'] = pd.to_datetime(df['date'])
df['date'].dtype
dtype('<M8[ns]')
check_column_object(df,'date')
DataType:
	 datetime64[ns]
Null Values Present = 0
Stats = min   2014-05-02
max   2015-05-27
Name: date, dtype: datetime64[ns]

Value Counts:



2014-06-23    0.006575
2014-06-25    0.006066
2014-06-26    0.006066
2014-07-08    0.005880
2015-04-27    0.005834
                ...   
2014-07-27    0.000046
2015-03-08    0.000046
2014-11-02    0.000046
2015-05-15    0.000046
2015-05-24    0.000046
Name: date, Length: 372, dtype: float64

price

check_column(df,'price')
DataType:
	 float64
Null Values Present = 0



count      21597.000
mean      540296.574
std       367368.140
min        78000.000
25%       322000.000
50%       450000.000
75%       645000.000
max      7700000.000
Name: price, dtype: float64



Value Counts:



350000.0    0.007964
450000.0    0.007964
550000.0    0.007362
500000.0    0.007038
425000.0    0.006945
              ...   
870515.0    0.000046
336950.0    0.000046
386100.0    0.000046
176250.0    0.000046
884744.0    0.000046
Name: price, Length: 3622, dtype: float64

png

bedrooms

check_column(df,'bedrooms')
DataType:
	 int64
Null Values Present = 0



count    21597.000
mean         3.373
std          0.926
min          1.000
25%          3.000
50%          3.000
75%          4.000
max         33.000
Name: bedrooms, dtype: float64



Value Counts:



3     0.454878
4     0.318655
2     0.127796
5     0.074131
6     0.012594
1     0.009075
7     0.001760
8     0.000602
9     0.000278
10    0.000139
11    0.000046
33    0.000046
Name: bedrooms, dtype: float64

png

bathrooms

check_column(df,'bathrooms')
DataType:
	 float64
Null Values Present = 0



count    21597.000
mean         2.116
std          0.769
min          0.500
25%          1.750
50%          2.250
75%          2.500
max          8.000
Name: bathrooms, dtype: float64



Value Counts:



2.50    0.248970
1.00    0.178312
1.75    0.141131
2.25    0.094782
2.00    0.089364
1.50    0.066907
2.75    0.054869
3.00    0.034866
3.50    0.033847
3.25    0.027272
3.75    0.007177
4.00    0.006297
4.50    0.004630
4.25    0.003658
0.75    0.003287
4.75    0.001065
5.00    0.000972
5.25    0.000602
5.50    0.000463
1.25    0.000417
6.00    0.000278
5.75    0.000185
0.50    0.000185
8.00    0.000093
6.25    0.000093
6.75    0.000093
6.50    0.000093
7.50    0.000046
7.75    0.000046
Name: bathrooms, dtype: float64

png

sqft_living

check_column(df,'sqft_living')
DataType:
	 int64
Null Values Present = 0



count    21597.000
mean      2080.322
std        918.106
min        370.000
25%       1430.000
50%       1910.000
75%       2550.000
max      13540.000
Name: sqft_living, dtype: float64



Value Counts:



1300    0.006390
1400    0.006251
1440    0.006158
1660    0.005973
1010    0.005973
          ...   
4970    0.000046
2905    0.000046
2793    0.000046
4810    0.000046
1975    0.000046
Name: sqft_living, Length: 1034, dtype: float64

png

sqft_lot

check_column(df,'sqft_lot')
DataType:
	 int64
Null Values Present = 0



count      21597.000
mean       15099.409
std        41412.637
min          520.000
25%         5040.000
50%         7618.000
75%        10685.000
max      1651359.000
Name: sqft_lot, dtype: float64



Value Counts:



5000      0.016576
6000      0.013428
4000      0.011622
7200      0.010187
7500      0.005510
            ...   
1448      0.000046
38884     0.000046
17313     0.000046
35752     0.000046
315374    0.000046
Name: sqft_lot, Length: 9776, dtype: float64

png

floors

check_column(df,'floors')
DataType:
	 float64
Null Values Present = 0



count    21597.000
mean         1.494
std          0.540
min          1.000
25%          1.000
50%          1.500
75%          2.000
max          3.500
Name: floors, dtype: float64



Value Counts:



1.0    0.494189
2.0    0.381303
1.5    0.088438
3.0    0.028291
2.5    0.007455
3.5    0.000324
Name: floors, dtype: float64

png

waterfront

check_column(df,'waterfront')
DataType:
	 float64
Null Values Present = 2376



count    19221.000
mean         0.008
std          0.087
min          0.000
25%          0.000
50%          0.000
75%          0.000
max          1.000
Name: waterfront, dtype: float64



Value Counts:



0.0    0.992404
1.0    0.007596
Name: waterfront, dtype: float64

png

df['waterfront'].fillna(0.0, inplace=True)

df['waterfront'].isna().sum()
0
df['wf'] = df['waterfront'].copy()

df['wf'].value_counts()
0.0    21451
1.0      146
Name: wf, dtype: int64

view

check_column(df,'view')
DataType:
	 float64
Null Values Present = 63



count    21534.000
mean         0.234
std          0.766
min          0.000
25%          0.000
50%          0.000
75%          0.000
max          4.000
Name: view, dtype: float64



Value Counts:



0.0    0.901923
2.0    0.044441
3.0    0.023591
1.0    0.015325
4.0    0.014721
Name: view, dtype: float64

png

df['view'].fillna(0, inplace=True)

df['view'].isna().sum()
0
df['viewed'] = df['view'].astype('bool')

df['viewed'].value_counts()
False    19485
True      2112
Name: viewed, dtype: int64

condition

check_column(df,'condition')
DataType:
	 int64
Null Values Present = 0



count    21597.000
mean         3.410
std          0.651
min          1.000
25%          3.000
50%          3.000
75%          4.000
max          5.000
Name: condition, dtype: float64



Value Counts:



3    0.649164
4    0.262861
5    0.078761
2    0.007871
1    0.001343
Name: condition, dtype: float64

png

grade

check_column(df,'grade')
DataType:
	 int64
Null Values Present = 0



count    21597.000
mean         7.658
std          1.173
min          3.000
25%          7.000
50%          7.000
75%          8.000
max         13.000
Name: grade, dtype: float64



Value Counts:



7     0.415521
8     0.280826
9     0.121082
6     0.094365
10    0.052507
11    0.018475
5     0.011205
12    0.004121
4     0.001250
13    0.000602
3     0.000046
Name: grade, dtype: float64

png

sqft_above

check_column(df,'sqft_above')
DataType:
	 int64
Null Values Present = 0



count    21597.000
mean      1788.597
std        827.760
min        370.000
25%       1190.000
50%       1560.000
75%       2210.000
max       9410.000
Name: sqft_above, dtype: float64



Value Counts:



1300    0.009816
1010    0.009724
1200    0.009538
1220    0.008890
1140    0.008520
          ...   
2601    0.000046
440     0.000046
2473    0.000046
2441    0.000046
1975    0.000046
Name: sqft_above, Length: 942, dtype: float64

png

sqft_basement

df['sqft_basement'].replace(to_replace='?', value='0.0', inplace=True)
df['sqft_basement'] = df['sqft_basement'].astype('float')
check_column(df,'sqft_basement')
DataType:
	 float64
Null Values Present = 0



count    21597.000
mean       285.717
std        439.820
min          0.000
25%          0.000
50%          0.000
75%        550.000
max       4820.000
Name: sqft_basement, dtype: float64



Value Counts:



0.0       0.614900
600.0     0.010048
500.0     0.009677
700.0     0.009631
800.0     0.009307
            ...   
915.0     0.000046
295.0     0.000046
1281.0    0.000046
2130.0    0.000046
906.0     0.000046
Name: sqft_basement, Length: 303, dtype: float64

png

yr_built

check_column(df,'yr_built')
DataType:
	 int64
Null Values Present = 0



count    21597.000
mean      1971.000
std         29.375
min       1900.000
25%       1951.000
50%       1975.000
75%       1997.000
max       2015.000
Name: yr_built, dtype: float64



Value Counts:



2014    0.025883
2006    0.020975
2005    0.020836
2004    0.020049
2003    0.019447
          ...   
1933    0.001389
1901    0.001343
1902    0.001250
1935    0.001111
1934    0.000972
Name: yr_built, Length: 116, dtype: float64

png

quantile_list = [0, .25, .5, .75, 1.]
quantiles = df['yr_built'].quantile(quantile_list)

quantiles 
0.00    1900.0
0.25    1951.0
0.50    1975.0
0.75    1997.0
1.00    2015.0
Name: yr_built, dtype: float64
yr_built_bins = [1899, 1951, 1975, 1997, 2015]
yb_labels = [1, 2, 3, 4]

df['yr_range'] = pd.cut(df['yr_built'], bins=yr_built_bins)

df['yr_category'] = pd.cut(df['yr_built'], bins=yr_built_bins, labels=yb_labels) 
df[['yr_built','yr_range', 'yr_category']].iloc[800:810] 
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
yr_built yr_range yr_category
800 1913 (1899, 1951] 1
801 1967 (1951, 1975] 2
802 1987 (1975, 1997] 3
803 2007 (1997, 2015] 4
804 1954 (1951, 1975] 2
805 1989 (1975, 1997] 3
806 1989 (1975, 1997] 3
807 1977 (1975, 1997] 3
808 2004 (1997, 2015] 4
809 1999 (1997, 2015] 4

yr_renovated

check_column(df,'yr_renovated')
DataType:
	 float64
Null Values Present = 3842



count    17755.000
mean        83.637
std        399.946
min          0.000
25%          0.000
50%          0.000
75%          0.000
max       2015.000
Name: yr_renovated, dtype: float64



Value Counts:



0.0       0.958096
2014.0    0.004112
2003.0    0.001746
2013.0    0.001746
2007.0    0.001690
            ...   
1946.0    0.000056
1959.0    0.000056
1971.0    0.000056
1951.0    0.000056
1954.0    0.000056
Name: yr_renovated, Length: 70, dtype: float64

png

df['yr_renovated'].fillna(0.0, inplace=True)

df['yr_renovated'].isna().sum()
0
df['is_renovated'] = np.where (df['yr_renovated'] == 0.0, 0, 1)

df['is_renovated'].value_counts()
0    20853
1      744
Name: is_renovated, dtype: int64

zipcode

check_column(df,'zipcode')
DataType:
	 int64
Null Values Present = 0



count    21597.000
mean     98077.952
std         53.513
min      98001.000
25%      98033.000
50%      98065.000
75%      98118.000
max      98199.000
Name: zipcode, dtype: float64



Value Counts:



98103    0.027874
98038    0.027272
98115    0.026994
98052    0.026578
98117    0.025605
           ...   
98102    0.004815
98010    0.004630
98024    0.003704
98148    0.002639
98039    0.002315
Name: zipcode, Length: 70, dtype: float64

png

lat long

Eliminated due to correlation with zipcodes.

sqft_living15

check_column(df,'sqft_living15')
DataType:
	 int64
Null Values Present = 0



count    21597.00
mean      1986.62
std        685.23
min        399.00
25%       1490.00
50%       1840.00
75%       2360.00
max       6210.00
Name: sqft_living15, dtype: float64



Value Counts:



1540    0.009122
1440    0.009029
1560    0.008890
1500    0.008334
1460    0.007825
          ...   
4890    0.000046
2873    0.000046
952     0.000046
3193    0.000046
2049    0.000046
Name: sqft_living15, Length: 777, dtype: float64

png

sqft_lot15

check_column(df,'sqft_lot15')
DataType:
	 int64
Null Values Present = 0



count     21597.000
mean      12758.284
std       27274.442
min         651.000
25%        5100.000
50%        7620.000
75%       10083.000
max      871200.000
Name: sqft_lot15, dtype: float64



Value Counts:



5000      0.019771
4000      0.016484
6000      0.013335
7200      0.009724
4800      0.006714
            ...   
11036     0.000046
8989      0.000046
871200    0.000046
809       0.000046
6147      0.000046
Name: sqft_lot15, Length: 8682, dtype: float64

png

EXPLORE

Correlation

corr = df.corr()

def corrplot(corr,figsize=(20,20)):
    fig, ax = plt.subplots(figsize=figsize)

    mask = np.zeros_like(corr, dtype=np.bool)
    idx = np.triu_indices_from(mask)
    mask[idx] = True

    sns.heatmap(np.abs(corr),square=True,mask=mask,cmap="Blues",annot=True,ax=ax)
    ax.set_ylim(len(corr), -.5, .5)
    return fig, ax

corrplot(np.abs(corr.round(3)))
(<Figure size 1440x1440 with 2 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x1a21b30780>)

png

Drop "sqft_above" due to correlation with "sqft_living" and "grade". Drop "id" and "date" due to no significant statistical meaning. Drop "lat" and "long" due to correlation with "zipcode".

drop_cols1 = ['sqft_above','id','lat', 'long', 'date']
df.drop(drop_cols1,axis=1,inplace=True)
#df_post_corr.columns

df.columns
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
       'waterfront', 'view', 'condition', 'grade', 'sqft_basement', 'yr_built',
       'yr_renovated', 'zipcode', 'sqft_living15', 'sqft_lot15', 'wf',
       'viewed', 'yr_range', 'yr_category', 'is_renovated'],
      dtype='object')
def histograms(df):
    plt.style.use('ggplot')
    for column in df.describe():
        fig = plt.figure(figsize=(12, 5))
        
        ax = fig.add_subplot(121)
        ax.hist(df[column], density=True, label = column+' histogram', bins=20)
        ax.set_title(column.capitalize())

        ax.legend()
        
        fig.tight_layout()
        
df.describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated
count 2.159700e+04 21597.000000 21597.000000 21597.000000 2.159700e+04 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000 21597.000000
mean 5.402966e+05 3.373200 2.115826 2080.321850 1.509941e+04 1.494096 0.006760 0.233181 3.409825 7.657915 285.716581 1970.999676 68.758207 98077.951845 1986.620318 12758.283512 0.006760 0.034449
std 3.673681e+05 0.926299 0.768984 918.106125 4.141264e+04 0.539683 0.081944 0.764673 0.650546 1.173200 439.819830 29.375234 364.037499 53.513072 685.230472 27274.441950 0.081944 0.182384
min 7.800000e+04 1.000000 0.500000 370.000000 5.200000e+02 1.000000 0.000000 0.000000 1.000000 3.000000 0.000000 1900.000000 0.000000 98001.000000 399.000000 651.000000 0.000000 0.000000
25% 3.220000e+05 3.000000 1.750000 1430.000000 5.040000e+03 1.000000 0.000000 0.000000 3.000000 7.000000 0.000000 1951.000000 0.000000 98033.000000 1490.000000 5100.000000 0.000000 0.000000
50% 4.500000e+05 3.000000 2.250000 1910.000000 7.618000e+03 1.500000 0.000000 0.000000 3.000000 7.000000 0.000000 1975.000000 0.000000 98065.000000 1840.000000 7620.000000 0.000000 0.000000
75% 6.450000e+05 4.000000 2.500000 2550.000000 1.068500e+04 2.000000 0.000000 0.000000 4.000000 8.000000 550.000000 1997.000000 0.000000 98118.000000 2360.000000 10083.000000 0.000000 0.000000
max 7.700000e+06 33.000000 8.000000 13540.000000 1.651359e+06 3.500000 1.000000 4.000000 5.000000 13.000000 4820.000000 2015.000000 2015.000000 98199.000000 6210.000000 871200.000000 1.000000 1.000000
histograms(df)

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

log_cols = ['sqft_living', 'sqft_living15']


for col in log_cols:
    df[col+'_log'] = np.log(df[col])
plt.style.use('ggplot')
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(12,12))
sns.distplot(df['sqft_living'], ax=ax[0][0])
sns.distplot(df['sqft_living_log'], ax=ax[0][1])
sns.distplot(df['sqft_living15'], ax=ax[1][0])
sns.distplot(df['sqft_living15_log'], ax=ax[1][1])
<matplotlib.axes._subplots.AxesSubplot at 0x1a2386a358>

png

df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf viewed yr_range yr_category is_renovated sqft_living_log sqft_living15_log
0 221900.0 3 1.00 1180 5650 1.0 0.0 0.0 3 7 0.0 1955 0.0 98178 1340 5650 0.0 False (1951, 1975] 2 0 7.073270 7.200425
1 538000.0 3 2.25 2570 7242 2.0 0.0 0.0 3 7 400.0 1951 1991.0 98125 1690 7639 0.0 False (1899, 1951] 1 1 7.851661 7.432484
2 180000.0 2 1.00 770 10000 1.0 0.0 0.0 3 6 0.0 1933 0.0 98028 2720 8062 0.0 False (1899, 1951] 1 0 6.646391 7.908387
3 604000.0 4 3.00 1960 5000 1.0 0.0 0.0 5 7 910.0 1965 0.0 98136 1360 5000 0.0 False (1951, 1975] 2 0 7.580700 7.215240
4 510000.0 3 2.00 1680 8080 1.0 0.0 0.0 3 8 0.0 1987 0.0 98074 1800 7503 0.0 False (1975, 1997] 3 0 7.426549 7.495542

Scaling

Scale everything except the target and boolean variables.

import warnings
warnings.filterwarnings('ignore')

scale_cols = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'view', 'condition',
              'grade', 'sqft_basement', 'sqft_living15', 'sqft_lot15', 'sqft_living_log',
              'sqft_living15_log']

scaler = StandardScaler()


for col in scale_cols:
    col_data = df[col].values
    stdscale = scaler.fit_transform(col_data.reshape(-1, 1))
    df['sca_'+col] = stdscale.flatten()
    
df.describe().round(3)    
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated sqft_living_log sqft_living15_log sca_bedrooms sca_bathrooms sca_sqft_living sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_living15 sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
count 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.00 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000 21597.000
mean 540296.574 3.373 2.116 2080.322 15099.409 1.494 0.007 0.233 3.410 7.658 285.717 1971.000 68.758 98077.952 1986.62 12758.284 0.007 0.034 7.551 7.539 -0.000 0.000 -0.000 0.000 -0.000 0.000 0.000 0.000 -0.000 0.000 0.000 -0.000 -0.000
std 367368.140 0.926 0.769 918.106 41412.637 0.540 0.082 0.765 0.651 1.173 439.820 29.375 364.037 53.513 685.23 27274.442 0.082 0.182 0.424 0.327 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
min 78000.000 1.000 0.500 370.000 520.000 1.000 0.000 0.000 1.000 3.000 0.000 1900.000 0.000 98001.000 399.00 651.000 0.000 0.000 5.914 5.989 -2.562 -2.101 -1.863 -0.352 -0.916 -0.305 -3.704 -3.970 -0.650 -2.317 -0.444 -3.860 -4.735
25% 322000.000 3.000 1.750 1430.000 5040.000 1.000 0.000 0.000 3.000 7.000 0.000 1951.000 0.000 98033.000 1490.00 5100.000 0.000 0.000 7.265 7.307 -0.403 -0.476 -0.708 -0.243 -0.916 -0.305 -0.630 -0.561 -0.650 -0.725 -0.281 -0.673 -0.711
50% 450000.000 3.000 2.250 1910.000 7618.000 1.500 0.000 0.000 3.000 7.000 0.000 1975.000 0.000 98065.000 1840.00 7620.000 0.000 0.000 7.555 7.518 -0.403 0.174 -0.186 -0.181 0.011 -0.305 -0.630 -0.561 -0.650 -0.214 -0.188 0.010 -0.067
75% 645000.000 4.000 2.500 2550.000 10685.000 2.000 0.000 0.000 4.000 8.000 550.000 1997.000 0.000 98118.000 2360.00 10083.000 0.000 0.000 7.844 7.766 0.677 0.500 0.512 -0.107 0.937 -0.305 0.907 0.292 0.601 0.545 -0.098 0.691 0.693
max 7700000.000 33.000 8.000 13540.000 1651359.000 3.500 1.000 4.000 5.000 13.000 4820.000 2015.000 2015.000 98199.000 6210.00 871200.000 1.000 1.000 9.513 8.734 31.985 7.652 12.482 39.512 3.717 4.926 2.444 4.554 10.310 6.164 31.475 4.627 3.648
df.shape
(21597, 36)

Drop Outliers

def find_outliers(col):
    """Use scipy to calcualte absoliute Z-scores 
    and return boolean series where True indicates it is an outlier
    Args:
        col (Series): a series/column from your DataFrame
    Returns:
        idx_outliers (Series): series of  True/False for each row in col
        
    Ex:
    >> idx_outs = find_outliers(df['bedrooms'])
    >> df_clean = df.loc[idx_outs==False]"""
    from scipy import stats
    z = np.abs(stats.zscore(col))
    idx_outliers = np.where(z>3,True,False)
    return pd.Series(idx_outliers,index=col.index)

idx = find_outliers(df['price'])
idx
0        False
1        False
2        False
3        False
4        False
         ...  
21592    False
21593    False
21594    False
21595    False
21596    False
Length: 21597, dtype: bool
display(df.loc[idx==True].describe().round(3))
display(df.loc[idx==False].describe().round(3))
df = df.loc[idx==False]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated sqft_living_log sqft_living15_log sca_bedrooms sca_bathrooms sca_sqft_living sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_living15 sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
count 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.00 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000 406.000
mean 2277660.099 4.266 3.594 4577.089 29331.751 1.872 0.163 1.717 3.502 10.404 842.675 1974.818 250.970 98060.32 3355.771 19364.182 0.163 0.126 8.390 8.088 0.964 1.922 2.720 0.344 0.700 1.940 0.142 2.341 1.266 1.998 0.242 1.978 1.676
std 768348.039 0.944 0.978 1395.282 76438.853 0.474 0.369 1.675 0.726 1.163 809.210 33.560 662.969 56.07 815.443 30513.711 0.369 0.332 0.272 0.250 1.020 1.272 1.520 1.846 0.878 2.191 1.116 0.991 1.840 1.190 1.119 0.641 0.763
min 1650000.000 2.000 1.750 2360.000 1880.000 1.000 0.000 0.000 2.000 7.000 0.000 1900.000 0.000 98004.00 1490.000 2199.000 0.000 0.000 7.766 7.307 -1.482 -0.476 0.305 -0.319 -0.916 -0.305 -2.167 -0.561 -0.650 -0.725 -0.387 0.509 -0.711
25% 1792500.000 4.000 3.000 3715.000 8864.500 2.000 0.000 0.000 3.000 10.000 0.000 1950.000 0.000 98006.00 2870.000 8557.250 0.000 0.000 8.220 7.962 0.677 1.150 1.781 -0.151 0.937 -0.305 -0.630 1.996 -0.650 1.289 -0.154 1.578 1.290
50% 2000000.000 4.000 3.500 4325.000 14754.000 2.000 0.000 2.000 3.000 10.000 835.000 1988.000 0.000 98040.00 3290.000 13224.500 0.000 0.000 8.372 8.099 0.677 1.800 2.445 -0.008 0.937 2.311 -0.630 1.996 1.249 1.902 0.017 1.937 1.708
75% 2497500.000 5.000 4.250 5150.000 21241.000 2.000 0.000 3.000 4.000 11.000 1377.500 2004.000 0.000 98112.00 3847.500 19273.250 0.000 0.000 8.547 8.255 1.756 2.775 3.344 0.148 0.937 3.618 0.907 2.849 2.482 2.716 0.239 2.348 2.186
max 7700000.000 8.000 8.000 13540.000 920423.000 3.500 1.000 4.000 5.000 13.000 4820.000 2015.000 2014.000 98199.00 6210.000 411962.000 1.000 1.000 9.513 8.734 4.995 7.652 12.482 21.862 3.717 4.926 2.444 4.554 10.310 6.164 14.637 4.627 3.648
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated sqft_living_log sqft_living15_log sca_bedrooms sca_bathrooms sca_sqft_living sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_living15 sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
count 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000
mean 507010.292 3.356 2.088 2032.486 14826.730 1.487 0.004 0.205 3.408 7.605 275.046 1970.927 65.267 98078.290 1960.389 12631.721 0.004 0.033 7.535 7.529 -0.018 -0.037 -0.052 -0.007 -0.013 -0.037 -0.003 -0.045 -0.024 -0.038 -0.005 -0.038 -0.032
std 259462.210 0.918 0.736 836.739 40400.947 0.538 0.061 0.707 0.649 1.109 422.581 29.285 354.984 53.407 655.151 27193.757 0.061 0.178 0.410 0.320 0.991 0.957 0.911 0.976 0.997 0.924 0.997 0.945 0.961 0.956 0.997 0.967 0.976
min 78000.000 1.000 0.500 370.000 520.000 1.000 0.000 0.000 1.000 3.000 0.000 1900.000 0.000 98001.000 399.000 651.000 0.000 0.000 5.914 5.989 -2.562 -2.101 -1.863 -0.352 -0.916 -0.305 -3.704 -3.970 -0.650 -2.317 -0.444 -3.860 -4.735
25% 320000.000 3.000 1.500 1410.000 5005.500 1.000 0.000 0.000 3.000 7.000 0.000 1951.000 0.000 98033.000 1480.000 5080.000 0.000 0.000 7.251 7.300 -0.403 -0.801 -0.730 -0.244 -0.916 -0.305 -0.630 -0.561 -0.650 -0.739 -0.282 -0.706 -0.732
50% 447000.000 3.000 2.250 1890.000 7560.000 1.000 0.000 0.000 3.000 7.000 0.000 1975.000 0.000 98065.000 1820.000 7576.000 0.000 0.000 7.544 7.507 -0.403 0.174 -0.207 -0.182 -0.916 -0.305 -0.630 -0.561 -0.650 -0.243 -0.190 -0.015 -0.100
75% 627650.000 4.000 2.500 2500.000 10490.500 2.000 0.000 0.000 4.000 8.000 530.000 1996.000 0.000 98118.000 2330.000 10000.000 0.000 0.000 7.824 7.754 0.677 0.500 0.457 -0.111 0.937 -0.305 0.907 0.292 0.555 0.501 -0.101 0.644 0.654
max 1640000.000 33.000 7.500 7480.000 1651359.000 3.500 1.000 4.000 5.000 12.000 2850.000 2015.000 2015.000 98199.000 5790.000 871200.000 1.000 1.000 8.920 8.664 31.985 7.002 5.881 39.512 3.717 4.926 2.444 3.701 5.830 5.551 31.475 3.228 3.434
df_outliers = pd.DataFrame()
for col in df.describe().columns:
    df_outliers[col] = find_outliers(df[col])
df_outliers.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated sqft_living_log sqft_living15_log sca_bedrooms sca_bathrooms sca_sqft_living sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_living15 sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
0 False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False
1 False False False False False False False False False False False False True False False False False True False False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False
test_outs = df_outliers.apply(lambda x: np.any(x),axis=1)
print(len(test_outs), df_outliers.shape)
test_outs
np.shape(test_outs)
21191 (21191, 33)





(21191,)
df.shape
(21191, 36)
np.sum(test_outs)
2536
df.loc[test_outs==False].describe().round(3)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated sqft_living_log sqft_living15_log sca_bedrooms sca_bathrooms sca_sqft_living sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_living15 sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
count 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.0 18655.000 18655.000 18655.000 18655.000 18655.000 18655.0 18655.000 18655.000 18655.000 18655.0 18655.0 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000
mean 467430.582 3.311 2.023 1922.376 9825.017 1.475 0.0 0.088 3.419 7.493 243.632 1971.575 0.0 98078.347 1889.875 9101.479 0.0 0.0 7.492 7.499 -0.067 -0.121 -0.172 -0.127 -0.035 -0.190 0.014 -0.141 -0.096 -0.141 -0.134 -0.139 -0.123
std 207577.583 0.841 0.677 710.718 11390.237 0.539 0.0 0.394 0.646 0.973 376.714 28.968 0.0 53.233 578.449 9114.839 0.0 0.0 0.379 0.299 0.908 0.881 0.774 0.275 0.999 0.515 0.993 0.829 0.857 0.844 0.334 0.894 0.912
min 82000.000 1.000 0.500 550.000 520.000 1.000 0.0 0.000 2.000 5.000 0.000 1900.000 0.0 98001.000 720.000 651.000 0.0 0.0 6.310 6.579 -2.562 -2.101 -1.667 -0.352 -0.916 -0.305 -2.167 -2.266 -0.650 -1.849 -0.444 -2.925 -2.932
25% 310000.000 3.000 1.500 1380.000 5000.000 1.000 0.0 0.000 3.000 7.000 0.000 1953.000 0.0 98033.000 1460.000 5000.000 0.0 0.0 7.230 7.286 -0.403 -0.801 -0.763 -0.244 -0.916 -0.305 -0.630 -0.561 -0.650 -0.769 -0.284 -0.756 -0.774
50% 427005.000 3.000 2.000 1820.000 7350.000 1.000 0.0 0.000 3.000 7.000 0.000 1975.000 0.0 98065.000 1780.000 7440.000 0.0 0.0 7.507 7.484 -0.403 -0.151 -0.284 -0.187 -0.916 -0.305 -0.630 -0.561 -0.650 -0.302 -0.195 -0.104 -0.168
75% 585000.000 4.000 2.500 2370.000 9879.000 2.000 0.0 0.000 4.000 8.000 480.000 1997.000 0.0 98118.000 2240.000 9600.000 0.0 0.0 7.771 7.714 0.677 0.500 0.316 -0.126 0.937 -0.305 0.907 0.292 0.442 0.370 -0.116 0.518 0.534
max 1280000.000 6.000 4.250 4530.000 134489.000 3.000 0.0 2.000 5.000 10.000 1540.000 2015.000 0.0 98199.000 3920.000 93825.000 0.0 0.0 8.418 8.274 2.836 2.775 2.668 2.883 2.790 2.311 2.444 1.996 2.852 2.822 2.972 2.046 2.243
df_outliers.describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated sqft_living_log sqft_living15_log sca_bedrooms sca_bathrooms sca_sqft_living sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_living15 sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
count 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191 21191
unique 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
top False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False
freq 20774 21139 21081 20997 20852 21185 21111 20522 21162 20848 20965 21191 20498 21191 20976 20837 21111 20498 21155 21160 21139 21081 20997 20852 21185 20522 21162 20848 20965 20976 20837 21155 21160
df_clean = df.loc[test_outs==False]
df_clean.describe().round(3)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated sqft_living_log sqft_living15_log sca_bedrooms sca_bathrooms sca_sqft_living sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_living15 sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
count 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.0 18655.000 18655.000 18655.000 18655.000 18655.000 18655.0 18655.000 18655.000 18655.000 18655.0 18655.0 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000
mean 467430.582 3.311 2.023 1922.376 9825.017 1.475 0.0 0.088 3.419 7.493 243.632 1971.575 0.0 98078.347 1889.875 9101.479 0.0 0.0 7.492 7.499 -0.067 -0.121 -0.172 -0.127 -0.035 -0.190 0.014 -0.141 -0.096 -0.141 -0.134 -0.139 -0.123
std 207577.583 0.841 0.677 710.718 11390.237 0.539 0.0 0.394 0.646 0.973 376.714 28.968 0.0 53.233 578.449 9114.839 0.0 0.0 0.379 0.299 0.908 0.881 0.774 0.275 0.999 0.515 0.993 0.829 0.857 0.844 0.334 0.894 0.912
min 82000.000 1.000 0.500 550.000 520.000 1.000 0.0 0.000 2.000 5.000 0.000 1900.000 0.0 98001.000 720.000 651.000 0.0 0.0 6.310 6.579 -2.562 -2.101 -1.667 -0.352 -0.916 -0.305 -2.167 -2.266 -0.650 -1.849 -0.444 -2.925 -2.932
25% 310000.000 3.000 1.500 1380.000 5000.000 1.000 0.0 0.000 3.000 7.000 0.000 1953.000 0.0 98033.000 1460.000 5000.000 0.0 0.0 7.230 7.286 -0.403 -0.801 -0.763 -0.244 -0.916 -0.305 -0.630 -0.561 -0.650 -0.769 -0.284 -0.756 -0.774
50% 427005.000 3.000 2.000 1820.000 7350.000 1.000 0.0 0.000 3.000 7.000 0.000 1975.000 0.0 98065.000 1780.000 7440.000 0.0 0.0 7.507 7.484 -0.403 -0.151 -0.284 -0.187 -0.916 -0.305 -0.630 -0.561 -0.650 -0.302 -0.195 -0.104 -0.168
75% 585000.000 4.000 2.500 2370.000 9879.000 2.000 0.0 0.000 4.000 8.000 480.000 1997.000 0.0 98118.000 2240.000 9600.000 0.0 0.0 7.771 7.714 0.677 0.500 0.316 -0.126 0.937 -0.305 0.907 0.292 0.442 0.370 -0.116 0.518 0.534
max 1280000.000 6.000 4.250 4530.000 134489.000 3.000 0.0 2.000 5.000 10.000 1540.000 2015.000 0.0 98199.000 3920.000 93825.000 0.0 0.0 8.418 8.274 2.836 2.775 2.668 2.883 2.790 2.311 2.444 1.996 2.852 2.822 2.972 2.046 2.243
df.describe().round(3)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated sqft_living_log sqft_living15_log sca_bedrooms sca_bathrooms sca_sqft_living sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_living15 sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
count 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000 21191.000
mean 507010.292 3.356 2.088 2032.486 14826.730 1.487 0.004 0.205 3.408 7.605 275.046 1970.927 65.267 98078.290 1960.389 12631.721 0.004 0.033 7.535 7.529 -0.018 -0.037 -0.052 -0.007 -0.013 -0.037 -0.003 -0.045 -0.024 -0.038 -0.005 -0.038 -0.032
std 259462.210 0.918 0.736 836.739 40400.947 0.538 0.061 0.707 0.649 1.109 422.581 29.285 354.984 53.407 655.151 27193.757 0.061 0.178 0.410 0.320 0.991 0.957 0.911 0.976 0.997 0.924 0.997 0.945 0.961 0.956 0.997 0.967 0.976
min 78000.000 1.000 0.500 370.000 520.000 1.000 0.000 0.000 1.000 3.000 0.000 1900.000 0.000 98001.000 399.000 651.000 0.000 0.000 5.914 5.989 -2.562 -2.101 -1.863 -0.352 -0.916 -0.305 -3.704 -3.970 -0.650 -2.317 -0.444 -3.860 -4.735
25% 320000.000 3.000 1.500 1410.000 5005.500 1.000 0.000 0.000 3.000 7.000 0.000 1951.000 0.000 98033.000 1480.000 5080.000 0.000 0.000 7.251 7.300 -0.403 -0.801 -0.730 -0.244 -0.916 -0.305 -0.630 -0.561 -0.650 -0.739 -0.282 -0.706 -0.732
50% 447000.000 3.000 2.250 1890.000 7560.000 1.000 0.000 0.000 3.000 7.000 0.000 1975.000 0.000 98065.000 1820.000 7576.000 0.000 0.000 7.544 7.507 -0.403 0.174 -0.207 -0.182 -0.916 -0.305 -0.630 -0.561 -0.650 -0.243 -0.190 -0.015 -0.100
75% 627650.000 4.000 2.500 2500.000 10490.500 2.000 0.000 0.000 4.000 8.000 530.000 1996.000 0.000 98118.000 2330.000 10000.000 0.000 0.000 7.824 7.754 0.677 0.500 0.457 -0.111 0.937 -0.305 0.907 0.292 0.555 0.501 -0.101 0.644 0.654
max 1640000.000 33.000 7.500 7480.000 1651359.000 3.500 1.000 4.000 5.000 12.000 2850.000 2015.000 2015.000 98199.000 5790.000 871200.000 1.000 1.000 8.920 8.664 31.985 7.002 5.881 39.512 3.717 4.926 2.444 3.701 5.830 5.551 31.475 3.228 3.434
df_clean.describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_basement yr_built yr_renovated zipcode sqft_living15 sqft_lot15 wf is_renovated sqft_living_log sqft_living15_log sca_bedrooms sca_bathrooms sca_sqft_living sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_living15 sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
count 1.865500e+04 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.0 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.0 18655.000000 18655.000000 18655.000000 18655.0 18655.0 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000 18655.000000
mean 4.674306e+05 3.310855 2.022661 1922.376253 9825.017261 1.475449 0.0 0.088073 3.419137 7.492951 243.632485 1971.574752 0.0 98078.346717 1889.875422 9101.479282 0.0 0.0 7.491752 7.499351 -0.067307 -0.121156 -0.172038 -0.127365 -0.034553 -0.189769 0.014314 -0.140614 -0.095687 -0.141189 -0.134077 -0.139016 -0.122589
std 2.075776e+05 0.841215 0.677349 710.717642 11390.236520 0.538919 0.0 0.393627 0.645882 0.972850 376.714254 28.967732 0.0 53.233162 578.449276 9114.839090 0.0 0.0 0.379132 0.298795 0.908167 0.880856 0.774131 0.275049 0.998608 0.514778 0.992854 0.829247 0.856539 0.844187 0.334197 0.893798 0.912463
min 8.200000e+04 1.000000 0.500000 550.000000 520.000000 1.000000 0.0 0.000000 2.000000 5.000000 0.000000 1900.000000 0.0 98001.000000 720.000000 651.000000 0.0 0.0 6.309918 6.579251 -2.562083 -2.101296 -1.666863 -0.352060 -0.915552 -0.304949 -2.167193 -2.265579 -0.649637 -1.848502 -0.443916 -2.925170 -2.932400
25% 3.100000e+05 3.000000 1.500000 1380.000000 5000.000000 1.000000 0.0 0.000000 3.000000 7.000000 0.000000 1953.000000 0.0 98033.000000 1460.000000 5000.000000 0.0 0.0 7.229839 7.286192 -0.402903 -0.800849 -0.762807 -0.243878 -0.915552 -0.304949 -0.629986 -0.560800 -0.649637 -0.768548 -0.284459 -0.756473 -0.773537
50% 4.270050e+05 3.000000 2.000000 1820.000000 7350.000000 1.000000 0.0 0.000000 3.000000 7.000000 0.000000 1975.000000 0.0 98065.000000 1780.000000 7440.000000 0.0 0.0 7.506592 7.484369 -0.402903 -0.150626 -0.283549 -0.187131 -0.915552 -0.304949 -0.629986 -0.560800 -0.649637 -0.301541 -0.194996 -0.104033 -0.168342
75% 5.850000e+05 4.000000 2.500000 2370.000000 9879.000000 2.000000 0.0 0.000000 4.000000 8.000000 480.000000 1997.000000 0.0 98118.000000 2240.000000 9600.000000 0.0 0.0 7.770645 7.714231 0.676687 0.499597 0.315524 -0.126061 0.937431 -0.304949 0.907220 0.291589 0.441744 0.369781 -0.115799 0.518469 0.533615
max 1.280000e+06 6.000000 4.250000 4530.000000 134489.000000 3.000000 0.0 2.000000 5.000000 10.000000 1540.000000 2015.000000 0.0 98199.000000 3920.000000 93825.000000 0.0 0.0 8.418477 8.273847 2.835867 2.775379 2.668248 2.882993 2.790414 2.310610 2.444427 1.996368 2.851878 2.821568 2.972328 2.045722 2.242576

Drop columns that are duplicates with transformed columns.

drop_cols = ['sqft_living_log','sqft_living15_log', 'sqft_living15', 'bedrooms', 
             'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view',
             'condition', 'grade', 'sqft_basement', 'yr_built', 'yr_renovated', 'sqft_living', 
             'sqft_lot15', 'yr_range', 'sca_sqft_living', 'sca_sqft_living15']

df_clean.drop(drop_cols,axis=1,inplace=True) 
display(df_clean.describe().round(3))
display(df_clean.shape)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
price zipcode wf is_renovated sca_bedrooms sca_bathrooms sca_sqft_lot sca_floors sca_view sca_condition sca_grade sca_sqft_basement sca_sqft_lot15 sca_sqft_living_log sca_sqft_living15_log
count 18655.000 18655.000 18655.0 18655.0 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000 18655.000
mean 467430.582 98078.347 0.0 0.0 -0.067 -0.121 -0.127 -0.035 -0.190 0.014 -0.141 -0.096 -0.134 -0.139 -0.123
std 207577.583 53.233 0.0 0.0 0.908 0.881 0.275 0.999 0.515 0.993 0.829 0.857 0.334 0.894 0.912
min 82000.000 98001.000 0.0 0.0 -2.562 -2.101 -0.352 -0.916 -0.305 -2.167 -2.266 -0.650 -0.444 -2.925 -2.932
25% 310000.000 98033.000 0.0 0.0 -0.403 -0.801 -0.244 -0.916 -0.305 -0.630 -0.561 -0.650 -0.284 -0.756 -0.774
50% 427005.000 98065.000 0.0 0.0 -0.403 -0.151 -0.187 -0.916 -0.305 -0.630 -0.561 -0.650 -0.195 -0.104 -0.168
75% 585000.000 98118.000 0.0 0.0 0.677 0.500 -0.126 0.937 -0.305 0.907 0.292 0.442 -0.116 0.518 0.534
max 1280000.000 98199.000 0.0 0.0 2.836 2.775 2.883 2.790 2.311 2.444 1.996 2.852 2.972 2.046 2.243
(18655, 17)
def histograms(df):
    plt.style.use('ggplot')
    for column in df.describe():
        fig = plt.figure(figsize=(12, 5))
        
        ax = fig.add_subplot(121)
        ax.hist(df[column], density=True, label = column+' histogram', bins=20)
        ax.set_title(column.capitalize())

        ax.legend()
        
        fig.tight_layout()
histograms(df_clean)

png

png

png

png

png

png

png

png

png

png

png

png

png

png

png

Correlation Post Transformation

corr = df_clean.corr()

def corrplot(corr,figsize=(20,20)):
    fig, ax = plt.subplots(figsize=figsize)

    mask = np.zeros_like(corr, dtype=np.bool)
    idx = np.triu_indices_from(mask)
    mask[idx] = True

    sns.heatmap(np.abs(corr),square=True,mask=mask,cmap="Blues",annot=True,ax=ax)
    ax.set_ylim(len(corr), -.5, .5)
    return fig, ax

corrplot(corr.round(3))
(<Figure size 1440x1440 with 2 Axes>,
 <matplotlib.axes._subplots.AxesSubplot at 0x1a23e06f60>)

png

df_clean.isna().sum()
price                    0
zipcode                  0
wf                       0
viewed                   0
yr_category              0
is_renovated             0
sca_bedrooms             0
sca_bathrooms            0
sca_sqft_lot             0
sca_floors               0
sca_view                 0
sca_condition            0
sca_grade                0
sca_sqft_basement        0
sca_sqft_lot15           0
sca_sqft_living_log      0
sca_sqft_living15_log    0
dtype: int64
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18655 entries, 0 to 21596
Data columns (total 17 columns):
price                    18655 non-null float64
zipcode                  18655 non-null int64
wf                       18655 non-null float64
viewed                   18655 non-null bool
yr_category              18655 non-null category
is_renovated             18655 non-null int64
sca_bedrooms             18655 non-null float64
sca_bathrooms            18655 non-null float64
sca_sqft_lot             18655 non-null float64
sca_floors               18655 non-null float64
sca_view                 18655 non-null float64
sca_condition            18655 non-null float64
sca_grade                18655 non-null float64
sca_sqft_basement        18655 non-null float64
sca_sqft_lot15           18655 non-null float64
sca_sqft_living_log      18655 non-null float64
sca_sqft_living15_log    18655 non-null float64
dtypes: bool(1), category(1), float64(13), int64(2)
memory usage: 2.3 MB

MODEL

pred1 = ['C(zipcode)', 'wf', 'is_renovated','sca_bedrooms',
         'sca_bathrooms', 'sca_sqft_lot', 'sca_floors', 'C(viewed)',
         'sca_condition', 'sca_grade', 'sca_sqft_basement', 'sca_sqft_lot15', 
         'sca_sqft_living_log', 'sca_sqft_living15_log']
f1 = '+'.join(pred1)
f1
'C(zipcode)+wf+is_renovated+sca_bedrooms+sca_bathrooms+sca_sqft_lot+sca_floors+C(viewed)+sca_condition+sca_grade+sca_sqft_basement+sca_sqft_lot15+sca_sqft_living_log+sca_sqft_living15_log'
import statsmodels.formula.api as smf
f ='price~'+f1
model = smf.ols(formula=f, data=df_clean).fit()
model.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.818
Model: OLS Adj. R-squared: 0.817
Method: Least Squares F-statistic: 1044.
Date: Tue, 03 Dec 2019 Prob (F-statistic): 0.00
Time: 12:39:23 Log-Likelihood: -2.3898e+05
No. Observations: 18655 AIC: 4.781e+05
Df Residuals: 18574 BIC: 4.787e+05
Df Model: 80
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.071e+05 4790.120 64.108 0.000 2.98e+05 3.16e+05
C(zipcode)[T.98002] 2.048e+04 8035.653 2.549 0.011 4728.722 3.62e+04
C(zipcode)[T.98003] -2650.5864 7244.747 -0.366 0.714 -1.69e+04 1.15e+04
C(zipcode)[T.98004] 5.347e+05 8493.887 62.946 0.000 5.18e+05 5.51e+05
C(zipcode)[T.98005] 3.42e+05 9011.620 37.949 0.000 3.24e+05 3.6e+05
C(zipcode)[T.98006] 2.81e+05 6840.137 41.088 0.000 2.68e+05 2.94e+05
C(zipcode)[T.98007] 2.614e+05 9114.786 28.673 0.000 2.43e+05 2.79e+05
C(zipcode)[T.98008] 2.365e+05 7388.787 32.002 0.000 2.22e+05 2.51e+05
C(zipcode)[T.98010] 9.283e+04 1.17e+04 7.949 0.000 6.99e+04 1.16e+05
C(zipcode)[T.98011] 1.49e+05 8148.824 18.279 0.000 1.33e+05 1.65e+05
C(zipcode)[T.98014] 1.161e+05 1.11e+04 10.480 0.000 9.44e+04 1.38e+05
C(zipcode)[T.98019] 9.492e+04 8488.153 11.183 0.000 7.83e+04 1.12e+05
C(zipcode)[T.98022] 2071.1733 8585.993 0.241 0.809 -1.48e+04 1.89e+04
C(zipcode)[T.98023] -1.645e+04 6285.282 -2.618 0.009 -2.88e+04 -4132.106
C(zipcode)[T.98024] 1.518e+05 1.43e+04 10.614 0.000 1.24e+05 1.8e+05
C(zipcode)[T.98027] 2.036e+05 6848.515 29.723 0.000 1.9e+05 2.17e+05
C(zipcode)[T.98028] 1.361e+05 7224.396 18.841 0.000 1.22e+05 1.5e+05
C(zipcode)[T.98029] 2.29e+05 7034.671 32.549 0.000 2.15e+05 2.43e+05
C(zipcode)[T.98030] 2515.8696 7369.955 0.341 0.733 -1.19e+04 1.7e+04
C(zipcode)[T.98031] 6600.4871 7234.928 0.912 0.362 -7580.636 2.08e+04
C(zipcode)[T.98032] 4896.2559 9421.741 0.520 0.603 -1.36e+04 2.34e+04
C(zipcode)[T.98033] 3.353e+05 6739.767 49.754 0.000 3.22e+05 3.49e+05
C(zipcode)[T.98034] 1.886e+05 6200.227 30.416 0.000 1.76e+05 2.01e+05
C(zipcode)[T.98038] 3.322e+04 6156.302 5.396 0.000 2.12e+04 4.53e+04
C(zipcode)[T.98039] 6.73e+05 3.18e+04 21.184 0.000 6.11e+05 7.35e+05
C(zipcode)[T.98040] 4.511e+05 8627.475 52.285 0.000 4.34e+05 4.68e+05
C(zipcode)[T.98042] 4170.7247 6175.481 0.675 0.499 -7933.785 1.63e+04
C(zipcode)[T.98045] 1.071e+05 8138.186 13.161 0.000 9.12e+04 1.23e+05
C(zipcode)[T.98052] 2.599e+05 6167.518 42.140 0.000 2.48e+05 2.72e+05
C(zipcode)[T.98053] 2.429e+05 6896.170 35.225 0.000 2.29e+05 2.56e+05
C(zipcode)[T.98055] 4.904e+04 7345.893 6.676 0.000 3.46e+04 6.34e+04
C(zipcode)[T.98056] 1.106e+05 6569.333 16.840 0.000 9.77e+04 1.24e+05
C(zipcode)[T.98058] 3.418e+04 6440.315 5.308 0.000 2.16e+04 4.68e+04
C(zipcode)[T.98059] 9.422e+04 6519.671 14.451 0.000 8.14e+04 1.07e+05
C(zipcode)[T.98065] 1.393e+05 7278.012 19.144 0.000 1.25e+05 1.54e+05
C(zipcode)[T.98070] 1.211e+05 1.41e+04 8.586 0.000 9.34e+04 1.49e+05
C(zipcode)[T.98072] 1.73e+05 7506.117 23.052 0.000 1.58e+05 1.88e+05
C(zipcode)[T.98074] 2.273e+05 6659.318 34.128 0.000 2.14e+05 2.4e+05
C(zipcode)[T.98075] 2.512e+05 7164.044 35.065 0.000 2.37e+05 2.65e+05
C(zipcode)[T.98077] 1.696e+05 9020.790 18.798 0.000 1.52e+05 1.87e+05
C(zipcode)[T.98092] -2.451e+04 7034.439 -3.484 0.000 -3.83e+04 -1.07e+04
C(zipcode)[T.98102] 4.47e+05 1.09e+04 41.176 0.000 4.26e+05 4.68e+05
C(zipcode)[T.98103] 3.444e+05 6238.893 55.199 0.000 3.32e+05 3.57e+05
C(zipcode)[T.98105] 4.155e+05 8312.433 49.987 0.000 3.99e+05 4.32e+05
C(zipcode)[T.98106] 1.474e+05 6995.149 21.075 0.000 1.34e+05 1.61e+05
C(zipcode)[T.98107] 3.4e+05 7553.126 45.019 0.000 3.25e+05 3.55e+05
C(zipcode)[T.98108] 1.388e+05 8216.014 16.895 0.000 1.23e+05 1.55e+05
C(zipcode)[T.98109] 4.529e+05 1.08e+04 42.082 0.000 4.32e+05 4.74e+05
C(zipcode)[T.98112] 4.922e+05 8374.363 58.779 0.000 4.76e+05 5.09e+05
C(zipcode)[T.98115] 3.442e+05 6212.814 55.397 0.000 3.32e+05 3.56e+05
C(zipcode)[T.98116] 3.124e+05 7311.028 42.724 0.000 2.98e+05 3.27e+05
C(zipcode)[T.98117] 3.397e+05 6274.794 54.142 0.000 3.27e+05 3.52e+05
C(zipcode)[T.98118] 1.888e+05 6396.076 29.523 0.000 1.76e+05 2.01e+05
C(zipcode)[T.98119] 4.266e+05 8897.780 47.945 0.000 4.09e+05 4.44e+05
C(zipcode)[T.98122] 3.388e+05 7536.457 44.956 0.000 3.24e+05 3.54e+05
C(zipcode)[T.98125] 2.142e+05 6677.784 32.071 0.000 2.01e+05 2.27e+05
C(zipcode)[T.98126] 2.174e+05 6999.163 31.056 0.000 2.04e+05 2.31e+05
C(zipcode)[T.98133] 1.668e+05 6314.873 26.417 0.000 1.54e+05 1.79e+05
C(zipcode)[T.98136] 2.705e+05 7690.360 35.174 0.000 2.55e+05 2.86e+05
C(zipcode)[T.98144] 2.592e+05 7143.839 36.277 0.000 2.45e+05 2.73e+05
C(zipcode)[T.98146] 1.256e+05 7473.583 16.805 0.000 1.11e+05 1.4e+05
C(zipcode)[T.98148] 6.504e+04 1.3e+04 5.005 0.000 3.96e+04 9.05e+04
C(zipcode)[T.98155] 1.535e+05 6501.547 23.605 0.000 1.41e+05 1.66e+05
C(zipcode)[T.98166] 1.113e+05 7909.564 14.072 0.000 9.58e+04 1.27e+05
C(zipcode)[T.98168] 7.084e+04 7342.076 9.648 0.000 5.64e+04 8.52e+04
C(zipcode)[T.98177] 2.2e+05 7928.809 27.750 0.000 2.04e+05 2.36e+05
C(zipcode)[T.98178] 6.867e+04 7582.472 9.056 0.000 5.38e+04 8.35e+04
C(zipcode)[T.98188] 4.927e+04 9219.436 5.345 0.000 3.12e+04 6.73e+04
C(zipcode)[T.98198] 2.993e+04 7504.979 3.989 0.000 1.52e+04 4.46e+04
C(zipcode)[T.98199] 3.808e+05 7449.222 51.119 0.000 3.66e+05 3.95e+05
C(viewed)[T.True] 6.675e+04 3125.243 21.360 0.000 6.06e+04 7.29e+04
wf -2.361e-10 4.28e-12 -55.158 0.000 -2.44e-10 -2.28e-10
is_renovated -4.991e-11 3.07e-12 -16.246 0.000 -5.59e-11 -4.39e-11
sca_bedrooms -1330.0650 967.157 -1.375 0.169 -3225.781 565.651
sca_bathrooms 5034.0941 1225.893 4.106 0.000 2631.232 7436.957
sca_sqft_lot 5.434e+04 4253.228 12.775 0.000 4.6e+04 6.27e+04
sca_floors -1.179e+04 993.753 -11.860 0.000 -1.37e+04 -9837.675
sca_condition 1.806e+04 719.439 25.105 0.000 1.67e+04 1.95e+04
sca_grade 5.295e+04 1299.956 40.733 0.000 5.04e+04 5.55e+04
sca_sqft_basement -2.199e+04 1073.124 -20.487 0.000 -2.41e+04 -1.99e+04
sca_sqft_lot15 -1.484e+04 3634.412 -4.084 0.000 -2.2e+04 -7718.224
sca_sqft_living_log 8.957e+04 1687.662 53.073 0.000 8.63e+04 9.29e+04
sca_sqft_living15_log 2.173e+04 1209.974 17.955 0.000 1.94e+04 2.41e+04
Omnibus: 3163.839 Durbin-Watson: 2.002
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11686.546
Skew: 0.825 Prob(JB): 0.00
Kurtosis: 6.509 Cond. No. 1.32e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.25e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Model Interpretation Looks very good except 0.169 P-value for 'sca_bedrooms'. Indicates that changes in the predictor are not associated with changes in the response. Will retry model without 'sca_bedrooms'.

import statsmodels.api as sm
residuals = model.resid
fig = sm.graphics.qqplot(residuals, dist=stats.norm, line='45', fit=True)
fig.show()

png

pred2 = ['C(zipcode)', 'wf',
         'sca_bathrooms', 'sca_sqft_lot', 'sca_floors', 'C(viewed)',
        'sca_condition', 'sca_grade', 'sca_sqft_basement', 'sca_sqft_lot15', 
         'sca_sqft_living_log', 'sca_sqft_living15_log']
f2 = '+'.join(pred2)
f2
'C(zipcode)+wf+sca_bathrooms+sca_sqft_lot+sca_floors+C(viewed)+sca_condition+sca_grade+sca_sqft_basement+sca_sqft_lot15+sca_sqft_living_log+sca_sqft_living15_log'
f ='price~'+f2
model = smf.ols(formula=f, data=df_clean).fit()
model.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.818
Model: OLS Adj. R-squared: 0.817
Method: Least Squares F-statistic: 1057.
Date: Tue, 03 Dec 2019 Prob (F-statistic): 0.00
Time: 12:39:36 Log-Likelihood: -2.3898e+05
No. Observations: 18655 AIC: 4.781e+05
Df Residuals: 18575 BIC: 4.787e+05
Df Model: 79
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.07e+05 4789.776 64.094 0.000 2.98e+05 3.16e+05
C(zipcode)[T.98002] 2.045e+04 8035.818 2.545 0.011 4698.971 3.62e+04
C(zipcode)[T.98003] -2600.5814 7244.829 -0.359 0.720 -1.68e+04 1.16e+04
C(zipcode)[T.98004] 5.346e+05 8494.078 62.942 0.000 5.18e+05 5.51e+05
C(zipcode)[T.98005] 3.419e+05 9011.448 37.937 0.000 3.24e+05 3.6e+05
C(zipcode)[T.98006] 2.81e+05 6840.206 41.080 0.000 2.68e+05 2.94e+05
C(zipcode)[T.98007] 2.61e+05 9110.741 28.644 0.000 2.43e+05 2.79e+05
C(zipcode)[T.98008] 2.361e+05 7384.745 31.973 0.000 2.22e+05 2.51e+05
C(zipcode)[T.98010] 9.306e+04 1.17e+04 7.970 0.000 7.02e+04 1.16e+05
C(zipcode)[T.98011] 1.49e+05 8148.878 18.287 0.000 1.33e+05 1.65e+05
C(zipcode)[T.98014] 1.166e+05 1.11e+04 10.528 0.000 9.49e+04 1.38e+05
C(zipcode)[T.98019] 9.517e+04 8486.392 11.215 0.000 7.85e+04 1.12e+05
C(zipcode)[T.98022] 2221.7311 8585.501 0.259 0.796 -1.46e+04 1.91e+04
C(zipcode)[T.98023] -1.648e+04 6285.409 -2.621 0.009 -2.88e+04 -4155.997
C(zipcode)[T.98024] 1.519e+05 1.43e+04 10.619 0.000 1.24e+05 1.8e+05
C(zipcode)[T.98027] 2.037e+05 6847.688 29.750 0.000 1.9e+05 2.17e+05
C(zipcode)[T.98028] 1.362e+05 7224.429 18.850 0.000 1.22e+05 1.5e+05
C(zipcode)[T.98029] 2.292e+05 7033.320 32.584 0.000 2.15e+05 2.43e+05
C(zipcode)[T.98030] 2511.9496 7370.131 0.341 0.733 -1.19e+04 1.7e+04
C(zipcode)[T.98031] 6560.9382 7235.045 0.907 0.365 -7620.413 2.07e+04
C(zipcode)[T.98032] 4658.3215 9420.378 0.494 0.621 -1.38e+04 2.31e+04
C(zipcode)[T.98033] 3.354e+05 6739.825 49.761 0.000 3.22e+05 3.49e+05
C(zipcode)[T.98034] 1.885e+05 6200.054 30.403 0.000 1.76e+05 2.01e+05
C(zipcode)[T.98038] 3.338e+04 6155.317 5.423 0.000 2.13e+04 4.54e+04
C(zipcode)[T.98039] 6.729e+05 3.18e+04 21.183 0.000 6.11e+05 7.35e+05
C(zipcode)[T.98040] 4.51e+05 8627.175 52.272 0.000 4.34e+05 4.68e+05
C(zipcode)[T.98042] 4210.4220 6175.562 0.682 0.495 -7894.246 1.63e+04
C(zipcode)[T.98045] 1.072e+05 8137.848 13.177 0.000 9.13e+04 1.23e+05
C(zipcode)[T.98052] 2.599e+05 6167.627 42.144 0.000 2.48e+05 2.72e+05
C(zipcode)[T.98053] 2.437e+05 6874.170 35.448 0.000 2.3e+05 2.57e+05
C(zipcode)[T.98055] 4.918e+04 7345.363 6.696 0.000 3.48e+04 6.36e+04
C(zipcode)[T.98056] 1.106e+05 6569.471 16.843 0.000 9.78e+04 1.24e+05
C(zipcode)[T.98058] 3.411e+04 6440.268 5.297 0.000 2.15e+04 4.67e+04
C(zipcode)[T.98059] 9.414e+04 6519.620 14.440 0.000 8.14e+04 1.07e+05
C(zipcode)[T.98065] 1.397e+05 7273.090 19.208 0.000 1.25e+05 1.54e+05
C(zipcode)[T.98070] 1.217e+05 1.41e+04 8.633 0.000 9.41e+04 1.49e+05
C(zipcode)[T.98072] 1.733e+05 7504.368 23.088 0.000 1.59e+05 1.88e+05
C(zipcode)[T.98074] 2.274e+05 6659.087 34.144 0.000 2.14e+05 2.4e+05
C(zipcode)[T.98075] 2.513e+05 7163.978 35.077 0.000 2.37e+05 2.65e+05
C(zipcode)[T.98077] 1.698e+05 9019.517 18.825 0.000 1.52e+05 1.87e+05
C(zipcode)[T.98092] -2.453e+04 7034.595 -3.487 0.000 -3.83e+04 -1.07e+04
C(zipcode)[T.98102] 4.472e+05 1.09e+04 41.193 0.000 4.26e+05 4.68e+05
C(zipcode)[T.98103] 3.446e+05 6237.793 55.236 0.000 3.32e+05 3.57e+05
C(zipcode)[T.98105] 4.155e+05 8312.548 49.980 0.000 3.99e+05 4.32e+05
C(zipcode)[T.98106] 1.474e+05 6995.115 21.065 0.000 1.34e+05 1.61e+05
C(zipcode)[T.98107] 3.402e+05 7552.619 45.041 0.000 3.25e+05 3.55e+05
C(zipcode)[T.98108] 1.389e+05 8216.062 16.903 0.000 1.23e+05 1.55e+05
C(zipcode)[T.98109] 4.531e+05 1.08e+04 42.100 0.000 4.32e+05 4.74e+05
C(zipcode)[T.98112] 4.924e+05 8373.765 58.802 0.000 4.76e+05 5.09e+05
C(zipcode)[T.98115] 3.443e+05 6212.574 55.415 0.000 3.32e+05 3.56e+05
C(zipcode)[T.98116] 3.125e+05 7309.824 42.757 0.000 2.98e+05 3.27e+05
C(zipcode)[T.98117] 3.4e+05 6272.545 54.199 0.000 3.28e+05 3.52e+05
C(zipcode)[T.98118] 1.889e+05 6395.814 29.540 0.000 1.76e+05 2.01e+05
C(zipcode)[T.98119] 4.268e+05 8896.329 47.980 0.000 4.09e+05 4.44e+05
C(zipcode)[T.98122] 3.389e+05 7536.463 44.965 0.000 3.24e+05 3.54e+05
C(zipcode)[T.98125] 2.142e+05 6677.848 32.078 0.000 2.01e+05 2.27e+05
C(zipcode)[T.98126] 2.176e+05 6996.693 31.104 0.000 2.04e+05 2.31e+05
C(zipcode)[T.98133] 1.669e+05 6314.929 26.424 0.000 1.54e+05 1.79e+05
C(zipcode)[T.98136] 2.708e+05 7687.856 35.222 0.000 2.56e+05 2.86e+05
C(zipcode)[T.98144] 2.593e+05 7143.648 36.291 0.000 2.45e+05 2.73e+05
C(zipcode)[T.98146] 1.256e+05 7473.731 16.801 0.000 1.11e+05 1.4e+05
C(zipcode)[T.98148] 6.51e+04 1.3e+04 5.010 0.000 3.96e+04 9.06e+04
C(zipcode)[T.98155] 1.534e+05 6501.541 23.596 0.000 1.41e+05 1.66e+05
C(zipcode)[T.98166] 1.113e+05 7909.731 14.069 0.000 9.58e+04 1.27e+05
C(zipcode)[T.98168] 7.088e+04 7342.175 9.654 0.000 5.65e+04 8.53e+04
C(zipcode)[T.98177] 2.202e+05 7928.432 27.768 0.000 2.05e+05 2.36e+05
C(zipcode)[T.98178] 6.855e+04 7582.217 9.041 0.000 5.37e+04 8.34e+04
C(zipcode)[T.98188] 4.91e+04 9218.799 5.326 0.000 3.1e+04 6.72e+04
C(zipcode)[T.98198] 2.992e+04 7505.149 3.986 0.000 1.52e+04 4.46e+04
C(zipcode)[T.98199] 3.811e+05 7445.980 51.183 0.000 3.67e+05 3.96e+05
C(viewed)[T.True] 6.689e+04 3123.706 21.414 0.000 6.08e+04 7.3e+04
wf 1.771e-11 1.98e-12 8.952 0.000 1.38e-11 2.16e-11
sca_bathrooms 4843.6933 1218.079 3.977 0.000 2456.147 7231.240
sca_sqft_lot 5.439e+04 4253.171 12.787 0.000 4.6e+04 6.27e+04
sca_floors -1.172e+04 992.616 -11.807 0.000 -1.37e+04 -9773.884
sca_condition 1.803e+04 719.079 25.073 0.000 1.66e+04 1.94e+04
sca_grade 5.322e+04 1285.121 41.413 0.000 5.07e+04 5.57e+04
sca_sqft_basement -2.194e+04 1072.645 -20.454 0.000 -2.4e+04 -1.98e+04
sca_sqft_lot15 -1.482e+04 3634.462 -4.078 0.000 -2.19e+04 -7695.695
sca_sqft_living_log 8.858e+04 1525.853 58.051 0.000 8.56e+04 9.16e+04
sca_sqft_living15_log 2.178e+04 1209.230 18.015 0.000 1.94e+04 2.42e+04
Omnibus: 3164.844 Durbin-Watson: 2.001
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11689.897
Skew: 0.825 Prob(JB): 0.00
Kurtosis: 6.509 Cond. No. 1.34e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.88e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Test-Split

df_clean.columns
Index(['price', 'zipcode', 'wf', 'viewed', 'yr_category', 'is_renovated',
       'sca_bedrooms', 'sca_bathrooms', 'sca_sqft_lot', 'sca_floors',
       'sca_view', 'sca_condition', 'sca_grade', 'sca_sqft_basement',
       'sca_sqft_lot15', 'sca_sqft_living_log', 'sca_sqft_living15_log'],
      dtype='object')
y = df_clean[['price']]
X = df_clean[[ 'zipcode','wf', 'sca_bathrooms', 'sca_sqft_lot', 'sca_floors', 'viewed', 'sca_condition',
             'sca_grade', 'sca_sqft_basement', 'sca_sqft_lot15', 'sca_sqft_living_log', 'sca_sqft_living15_log']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test), len(y_train), len(y_test))
14924 3731 14924 3731
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)
r_squared = linreg.score(X_train, y_train)
mse_train = np.sum((y_train-y_hat_train)**2)/len(y_train)
mse_test = np.sum((y_test-y_hat_test)**2)/len(y_test)
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)
print('R Squared:', r_squared)
print('Train Mean Squarred Error:', mse_train)
print('Test Mean Squarred Error:', mse_test)
print('Train Root Mean Squarred Error:', rmse_train)
print('Test Root Mean Squarred Error:', rmse_test)
R Squared: 0.4826076578126131
Train Mean Squarred Error: price    2.215490e+10
dtype: float64
Test Mean Squarred Error: price    2.205471e+10
dtype: float64
Train Root Mean Squarred Error: price    148845.220082
dtype: float64
Test Root Mean Squarred Error: price    148508.288357
dtype: float64

INTERPRET & RECOMMENDATIONS

Using the final model can predict house values with approximately 80% confidence of variability of the response data around its mean.

The most important variables when trying to improve the value of a home are: zipcode, square feet of living space, grade, and square feet of lot.

For future work, it would be interesting to see how school district influences prices of homes. Other future projects might include looking at how home values fluctuate over a period of several years and how the economy has an influence on when and how much people are willing to pay for a home.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 100.0%