- Student name: Tamjid Ahsan
- Student pace: Full Time
- Scheduled project review date/time: Apr 28, 2021, 05:00 PM [DST]
- Instructor name: James Irving
- Blog post URL: TBA
King County is located in the U.S. state of Washington. The population was 2,252,782 in the 2019 census estimate, making it the most populous county in Washington, and the 12th-most populous in the United States. The county seat is Seattle, also the state's most populous city. The county is named to honor civil rights leader Martin Luther King Jr.[1]
The county sees a median listing home price of USD 700K (USD 431 per sq ft) and a median sold home price of USD 766K, with around 6,000 homes listed for sale and 2,800 homes for rent.[2] Homes in King County, WA sold for 2.19% above asking price on average in March 2021, with a median of 36 days on market.
In March 2021, King County home prices were up 15.4% compared to the previous year. There were 3,198 homes sold in March 2021, up from 2,656 the year before.[3]
This makes the county a prime real estate market and a paramount candidate for analysis. The county named after Martin Luther King Jr. has a very lucrative real estate market worth exploring.
Source: image generated by the author using Tableau Public and an online GIF maker.
King County residents may want to renovate their homes to increase resale value, but often don't know which factors are important for determining a home's value. While focusing on features they can renovate, I shall discuss key factors of a home's price. These factors can be both within their control and beyond it.
Those include:
- How to improve marketability.
- Which aspects of the house to focus on.
- What factors to keep in mind when deciding the budget and required return on investment.
I am going to loosely follow the OSEMN framework for this analysis.
Data for this analysis was provided as part of the phase two project of The Flatiron School Full Time Online Data Science program; this is a fork of that project. The CSV file named kc_house_data.csv contains the following information and is located in this repository at ./data/kc_house_data.csv.
The GeoJSON file used for the maps is provided by Open Data Delaware; a copy can be found in this repository at ./data/wa_washington_zip_codes_geo.min.json.
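Loading the CSV follows the standard pandas pattern. The sketch below keeps the snippet self-contained by reading a tiny inline sample (two rows taken from the preview table further down); in the repository the path ./data/kc_house_data.csv would be passed directly.

```python
import io

import pandas as pd

# In the repository this would simply be:
#   df = pd.read_csv('./data/kc_house_data.csv')
# A tiny inline sample keeps the snippet self-contained.
csv_text = (
    "id,date,price,bedrooms\n"
    "5285200020,7/18/2014,349950.0,2\n"
    "1924059254,5/8/2015,1300000.0,5\n"
)
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (rows, columns)
```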
As per the readme file accompanying the dataset:
- id - unique identifier for a house
- date - Date house was sold
- price - Price (prediction target)
- bedrooms - Number of Bedrooms/House
- bathrooms - Number of bathrooms/bedrooms
- sqft_living - square footage of the home
- sqft_lot - square footage of the lot
- floors - Total floors (levels) in house
- waterfront - House which has a view to a waterfront
- view - Has been viewed
- condition - How good the condition is ( Overall )
- grade - overall grade given to the housing unit, based on King County grading system
- sqft_above - square footage of house apart from basement
- sqft_basement - square footage of the basement
- yr_built - Built Year
- yr_renovated - Year when house was renovated
- zipcode - zip
- lat - Latitude coordinate
- long - Longitude coordinate
- sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
- sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors
The explanation and data contained in view do not make real-life sense. It is very unlikely that so many houses were sold without any viewing; even keeping in mind that some folks buy houses for investment purposes, some kind of representative still visits and facilitates the buying process. All other features are consistent with their given explanations, with some extreme outliers, e.g. one house has 33 bedrooms with a disproportionately small sqft.
Ten random samples from the dataset:
 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
17535 | 5285200020 | 7/18/2014 | 349950.0 | 2 | 1.75 | 1640 | 4176 | 1.0 | 0.0 | 0.0 | 4 | 7 | 1040 | 600.0 | 1948 | 0.0 | 98118 | 47.5162 | -122.260 | 1460 | 5200 |
8411 | 1924059254 | 5/8/2015 | 1300000.0 | 5 | 3.75 | 3490 | 15246 | 1.0 | 0.0 | 1.0 | 4 | 10 | 1940 | 1550.0 | 1968 | 0.0 | 98040 | 47.5479 | -122.212 | 3410 | 15682 |
7393 | 1788300020 | 7/8/2014 | 183500.0 | 3 | 1.00 | 1040 | 8892 | 1.0 | 0.0 | 0.0 | 4 | 6 | 800 | 240.0 | 1958 | 0.0 | 98023 | 47.3273 | -122.349 | 820 | 9000 |
582 | 2998800125 | 7/1/2014 | 730000.0 | 2 | 2.25 | 2130 | 4920 | 1.5 | NaN | 4.0 | 4 | 7 | 1530 | 600.0 | 1941 | 0.0 | 98116 | 47.5730 | -122.409 | 2130 | 4920 |
16843 | 6371000020 | 11/11/2014 | 380000.0 | 2 | 2.00 | 1120 | 780 | 2.0 | 0.0 | 0.0 | 3 | 8 | 760 | 360.0 | 2004 | 0.0 | 98116 | 47.5788 | -122.410 | 1120 | 1322 |
6076 | 8682231330 | 5/4/2015 | 519000.0 | 2 | 2.00 | 1560 | 4823 | 1.0 | NaN | 0.0 | 3 | 8 | 1560 | 0.0 | 2004 | 0.0 | 98053 | 47.7111 | -122.032 | 1855 | 4989 |
8444 | 6705870120 | 7/1/2014 | 739900.0 | 5 | 2.50 | 3290 | 5029 | 2.0 | 0.0 | 0.0 | 3 | 8 | 3290 | 0.0 | 2004 | 0.0 | 98075 | 47.5773 | -122.056 | 2990 | 6441 |
1505 | 2214800270 | 9/25/2014 | 355000.0 | 4 | 2.50 | 2770 | 7000 | 1.0 | 0.0 | 0.0 | 4 | 7 | 1940 | 830.0 | 1979 | 0.0 | 98001 | 47.3396 | -122.256 | 2140 | 7684 |
3473 | 7504010900 | 9/26/2014 | 598500.0 | 3 | 2.25 | 2520 | 12000 | 1.0 | NaN | 0.0 | 3 | 10 | 2520 | 0.0 | 1978 | 0.0 | 98074 | 47.6381 | -122.062 | 2510 | 12000 |
7576 | 3649100306 | 5/6/2014 | 379900.0 | 4 | 1.75 | 1500 | 11600 | 1.0 | NaN | 0.0 | 4 | 7 | 1000 | 500.0 | 1955 | NaN | 98028 | 47.7373 | -122.241 | 1740 | 11600 |
First I checked the data types.
id int64
date object
price float64
bedrooms int64
bathrooms float64
sqft_living int64
sqft_lot int64
floors float64
waterfront float64
view float64
condition int64
grade int64
sqft_above int64
sqft_basement object
yr_built int64
yr_renovated float64
zipcode int64
lat float64
long float64
sqft_living15 int64
sqft_lot15 int64
dtype: object
Observation:
- `date` is recognized as object; I have to convert it to a datetime object for it to be useful.
- Why is `bathrooms` a float? Upon inspection, I saw that half bathrooms are inputted as .5 in addition to the full-sized bathrooms, and .25 for even smaller ones.
- The above observation repeats for `floors`: partial floors are recorded as .5.
- `waterfront` is a boolean feature, yet is recognized as float.
- `view` holds integer values indicating a categorical feature, yet is recognized as float.
- `sqft_basement` is recognized as object. Upon inspection I found that it contains some str '?' values.

Everything else is good.
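A minimal sketch of the two dtype conversions flagged above, shown on an illustrative two-row frame (replacing '?' with '0' here anticipates the fill-with-0 decision made later):

```python
import pandas as pd

# Illustrative frame showing the two problem columns.
df = pd.DataFrame({'date': ['7/18/2014', '5/8/2015'],
                   'sqft_basement': ['600.0', '?']})

# date arrives as object -> convert to a proper datetime dtype
df['date'] = pd.to_datetime(df['date'])

# sqft_basement arrives as object because of stray '?' strings;
# replace them, then cast to float
df['sqft_basement'] = df['sqft_basement'].replace('?', '0').astype(float)
```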
Then checked for null values:
 | name | is_null | not_null
---|---|---|---
0 | id | 0 | 21597 |
1 | date | 0 | 21597 |
2 | price | 0 | 21597 |
3 | bedrooms | 0 | 21597 |
4 | bathrooms | 0 | 21597 |
5 | sqft_living | 0 | 21597 |
6 | sqft_lot | 0 | 21597 |
7 | floors | 0 | 21597 |
8 | waterfront | 2376 | 19221 |
9 | view | 63 | 21534 |
10 | condition | 0 | 21597 |
11 | grade | 0 | 21597 |
12 | sqft_above | 0 | 21597 |
13 | sqft_basement | 0 | 21597 |
14 | yr_built | 0 | 21597 |
15 | yr_renovated | 3842 | 17755 |
16 | zipcode | 0 | 21597 |
17 | lat | 0 | 21597 |
18 | long | 0 | 21597 |
19 | sqft_living15 | 0 | 21597 |
20 | sqft_lot15 | 0 | 21597 |
Found some `NaN` values, with my reasoning why it might be like this:
- `waterfront`: missing data.
- `view`: missing data.
- `yr_renovated`: missing data, or the house was never renovated?
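The null tally shown above can be produced with a short helper; here is a sketch on a toy frame standing in for the housing data:

```python
import pandas as pd

# Toy frame standing in for the housing data.
df = pd.DataFrame({'waterfront': [0.0, None, 1.0],
                   'price': [221900.0, 538000.0, 180000.0]})

# Count missing vs present values per column, mirroring the table above.
null_report = pd.DataFrame({'is_null': df.isna().sum(),
                            'not_null': df.notna().sum()})
print(null_report)
```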
Next comes duplicate check.
 | name | duplicated | not_duplicated
---|---|---|---
0 | id | 177 | 21420 |
1 | date | 21225 | 372 |
2 | price | 17975 | 3622 |
3 | bedrooms | 21585 | 12 |
4 | bathrooms | 21568 | 29 |
5 | sqft_living | 20563 | 1034 |
6 | sqft_lot | 11821 | 9776 |
7 | floors | 21591 | 6 |
8 | waterfront | 21594 | 3 |
9 | view | 21591 | 6 |
10 | condition | 21592 | 5 |
11 | grade | 21586 | 11 |
12 | sqft_above | 20655 | 942 |
13 | sqft_basement | 21293 | 304 |
14 | yr_built | 21481 | 116 |
15 | yr_renovated | 21526 | 71 |
16 | zipcode | 21527 | 70 |
17 | lat | 16564 | 5033 |
18 | long | 20846 | 751 |
19 | sqft_living15 | 20820 | 777 |
20 | sqft_lot15 | 12915 | 8682 |
Here are my observations:
feature | duplicated | not duplicated | Observation |
---|---|---|---|
id | 177 | 21420 | possible duplication. inspect it further to comment |
date | 21225 | 372 | |
price | 17975 | 3622 | indicates continuous feature |
bedrooms | 21585 | 12 | indicates categorical feature |
bathrooms | 21568 | 29 | indicates categorical feature |
sqft_living | 20563 | 1034 | indicates continuous feature |
sqft_lot | 11821 | 9776 | indicates continuous feature |
floors | 21591 | 6 | indicates categorical feature |
waterfront | 21594 | 3 | indicates categorical feature |
view | 21591 | 6 | indicates categorical feature |
condition | 21592 | 5 | indicates categorical feature |
grade | 21586 | 11 | indicates categorical feature |
sqft_above | 20655 | 942 | indicates continuous feature |
sqft_basement | 21293 | 304 | indicates continuous feature |
yr_built | 21481 | 116 | indicates continuous feature |
yr_renovated | 21526 | 71 | |
zipcode | 21527 | 70 | dataset contains data of 70 zipcodes |
lat | 16564 | 5033 | this is location info |
long | 20846 | 751 | this is location info |
sqft_living15 | 20820 | 777 | indicates continuous feature |
sqft_lot15 | 12915 | 8682 | indicates continuous feature |
Then I inspected the `id` column and found out that some houses were sold more than once, hence the duplication.
Looking at the mean, mode, and quantiles, I found nothing that contradicts my previous observations. I got some sense of the distributions by looking at the min and max. For example, price observations are widely distributed, as are all the sqft features.
I observed an unusual max in bedrooms, which I isolate and inspect in the next cell.
 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
15856 | 2402100895 | 6/25/2014 | 640000.0 | 33 | 1.75 | 1620 | 6000 | 1.0 | 0.0 | 0.0 | 5 | 7 | 1040 | 580.0 | 1947 | 0.0 | 98103 | 47.6878 | -122.331 | 1330 | 4700 |
Observation: A bedroom count of 33 does not make sense relative to 1.75 bathrooms, living and lot sizes of 1620 and 6000 sqft, or the sqft above. This indicates an erroneous input.
Dropped duplicates from the `id` feature, keeping the first occurrence:
I could also have kept those rows by looking into them individually to make sure there are no actual duplicate values. As the data loss is not significant, I am dropping them, keeping the first occurrence. Alternatively, I could have kept the record of the last sale date, which would require additional work filtering on date and other features like price before dropping; I am not venturing down that path for this analysis.
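The drop itself is a one-liner with pandas; sketched here on a toy frame where house id 1 was sold twice:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2],
                   'price': [100000, 120000, 300000]})

before = len(df)
# Keep only the first occurrence of each house id.
df = df.drop_duplicates(subset='id', keep='first')
print(f'dropped {before - len(df)} repeat sales')
```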
Next:
- Converted `date` to datetime dtype.
- Replaced missing (NaN and "?") values with 0 in `waterfront`, `view`, `yr_renovated`, and `sqft_basement`.
My reasoning for filling with 0 is that it is relatively safe to assume the house does not have a waterfront, was not renovated, and does not have a basement. Alternatively, I could simply have dropped those rows from the dataset. Filling with the mean or mode has the same drawback as filling with 0 because of the pattern of missing values, so I filled in with 0.
Then I dropped the row where the bedroom count was 33. It is clearly a data-entry mistake, as it does not match the sqft of that house. I could have changed it to 3, but did not, as the data loss is negligible.
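The two cleaning steps described above, sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'waterfront': [None, 1.0],
                   'view': [None, 2.0],
                   'yr_renovated': [None, 1990.0],
                   'bedrooms': [33, 3]})

# Fill NaN with 0: assume no waterfront, never viewed, never renovated.
for col in ['waterfront', 'view', 'yr_renovated']:
    df[col] = df[col].fillna(0)

# Drop the 33-bedroom row flagged as a data-entry error.
df = df[df['bedrooms'] != 33]
```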
After this, no `NaN` values or duplicates remain in the dataset, and the possible error in `bedrooms` has been dropped.
Created the ratio of each house's price over its total sqft.
Created a boolean feature checking whether the total sqft of a house is within the 40% to 60% range of the average total sqft of the surrounding 15 houses, to detect unusual homes in the neighborhood.
Out of 21,419 values, 20,679 are empty, while only 740 rows contain a renovation year. This is not useful for the model, so the feature was converted to a boolean, where 1 means the house was renovated and 0 means it was not.
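A sketch of the derived features described above; note that taking sqft_above + sqft_basement as the total-sqft denominator is my assumption of what "total sqft" means here:

```python
import pandas as pd

df = pd.DataFrame({'price': [500000.0],
                   'sqft_above': [1500],
                   'sqft_basement': [500.0],
                   'yr_renovated': [0.0]})

# Price per total square foot (total taken as above + basement here).
df['price_per_sqft'] = df['price'] / (df['sqft_above'] + df['sqft_basement'])

# Collapse the sparse yr_renovated column into a renovated flag.
df['is_renovated'] = (df['yr_renovated'] > 0).astype(int)
```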
Feature | Observation |
---|---|
bathrooms | High density between 1 to 4 |
bedrooms | High density between 2 to 4 |
condition | Majority falls under 3 category |
date | Sale volume was fairly consistent |
floors | Most of the houses are either 1 or 2 storeys |
grade | High density between 6 to 10 |
is_renovated | Most of the house were never renovated |
long | Not meaningful alone; just numbers representing location |
price | Very high distribution with a lot of outliers |
price_per_sqft | Almost normal looking except a very long right tail, starting at around $180 |
sqft_basement | Most of the houses do not have a finished basement |
sqft_living | Almost normal looking except a very long right tail, starting at around 5000 |
sqft_living15 | Almost normal looking except a very long right tail, starting at around 4000 |
sqft_lot15 | Consists of very large outliers |
view | Most of the values are 0 |
waterfront | Most of the houses do not have a waterfront |
yr_built | House building was fairly consistent, except for a few periods of nationwide economic crisis (see Price VS year built) |
yr_renovated | |
zipcode | Varying frequency of sale observed among the zipcodes |
distance_from_downtown_mile | Most of the houses fall within 25 miles |
id | Not meaningful; just numbers representing house id |
lat | Not meaningful alone; just numbers representing location |
sqft_above | Almost normal looking except a very long right tail, starting at around 4000 |
sqft_lot | Outliers have very large value |
total_sqft_larger_than_neighbours | Very few houses are outliers as defined previously |
This plot helped me to better understand the data. Outliers will be dealt with more rigorously later in the analysis.
Based on information from the readme file attached to the dataset:
# House identifier
unique_feat_df = ['id']
# Categorical features
categorical_feat_df = ['waterfront', 'view', 'condition', 'grade', 'is_renovated']
# Numeric features
numerical_continuous_feat_df = [
    'price', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement',
    'sqft_living15', 'sqft_lot15', 'price_per_sqft',
    'distance_from_downtown_mile', 'total_sqft_larger_than_neighbours']
numerical_discrete_feat_df = ['bedrooms', 'bathrooms', 'floors']
# Timing features
time_feat_df = ['date', 'yr_built', 'yr_renovated']
# Location feat
location_feat_df = ['lat', 'long', 'zipcode']
In the DataFrame for the initial model, `id`, `date`, `yr_renovated`, and `view` are not included. My reasoning is as follows:
- `id`: This is the identifier column.
- `date`: Sale date is not an important input for now, as I am not capturing seasonality of value in this analysis.
- `yr_renovated`: Already converted to a boolean column.
- `view`: Not sure what this feature really means, as not much information is provided. It could be either categorical or numeric.
Here is a heat map of correlation.
Observed some interesting relationship among features. Few highlights are:
- Distance has a negative relationship with price.
- sqft_living and sqft_above are positively correlated with grade.
- sqft_living is highly positively correlated with price and bathrooms.
- sqft_above and sqft_living, and grade and sqft_living, clearly have a collinearity issue.
- sqft_above and grade, sqft_living and sqft_living15, and bathrooms and sqft_living are just crossing the .75 mark.
Leaving the collinear features as they are for now; I will drop them later.
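Pairs crossing a correlation threshold can be listed programmatically; below is a sketch on synthetic data (the two correlated columns are simulated to mimic the sqft_living vs sqft_above relationship, not taken from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=300)
df = pd.DataFrame({'sqft_living': base,
                   'sqft_above': base + rng.normal(scale=0.3, size=300),
                   'price': rng.normal(size=300)})

# Pairwise correlations with the diagonal masked; each pair appears twice
# (A,B and B,A), which is harmless for a quick scan.
corr = df.corr(numeric_only=True)
pairs = (corr.where(~np.eye(len(corr), dtype=bool))
             .stack()
             .sort_values(ascending=False))
high = pairs[pairs.abs() > 0.75]  # candidates to drop later
```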
Now I tried to make sense of `view`. It seems to capture the aesthetic quality of the house. I am not sure whether it is something an owner can manipulate, e.g. architectural design, landscaping, and decoration, or whether it is natural, e.g. closeness to a park or the view from a window.
Higher-rated houses have the greatest average price. This trend is very clear.
It seems that houses closer to downtown have a higher price per sqft but a lower view rating: the city landscape is not pristine, but proximity to the business center makes them desirable. In the next distance band the price drops as distance increases; people are not paying a premium for sub-par property. At greater distances we are out of downtown, in less congested areas where the view is better, hence an increase in price. In the last band we are in the suburbs, where the price drops because of the distance.
This feature is still sufficiently ambiguous that I hesitate to present it to the public; I might be interpreting it from a myopic view and have gotten it entirely wrong.
Houses built in the '40s to '60s hold good value. There is a weird dip at the beginning of the '70s, maybe related to a quirk of the dataset or the energy crisis of that time; more domain research is needed to comment. High sale frequency in the early '00s leads to a dip in 2008 because of the global financial crisis driven by the subprime mortgage situation: house prices were in a bubble that triggered the crisis, and this data supports that. The house price and sale data coincide with these financial crises, but across the board house prices are stable. I would have to look into sale dates to comment further.
I can spot a pattern reversal after a bedroom count of 5. Increased bedroom count has diminishing returns after that threshold: the regression line flattens and its error widens. Another point I noticed is that one-bedroom houses have a very flat line, so anything beyond one bedroom adds value to the house.
Although houses with more bedrooms have a higher average sale price, this is because they are usually larger properties. I could bin house size and go down that path to pinpoint this, but I am choosing not to because of time constraints.
It is a no-brainer that houses closest to the city center have higher values. This chart helped me spot some outliers, later confirmed by looking at the map. Setting an arbitrary 40-mile cap, guided by the plot above, I found 18 outliers.
I shall remove these from the dataset in a later step where I remove outliers.
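The distance_from_downtown_mile feature used throughout can be computed with a haversine formula. The downtown reference coordinates below are my assumption (approximately downtown Seattle), not taken from the notebook:

```python
import numpy as np

# Assumed reference point: downtown Seattle (approximate coordinates).
DOWNTOWN_LAT, DOWNTOWN_LONG = 47.6062, -122.3321

def distance_from_downtown_mile(lat, long):
    """Great-circle (haversine) distance in miles from the reference point."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = np.radians(lat), np.radians(DOWNTOWN_LAT)
    dphi = np.radians(DOWNTOWN_LAT - lat)
    dlmb = np.radians(DOWNTOWN_LONG - long)
    a = (np.sin(dphi / 2) ** 2
         + np.cos(phi1) * np.cos(phi2) * np.sin(dlmb / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))
```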
Most expensive houses are centered around downtown Seattle.
Not all zipcodes have similar values. This makes zipcode an important feature to include in the regression model.
This is just a continuation of the previous map. Here I used price per sqft instead of total home price, so that bigger houses do not skew the numbers; now all houses are on the same scale. This also sets a bar on what price to expect when selling.
Then I checked the performance of IQR-based and Z-score-based outlier removal. Based on the shape of the results, IQR-based removal seemed more appropriate here.
From the chart, potential candidates for formula-based outlier removal are:
- price
- bathrooms
- sqft_living
- sqft_lot
- floors
- condition
- sqft_above
- sqft_basement
And for manual outlier removal:
- distance_from_downtown_mile
One sample of that process:
Across the board, outlier removal using the IQR is the winner. The stats for it are:
Dropping data
________________________________________________________________________________________________________________________________________________________________
Price
Data dropped: 1151
Data loss : 5.3737%
Bathrooms
Data dropped: 99
Data loss : 0.4885%
Sqft_living
Data dropped: 272
Data loss : 1.3486%
Sqft_lot
Data dropped: 2089
Data loss : 10.4991%
Floors
Data dropped: 0
Data loss : 0.0%
Condition
Data dropped: 20
Data loss : 0.1123%
Sqft_above
Data dropped: 449
Data loss : 2.5242%
Sqft_basement
Data dropped: 359
Data loss : 2.0705%
________________________________________________________________________________________________________________________________________________________________
New outlier removed dataset length: 16980
Old dataset length: 21419
Total data loss: 23.22%
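The per-feature IQR removal reported above can be sketched as a small helper. The 1.5 multiplier is the conventional Tukey fence; the exact value used in the notebook is assumed:

```python
import pandas as pd

def drop_iqr_outliers(df, cols, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR] for each listed column."""
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        before = len(df)
        df = df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
        print(f'{col}: dropped {before - len(df)} rows')
    return df

# toy example: one extreme price should be removed
toy = pd.DataFrame({'price': list(range(100)) + [10_000]})
trimmed = drop_iqr_outliers(toy, ['price'])
```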
There are a few extremely large values in the distance_from_downtown feature, which were removed. The threshold was that houses more than 40 miles away were dropped; there were 18 such outliers. So the final dataset size is 19975.
I used this function for the majority of my regression analysis.
def OLS_sm(df,
dependant_var='price',
numeric_features=[],
categorical_features=[],
verbose=False,
show_summary=True,
show_plots=True,
target_is_dollar=False):
"""
### Uses formula based statsmodels regression for OLS. ###
Displays a statsmodels.iolib.summary.Summary object containing summary of OLS analysis.
Returns a statsmodels.regression.linear_model.RegressionResultsWrapper which can be used to access other options available.
Parameters:
===========
df = pandas.DataFrame; no default.
Input dataset to use for OLS.
dependant_var = str; default: 'price'.
Dependent variable.
numeric_features = list; default = [].
Identify numeric features.
categorical_features = list; default = [].
Identify categorical features.
verbose = boolean; default: False.
Shows the formula used and drop information.
`True` shows information.
`False` does not show information.
show_summary = boolean; default: True.
Shows summary report.
`True` shows information.
`False` does not show information.
show_plots = boolean; default: True.
Shows summary and Homoscedasticity information.
`True` shows information.
`False` does not show information.
target_is_dollar = boolean; default: False.
Modify chart axis label.
`True` modify information.
`False` does not modify information.
Under-The-Hood:
===============
--{Major Steps}--
## Regression
cate = ' + '.join([f'C({x})' for x in categorical_features])
nume = ' + '.join([f'{x}' for x in numeric_features])
formula = f'{dependant_var} ~ {nume} + {cate}'
multiple_regression = smf.ols(formula=formula, data=df).fit()
## plots
# plot on the left
sm.qqplot(multiple_regression.resid,
dist=stats.norm,
line='45',
fit=True,
ax=ax1)
# plot on the right
ax2.scatter(x=multiple_regression.fittedvalues,
y=multiple_regression.resid,
s=4,
color='gold')
Note:
=====
Make sure that every column in the DataFrame has the correct dtype.
Numeric values stored as str (i.e., object) will make statsmodels assume that those are categorical variables.
If errors occur, check df to see whether the passed feature is available in the DataFrame.
Issues:
-------
- Output control is not clear.
Changelog:
----------
- changed `resid`, was using `resid_pearson`.
- docstring update.
Next Step:
----------
- Add isolated high p_vals output based on thresh hold with verbosity display control.
-- ver: 1.3.1 --
"""
cate = ' + '.join([f'C({x})' for x in categorical_features])
nume = ' + '.join([f'{x}' for x in numeric_features])
if len(cate)==0:
formula = f'{dependant_var} ~ {nume}'
else:
formula = f'{dependant_var} ~ {nume} + {cate}'
print('Formula for the OLS model: ', formula)
# OLS regressor
multiple_regression = smf.ols(formula=formula, data=df).fit()
if verbose:
show_summary = True
show_plots = True
if show_summary:
display(multiple_regression.summary())
if show_plots:
# plotting
# plot 1
fig, (ax1,
ax2) = plt.subplots(ncols=2,
figsize=(10, 5),
gridspec_kw={'width_ratios': [0.6, 0.4]})
sm.qqplot(multiple_regression.resid,
dist=stats.norm,
line='45',
fit=True,
ax=ax1)
ax1.set_title('Q-Q Plot', fontdict={"size": 15})
# plot 2
# uses The predicted values for the original (unwhitened) design.
ax2.scatter(x=multiple_regression.fittedvalues,
y=multiple_regression.resid,
s=4,
color='gold')
if target_is_dollar:
ax2.yaxis.set_major_formatter(format_number)
ax2.set(xlabel='Predicted', ylabel='Residuals')
ax2.axhline(y=0, c='r', lw=4, ls='--')
ax2.set_title('Predicted VS Residuals', fontdict={"size": 15})
plt.suptitle('Visual Check of Residuals for Homoscedasticity',
ha='center',
va='bottom',
fontdict={"size": 25})
plt.tight_layout()
if verbose == False and show_summary == False and show_plots == True:
print('r_sq:', round(multiple_regression.rsquared, 4))
return multiple_regression
With the outlier-dropped data:
Formula for the OLS model: price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + sqft_above + sqft_basement + yr_built + lat + long + sqft_living15 + sqft_lot15 + distance_from_downtown_mile + price_per_sqft + C(waterfront) + C(condition) + C(grade) + C(is_renovated) + C(zipcode) + C(total_sqft_larger_than_neighbours)
r_sq: 0.8654
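For readers unfamiliar with the formula API used by OLS_sm, here is a self-contained miniature of the same pattern on synthetic data (the toy frame and its columns are illustrative only, not results from the dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'price': rng.normal(500_000, 100_000, 200),
    'sqft_living': rng.normal(2_000, 500, 200),
    'grade': rng.integers(4, 12, 200),
})

# C(...) marks a column as categorical, exactly as in the formula above.
model = smf.ols('price ~ sqft_living + C(grade)', data=toy).fit()
print('r_sq:', round(model.rsquared, 4))
```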
With the outlier-removed data, r squared saw an improvement to the very high score of 0.8654. Still, there are 6 features with high p-values in the model: grade, bedrooms, zipcode, sqft_basement, lat, and long; and the residual plot still fails the homoscedasticity check, although it is a bit better than before. Compared to the previous model, condition no longer has a problematic p-value; bedrooms takes its place, with a much larger p-value of .55 compared to condition's .08 in the previous model.
Overall, not a good model, but heading in the right direction.
It has 6 features with high p-values, and the residual plots are abnormal, indicating the residuals are not normally distributed. To tackle this, I shall try scaling the data to see whether there is room for improvement.
I can spot some features with high collinearity. In the next cell I have them spelled out.
Positive correlations:
index feature_combo correlation
0 72 sqft_above and sqft_living 0.828983
1 78 sqft_living15 and sqft_living 0.701438
2 100 sqft_lot and sqft_lot15 0.694961
3 45 bathrooms and sqft_living 0.693803
4 204 sqft_above and sqft_living15 0.676662
5 324 sqft_living15 and sqft_above 0.676662
6 177 sqft_above and grade 0.647676
7 71 sqft_living and grade 0.644038
8 311 distance_from_downtown_mile and long 0.639891
9 24 sqft_living and bedrooms 0.613042
----------------------------------------------------------------------
Negative correlations:
index feature_combo correlation
0 102 sqft_lot and price_per_sqft -0.702066
1 290 distance_from_downtown_mile and lat -0.633190
2 266 long and zipcode -0.580954
3 269 distance_from_downtown_mile and zipcode -0.558346
4 375 distance_from_downtown_mile and price_per_sqft -0.528996
5 354 price_per_sqft and sqft_lot15 -0.506349
6 17 distance_from_downtown_mile and price -0.413736
7 89 sqft_lot and floors -0.380494
8 158 yr_built and condition -0.369611
9 243 yr_built and zipcode -0.344854
- `sqft_above` and `sqft_living` are beyond .75.
- `sqft_living15` and `sqft_living`, and `price_per_sqft` and `sqft_lot`, are around .70.
Setting my threshold at .75 for dropping features, I decided to drop `sqft_above` from the dataset. I am also dropping the redundant location features `lat` and `long`. As for `price_per_sqft`, I dropped it after gauging its impact on the model.
# list of features for throw away models
categorical_feat_ = ['waterfront', 'condition', 'grade']
numeric_feat = [
'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
'sqft_basement', 'yr_built', 'sqft_living15', 'sqft_lot15',
'distance_from_downtown_mile', 'total_sqft_larger_than_neighbours',
'is_renovated'
]
Throwaway model, sans `price_per_sqft`:
Formula for the OLS model: price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + sqft_basement + yr_built + sqft_living15 + sqft_lot15 + distance_from_downtown_mile + total_sqft_larger_than_neighbours + is_renovated + C(waterfront) + C(condition) + C(grade)
r_sq: 0.6519
Throwaway model, with `price_per_sqft`:
Formula for the OLS model: price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + sqft_basement + yr_built + sqft_living15 + sqft_lot15 + distance_from_downtown_mile + total_sqft_larger_than_neighbours + is_renovated + price_per_sqft + C(waterfront) + C(condition) + C(grade)
r_sq: 0.7941
By comparing these two throwaway models, I can see that `price_per_sqft` is leaking information (the r squared difference is almost .15, although the residual plot looks a little better), as it was derived from the target.
I am not comfortable leaving this in the model, thus dropping it for now.
Positive correlations:
index feature_combo correlation
0 63 sqft_living15 and sqft_living 0.701438
1 81 sqft_lot and sqft_lot15 0.694961
2 37 bathrooms and sqft_living 0.693803
3 59 sqft_living and grade 0.644038
4 20 sqft_living and bedrooms 0.613042
5 148 sqft_living15 and grade 0.605869
6 42 bathrooms and grade 0.584765
7 8 price and grade 0.559627
8 44 yr_built and bathrooms 0.551460
9 3 price and sqft_living 0.540199
----------------------------------------------------------------------
Negative correlations:
index feature_combo correlation
0 201 distance_from_downtown_mile and zipcode -0.558346
1 14 distance_from_downtown_mile and price -0.413736
2 73 sqft_lot and floors -0.380494
3 129 yr_built and condition -0.369611
4 181 yr_built and zipcode -0.344854
5 94 sqft_basement and floors -0.287901
6 98 floors and sqft_lot15 -0.287411
7 199 sqft_living15 and zipcode -0.282860
8 92 floors and condition -0.282272
9 79 sqft_lot and zipcode -0.254264
No multicollinear features remain based on a threshold of .75.
On all data
This can be somewhat helpful for selecting features. It is not a foolproof way of selecting features, as it does not capture the combined explanatory power of features; it is not even the preferred way. It can be used to come up with a preliminary model and iterate upon it.
- `price` has a perfect relationship with itself and can explain it fully; this is very much expected. The rest of the features are of interest.
- `grade`, `sqft_living`, and `zipcode` have very good explanatory power and should be present in the model.
- `sqft_above`, `sqft_living15`, and `bathrooms` come next.
- The rest of them are also important, but not as much.

All of them have a good relationship with `price`, as indicated by beta.
This can be matched with domain knowledge for further impact.
Note on p_val:
- The level of statistical significance. A predictor with a low p-value is likely to be a meaningful addition to the model, because changes in the predictor's value are related to changes in the response variable. Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response, indicating it can be dropped.
- Lower is better.
Note on beta:
- Measures the impact of a change in the independent variable upon the dependent variable. Both positive and negative values can provide significant insights about the relationship. Values centered around 0 do not indicate a relationship, and such features can be dropped.
Note on Jarque–Bera test:
- The Jarque–Bera test is a goodness-of-fit test of whether sample data have skewness and kurtosis matching a normal distribution. Lower is better.
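The statistic can be computed directly from model residuals with scipy; a sketch on simulated normal data:

```python
import numpy as np
from scipy.stats import jarque_bera

rng = np.random.default_rng(1)
residuals = rng.normal(size=1000)

# For normal data the statistic is small and the p-value is large,
# so normality is not rejected.
stat, p = jarque_bera(residuals)
```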
Here I can spot some counterintuitive results, judging by common sense and domain knowledge. According to this, `grade` and `condition` are not significant, as they have very high p-values; yet `grade` has good explanatory power while `condition` does not.
`zipcode` has 70 categories inside with varying params for each, so its having high p-values is understandable.
Up next, I am looking at the features in isolation, which might be causing this; their combined effect can be much different.
Time for multivariate OLS!
With the insight from everything I did before, I came up with a bag of features for my multivariate model.
Formula for the OLS model: price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + sqft_basement + yr_built + sqft_living15 + sqft_lot15 + distance_from_downtown_mile + C(waterfront) + C(condition) + C(grade) + C(is_renovated) + C(total_sqft_larger_than_neighbours) + C(zipcode)
r_sq: 0.826
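A model like the one above can be fit with statsmodels' formula API. Here is a minimal, self-contained sketch of the same pattern, using a toy DataFrame with a subset of assumed column names in place of the cleaned King County data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data standing in for the cleaned King County frame (column names assumed)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "sqft_living": rng.uniform(800, 4000, n),
    "bathrooms": rng.integers(1, 5, n),
    "zipcode": rng.choice(["98001", "98004", "98039"], n),
})
df["price"] = (150 * df["sqft_living"] + 20_000 * df["bathrooms"]
               + rng.normal(0, 30_000, n))

# Same pattern as the project formula: numeric terms plus C() for categoricals
model = smf.ols("price ~ sqft_living + bathrooms + C(zipcode)", data=df).fit()
print(round(model.rsquared, 3))
```

The `C()` wrapper tells patsy to dummy-encode the variable, which is how `waterfront`, `condition`, `grade`, and `zipcode` enter the real model.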
After all the features I dropped from the base model, I have an R² of 0.826, a little lower than before but better than the interim throwaway model's 0.6519. I dropped `price_per_sqft` and added `zipcode`: location has a huge impact on house price, so I wanted another location feature besides `distance_from_downtown_mile`. This is a better model. My Q-Q and residual plots are also better than the base model's, but worse than the interim throwaway model's because of the addition of `zipcode`. `zipcode` can explain a lot, but it also introduces noise. That is understandable given the nature of the data: houses in different zipcodes have varying prices with a general trend per zipcode, but each zipcode also contains a good mix of houses with varying features.
`grade` has a high p-value, so I dropped it for the next iteration. I also dropped `yr_built`, which was not explaining much, and I am leaving out `sqft_lot15` for now.
Formula for the OLS model: price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + sqft_basement + sqft_living15 + distance_from_downtown_mile + C(waterfront) + C(condition) + C(is_renovated) + C(zipcode)
r_sq: 0.8102
I don't have any features with high p-values except some of the `zipcode` categories: out of 70, only 9 have this issue, so I am going to leave `zipcode` in for the final model. There is also a small reduction in R², and the residual plots are practically unchanged.
On another day I might have kept `grade` in the final model because of its impact on R², but for this analysis I am happy dropping it.
These are my final selected features.
I searched for an ideal bag of features using the forward-selection method, RFE, and RFECV. None of them recommended a better model, although one did achieve an almost normal distribution of residuals by reducing the R² to as low as 0.61. That analysis can be found in the root of this repository, in `student_detailed.ipynb`. In that notebook I also did a train-test split to check for over- or under-fitting, and tried 2nd- and 3rd-degree polynomial regression in search of a better R².
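For reference, a minimal sketch of what an RFECV run looks like in scikit-learn, on synthetic data standing in for the project's feature matrix:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the scaled King County feature matrix
X, y = make_regression(n_samples=200, n_features=8, n_informative=4,
                       noise=10.0, random_state=42)

# Recursive feature elimination with 5-fold cross-validation:
# drop one feature per step, keep the subset with the best CV score
selector = RFECV(LinearRegression(), step=1, cv=5)
selector.fit(X, y)
print(selector.n_features_, selector.support_)  # kept count and feature mask
```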
Let's use the selected features on unscaled `price` to make the coefficients easier to interpret. The model result did not change at all with unscaled `price`, so the same analysis and interpretation apply. This made me wonder whether scaling was necessary or helpful for this analysis; it helps, but only minimally. I am not showing those results, as they are mostly the same.
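That scaling leaves the fit itself unchanged is expected behavior. A small synthetic illustration: standardizing the features rescales the coefficients but leaves the model's R² identical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature regression problem
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=300)

Xs = StandardScaler().fit_transform(X)
raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(Xs, y)

# Coefficients differ, goodness of fit does not
print(np.isclose(raw.score(X, y), scaled.score(Xs, y)))
```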
This model can explain 81% of the variance in the data, as indicated by R². None of the features have a problematic (high) p-value, and no collinearity was detected. Looking at the betas, I can see the relationships:
Feature | Beta | Relation | Comment | Can Control? |
---|---|---|---|---|
Intercept | 335799.05 | positive | Necessary to facilitate regression; the OLS line crosses the y-axis here | |
sqft_living | 132496.68 | positive | Should increase it; very high beta | yes |
is_renovated | 46199.95 | positive | Renovation has a good impact on the price | yes |
waterfront | 345419.15 | positive | Waterfront houses see a massive price boost | no |
sqft_living15 | 38715.26 | positive | Living space of the 15 nearest neighbors also lifts value | yes, indirectly |
bathrooms | 16821.08 | positive | Add a bathroom, as it adds value | yes |
sqft_lot | 12067.89 | positive | The larger the better; has a positive impact | not really |
floors | -9072.33 | negative | Extra floors do not add much value unless offset by some other feature | yes |
bedrooms | -9275.87 | negative | Extra bedrooms do not add much value unless offset by some other feature | yes |
sqft_basement | -24453.92 | negative | Extra sqft here does not add much value unless offset by some other feature | yes |
distance_from_downtown_mile | -105340.49 | negative | The farther from downtown, the less a house is worth | no |
C(condition)[T.5.0] | 94578.71 | positive | Higher condition means better value | yes |
C(condition)[T.4.0] | 62977.09 | positive | Higher condition means better value | yes |
C(condition)[T.3.0] | 42254.24 | positive | Higher condition means better value | yes |
A more nuanced analysis could be done on `floors`, `bedrooms`, and `sqft_basement` by treating them as categorical data for inference and providing explicit recommendations; they could also be binned. Homes with a condition below 3 were dropped in the outlier-removal process.
Zipcodes have a varying impact on house value depending on which zipcode it is. The takeaway is that zipcodes have a strong relation with price: some are positive, some negative, and houses in the negative ones can expect less value.
Zipcodes | Beta | Relation |
---|---|---|
98039 | 455084.03 | positive |
98004 | 362286.19 | positive |
98112 | 261271.39 | positive |
98040 | 259455.47 | positive |
98119 | 258799.06 | positive |
98109 | 257002.13 | positive |
98102 | 247550.35 | positive |
98105 | 230676.59 | positive |
98199 | 208969.15 | positive |
98053 | 198417.47 | positive |
98029 | 193993.07 | positive |
98005 | 188249.92 | positive |
98045 | 186074.31 | positive |
98033 | 184085.66 | positive |
98107 | 183802.17 | positive |
98117 | 183437.24 | positive |
98103 | 182682.27 | positive |
98115 | 178326.06 | positive |
98075 | 166391.60 | positive |
98027 | 162088.94 | positive |
98052 | 155003.58 | positive |
98006 | 152705.06 | positive |
98116 | 148814.95 | positive |
98074 | 146855.53 | positive |
98022 | 144307.59 | positive |
98065 | 138944.82 | positive |
98122 | 137847.61 | positive |
98008 | 134675.27 | positive |
98024 | 134383.22 | positive |
98007 | 131599.48 | positive |
98136 | 126022.79 | positive |
98077 | 125642.74 | positive |
98010 | 120318.14 | positive |
98014 | 109883.72 | positive |
98177 | 102135.16 | positive |
98072 | 73509.44 | positive |
98034 | 71763.06 | positive |
98125 | 67925.08 | positive |
98144 | 64782.45 | positive |
98019 | 63135.63 | positive |
98126 | 49323.28 | positive |
98011 | 46363.11 | positive |
98038 | 41583.86 | positive |
98133 | 37155.57 | positive |
98155 | 36122.91 | positive |
98028 | 26184.43 | positive |
98002 | 18727.78 | positive |
98118 | 8518.02 | positive |
98003 | 7354.71 | positive |
98092 | 538.06 | positive |
98023 | -4601.84 | negative |
98070 | -5941.64 | negative |
98042 | -7492.64 | negative |
98166 | -9331.88 | negative |
98059 | -18538.94 | negative |
98198 | -19427.77 | negative |
98056 | -19863.63 | negative |
98030 | -28292.89 | negative |
98146 | -28929.26 | negative |
98106 | -35205.07 | negative |
98058 | -37175.44 | negative |
98032 | -37593.16 | negative |
98148 | -41163.22 | negative |
98031 | -45656.11 | negative |
98108 | -47959.54 | negative |
98055 | -51645.43 | negative |
98188 | -63133.71 | negative |
98178 | -73196.63 | negative |
98168 | -88167.59 | negative |
Note:
- If the beta coefficient is positive, the interpretation is that for every 1-unit increase in the predictor variable, the outcome variable increases by the beta coefficient value, and vice versa.
- One category from each categorical variable is dropped as the reference level, so its impact on price cannot be read from a beta of its own.
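The dropped reference level can be seen with pandas dummy encoding. A sketch with three illustrative zipcodes (note that 98001 does not appear in the beta table above, consistent with one level being absorbed as the baseline):

```python
import pandas as pd

# With drop_first=True the first level gets no column of its own;
# its effect is absorbed into the model's intercept
s = pd.Series(["98001", "98004", "98039", "98001"], name="zipcode")
dummies = pd.get_dummies(s, prefix="zipcode", drop_first=True)
print(list(dummies.columns))  # no column for the reference level 98001
```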
The Q-Q plot of the residuals looks good, but the scatter plot is messy, indicating some bias; this is due to the nature of the data. Harsh outlier removal also has an impact on the result.
This model beats the base model on everything except R², while the base model failed at everything except R² and did not even satisfy some of the assumptions of linear regression.
Adding more features has a direct relation with R², and leaving in more outliers affects the distribution of the residuals, one of the core assumptions of linear regression. Linear regression is all right for this type of data, but it tries to predict one specific price; in real life, for this type of property, a price range rather than a point estimate may be more appropriate, depending on the use case. There are many models out there for prediction, and for calculating a range a tree-based model, or even a classifier, might be better. Still, linear regression is a widely used technique for price prediction.
Homeowners should keep in mind:
- The condition of the house is important. Invest in elements that improve it, and focus on higher-quality building materials.
- Try to increase living square footage; consider adding to the property.
- Expect more value if the house is closer to the city.
- The assigned grade is important. Read the guidelines and do things that improve the grade, like using better materials.
- Don't be an outlier in the neighborhood in terms of living space; at the least, remodel to match it.
- Lot size also matters; try to strike the right balance here.
- Bedrooms beyond 5 have diminishing returns. Do not focus on adding more; instead invest in a better condition.
- Basement square footage has a negative relation with value; focus elsewhere.
- Renovation has a positive impact on price.
- Old houses sell for less. If the house is old, try to get a better grade and condition for better value.

Then they can expect a substantial improvement in their house's value.
The notebook accompanying this analysis can be found in the root of the repository, named `student.ipynb`; a PDF version is also there. For the detailed version with additional work, check `student_detailed.ipynb`. The presentation is also in the root, named `presentation.pptx`, with a PDF version `presentation.pdf`. All the data is inside the `/Data/` folder, labeled appropriately.
I would like to, in no particular order:
- Data collection
    - Try a model with the unscaled dependent variable (`price`) and see the impact on model performance; that would help me explain the data better, although if the coefficients are not being reported, the sign alone is enough.
    - Engineer more features, e.g. total house sqft, bedroom-to-bathroom ratio.
    - Process `date` as a year, or split it into year and month, to try to capture seasonality of value in listing timing.
    - Bin some features, e.g. year built by decade, or bedrooms into single, typical, and large.
    - Make some boolean features, e.g. has basement.
    - Get some other monetary (e.g. average property price) and cultural or recreational (e.g. proximity to a stadium, local restaurant density, public transport, schools, parks) information about the locations. I am not sure how to obtain all of it.
    - Get some more information about `view` and include it in the analysis. LOCATION does matter in the real estate business.
- Modeling
    - Try another estimator for RFE, e.g. a tree-based one (random forest), with RFECV (recursive feature elimination with cross-validation).
    - Use cross-validation, a.k.a. rotation estimation, e.g. k-fold cross-validation or repeated random sub-sampling validation (a.k.a. Monte Carlo cross-validation).
    - Make a train-test split and compare predictive power.
    - Use metrics other than R², e.g. RMSE, p-values, coefficients.
    - Try other features in the model; go for an exhaustive search.
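The cross-validation item above can be sketched with scikit-learn's `cross_val_score`, here on synthetic data in place of the King County feature matrix:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project's features and price target
X, y = make_regression(n_samples=200, n_features=5, noise=15.0, random_state=0)

# 5-fold CV: fit on 4 folds, score R-squared on the held-out fold, rotate
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(scores.shape, scores.mean())
```

An R² that is stable across folds suggests the model is not badly over- or under-fit.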