By: Tamjid Ahsan
As phase 4 project of Flatiron Data Science Bootcamp.
- Student pace: Full Time
- Scheduled project review date/time: June 24, 2021, 05:00 PM [DST]
- Instructor name: James Irving
New York City is among the most expensive and competitive housing markets in the USA. It was severely impacted by COVID-19, with high job losses, and is among the most affected areas of the country. As of mid-2021, New York has been recovering from the economic impact of the pandemic. Strong buyer demand has also changed the dynamics of a residential real estate sales market that had been cooling for nearly three years.
NYC, however, is still a buyer's real estate market and buyers may have an opportunity to get some heavy discounts.
Many industry experts have been predicting strong property appreciation in New York starting from 2021, which should be a great year for property owners. Different business sectors have been opening up in different ways and at differing speeds as COVID-19 policies relax. Current trends suggest that the New York housing market will be hyperactive in the peak home-buying season.
Home prices are still low compared to where they were last year, just before the pandemic hit New York City. Most buyers aren't paying sellers' asking prices. In April 2021, the New York real estate market (statewide) showed strong sales due to pent-up buyer demand, according to the most recent housing report released by the New York State Association of REALTORS®. Closed and pending sales remained strong in April of 2021, marking the eighth consecutive month of sales growth in year-over-year comparisons. Since 2012, the NYC home values have appreciated by nearly 52% as per Zillow Home Value Index.
This makes New York one of the best real estate markets to get into: house prices are relatively low, buyer power is high, there is a large inventory of homes for sale to choose from, and a projected uptrend in prices points to a higher return on investment.
Ref: Norada Real Estate, NY Post.
XYZ, Inc. LLC is a (read: fictional) private equity investment company based in Queens, New York. They want to invest in the housing market for a relatively short term of three years. They want to isolate and invest in properties with the highest return on investment potential close to their operating base in Queens, as they want to cluster their investments by location. For this analysis, all 55 zipcodes of Queens County, New York City, NY were considered.
This analysis will recommend the top five zipcodes by return on investment potential, with some insights, to help the company's top management make an educated decision on where to invest.
Source: image generated by author using plotly, and online gif maker.
- Zillow House Value dataset is used. (more info on OBTAIN section)
- Data Science Process of the O.S.E.M.N. framework is adapted for this analysis.
- Several analysis techniques were used:
  - conventional time series methods such as `ARIMA` and `SARIMAX` by `statsmodels` on all zipcodes, not including `white noise` or `random walk` models.
  - the forecasting procedure implemented by Facebook, Inc. named `Prophet` for a handful of zipcodes, which can be found in the APPENDIX.
- Implementation of recurrent neural networks (RNN - LSTM and GRU) and transfer learning (combining `SARIMAX` and `RNN`) is a work in progress.
- Custom functions are used; they can be found in `./imports_and_functions/functions.py`.
- Most of the imports and notebook formatting used in this analysis are in `./imports_and_functions/packages.py`.
- Those are also available in the APPENDIX section.
- Main dataset:
  - Zillow Home Value Index (ZHVI): A smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. It reflects the typical value for homes in the 35th to 65th percentile range. This data, obtained from Zillow Research, is used for the time series analysis. It is separated by zipcode. A copy of that file, renamed `zillow_raw_2021.csv`, can be found here. Explanation of methodology can be found here.
- GeoJson:
  - The GeoJson file used to generate the map is sourced from here, provided by Open Data Delaware. A copy can be found at `./data/ny_new_york_zip_codes_geo.min.json` in this repository.
- Zipcodes with Neighborhood information
  - This file was obtained from here. A copy can be found at `./data/nyc-zip-codes.txt` in this repository.
Column Name | Explanation | Range |
---|---|---|
RegionID | Unique Region Identifier | from 58001 to 753844 |
SizeRank | Ranked by Population | from 0 to 35187 |
RegionName | Zipcode | 30842 unique values |
RegionType | Type of location | constant value of "Zip" |
StateName | Name of State | 51 unique values |
State | Name of State | 51 unique values |
City | City name | 15005 unique values |
Metro | Metropolitan area | 862 unique values |
CountyName | Name of county | 1758 unique values |
Rest of the columns | dates | from Jan 31, 1996 to Apr 30, 2021 |
Focusing only on 55 zip codes in Queens County, New York.
The typical house price in Queens ranges from 77k to just over 960k. The mean price is `$375207`, the mean 25th quantile is `$285044`, and the mean 75th quantile is `$457253`. Zip code `11363` has the highest property value and `11692` has the lowest.
House prices increased until 2008 and then fell because of the global financial crisis, which was caused by the subprime mortgage crisis and led to a global recession. Although the recovery started in 2010, prices did not fully recover until 2015-16. Recently the market has been booming once again, reaching new highs.
ROI is negative for only a few of the zip codes
- 11101
- 11436
- 11366
Highest ROI Zip code:
- 11104
- 11692
- 11693
This makes Queens County, NY a relatively safe region for housing market investment.
For modeling, the list of zipcodes is reduced to those with a return on investment of more than 10% over the past three years. There are 24 such zip codes: 11434, 11691, 11435, 11104, 11413, 11420, 11414, 11412, 11419, 11433, 11423, 11369, 11694, 11422, 11417, 11427, 11692, 11429, 11411, 11426, 11428, 11693, 11004, 11436.
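The filtering step can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function names, the 36-month lookback, and the toy prices are assumptions; the text only states the 10% threshold over the past three years.

```python
def trailing_roi(prices, months=36):
    """Percent change between the latest value and the value `months` steps earlier."""
    if len(prices) <= months:
        raise ValueError("series is too short for the requested lookback")
    past, latest = prices[-months - 1], prices[-1]
    return (latest - past) / past * 100


def filter_by_roi(series_by_zip, threshold=10.0, months=36):
    """Keep only the keys whose trailing ROI exceeds `threshold` percent."""
    return [
        zipcode
        for zipcode, prices in series_by_zip.items()
        if trailing_roi(prices, months) > threshold
    ]


# Toy monthly series with 37 points each (hypothetical values, not Zillow data).
toy = {
    "A": [100 + i for i in range(37)],        # rises from 100 to 136
    "B": [100 + 0.1 * i for i in range(37)],  # rises from 100 to about 103.6
}
selected = filter_by_roi(toy)
```

Here only series `A` clears the threshold, so `selected` contains just that key.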
This is a visual representation of the zip codes based on the mean typical house value of Queens County NY. The bubbles are mostly the same size, meaning that they share some similar properties across the zip codes.
BEST MODEL
Grid search using pmdarima (formerly pyramid-arima) for the best p, d, q, P, D, Q, and m to use in a SARIMA model under predefined conditions. The output also shows model performance when predicting into the future.
Predefined parameters:
- d and D are calculated using ndiffs, with 'adf' (Augmented Dickey–Fuller test for Unit Roots) for d and 'ocsb' (Osborn, Chui, Smith, and Birchenhall Test for Seasonal Unit Roots) for D.
- parameters for auto_arima model:
- start_p = 0; The starting value of p, the order (or number of time lags) of the auto-regressive (“AR”) model.
- d = d; The order of first-differencing,
- start_q = 0; order of the moving-average (“MA”) model,
- max_p = 3, max value for p
- max_q = 3, max value for q
- start_P = 0; the order of the auto-regressive portion of the seasonal model,
- D = D; The order of the seasonal differencing,
- start_Q = 0; the order of the moving-average portion of the seasonal model,
- max_P = 3, max value of P
- max_Q = 3, max value for Q
- m = 12; The period for seasonal differencing, i.e., the number of periods in each season,
- seasonal = True; this data is seasonal,
- stationary = False; data is not stationary,
- information_criterion = 'oob'; optimizing on `out-of-bag` sample validation on a scoring metric, as other information criteria did not perform well,
- out_of_sample_size = 12; number of steps held out for validation,
- scoring = 'mse'; validation metric,
- method = 'lbfgs'; limited-memory Broyden-Fletcher-Goldfarb-Shanno with optional box constraints. L-BFGS is a quasi-Newton method that approximates the `bfgs` algorithm using a limited amount of computer memory.
All other parameters were left at default.
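As a rough sketch, the settings listed above map onto a `pmdarima.auto_arima` call like the one below. The names `AUTO_ARIMA_KWARGS` and `fit_sarima` are hypothetical, and using `nsdiffs` for the seasonal order D is an assumption (it is the pmdarima helper that accepts the 'ocsb' test); the notebook's own function may differ.

```python
# Search settings from the list above; d and D are filled in per series.
AUTO_ARIMA_KWARGS = dict(
    start_p=0, start_q=0, max_p=3, max_q=3,
    start_P=0, start_Q=0, max_P=3, max_Q=3,
    m=12, seasonal=True, stationary=False,
    information_criterion="oob", out_of_sample_size=12,
    scoring="mse", method="lbfgs",
)


def fit_sarima(series):
    """Estimate d and D with unit-root tests, then run the auto_arima search."""
    # Imported lazily so the settings can be inspected without pmdarima installed.
    import pmdarima as pm

    d = pm.arima.ndiffs(series, test="adf")
    # Assumption: seasonal differencing order comes from nsdiffs with the OCSB test.
    D = pm.arima.nsdiffs(series, m=AUTO_ARIMA_KWARGS["m"], test="ocsb")
    return pm.auto_arima(series, d=d, D=D, **AUTO_ARIMA_KWARGS)
```

Calling `fit_sarima` on one zipcode's training series would return a fitted model whose `summary()` resembles the sample output in this section.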
SAMPLE OF THE PROCESS
Best model: ARIMA(1,2,2)(2,0,0)[12] intercept
Total fit time: 54.273 seconds
===========================
Model Diagonostics of 11417
===========================
Dep. Variable: | y | No. Observations: | 243 |
---|---|---|---|
Model: | SARIMAX(1, 2, 2)x(2, 0, [], 12) | Log Likelihood | -2108.641 |
Date: | Fri, 18 Jun 2021 | AIC | 4231.282 |
Time: | 13:51:27 | BIC | 4255.676 |
Sample: | 0 | HQIC | 4241.110 |
- 243 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
intercept | 13.6332 | 41.466 | 0.329 | 0.742 | -67.639 | 94.905 |
ar.L1 | 0.7545 | 0.377 | 2.000 | 0.045 | 0.015 | 1.494 |
ma.L1 | -0.7661 | 0.384 | -1.995 | 0.046 | -1.519 | -0.013 |
ma.L2 | -0.0172 | 0.015 | -1.144 | 0.253 | -0.047 | 0.012 |
ar.S.L12 | -0.0190 | 0.011 | -1.683 | 0.092 | -0.041 | 0.003 |
ar.S.L24 | 0.0097 | 0.058 | 0.166 | 0.868 | -0.104 | 0.124 |
sigma2 | 2.252e+06 | 1.23e+05 | 18.370 | 0.000 | 2.01e+06 | 2.49e+06 |
Ljung-Box (L1) (Q): | 14.92 | Jarque-Bera (JB): | 355.62 |
---|---|---|---|
Prob(Q): | 0.00 | Prob(JB): | 0.00 |
Heteroskedasticity (H): | 1.54 | Skew: | -0.59 |
Prob(H) (two-sided): | 0.05 | Kurtosis: | 8.83 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
=================================
Performance on test data of 11417
=================================
Root Mean Squared Error of test and prediction: 35232.19339655126
Mean Squared Error: 1241307451.5319903
Mean Absolute Error: 31029.564416600693
=================================
Forecast of 11417
=================================
zipcode | mean_forecasted_roi | lower_forecasted_roi | upper_forecasted_roi | std_forecasted_roi | |
---|---|---|---|---|---|
0 | 11417 | 18.57 | -29.81 | 66.95 | 48.38 |
The model fits and predicts reasonably well, with some long-tailed residuals at both ends. It can capture the future, but with limited certainty. This is expected, as house prices are determined by a combination of other factors that were not considered, e.g., loan interest rates, recent development, and other external factors.
I am going to consider these parameters the best ones for this type of model. It could be improved with a SARIMAX model that uses some of those factors as exog, but this increases model complexity and the data required, since the true exog data or a proxy is needed to predict into the future.
This process is run in a loop for all the zipcodes, and the results are saved and used for the next part of the analysis.
Criteria for selecting best zipcode:
Return on investment after three years
Cost is assumed to be the last true value of the median price of the zipcode, i.e., the value on April 30, 2021. Revenue is assumed to be the mean forecasted value after three years, i.e., 36 steps in the future. The standard deviation of the return on investment at the upper and lower confidence levels is then taken as a proxy for investment risk.
The top five zipcodes are chosen by taking the 15 zipcodes with the best ROI and then selecting the 5 with the lowest risk, i.e., the risk proxy mentioned above.
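The ROI and risk calculation described here can be sketched as below. The function names and the example prices are hypothetical. The population standard deviation is used because it is consistent with the sample forecast for 11417, where a lower ROI of -29.81 and an upper ROI of 66.95 give a standard deviation of 48.38.

```python
from statistics import pstdev


def roi_percent(cost, revenue):
    """Return on investment as a percentage of cost."""
    return (revenue - cost) / cost * 100


def forecast_roi(last_value, mean_fc, lower_fc, upper_fc):
    """ROI at the mean forecast plus a risk proxy from the confidence bounds."""
    lower_roi = roi_percent(last_value, lower_fc)
    upper_roi = roi_percent(last_value, upper_fc)
    return {
        "mean_forecasted_roi": roi_percent(last_value, mean_fc),
        "lower_forecasted_roi": lower_roi,
        "upper_forecasted_roi": upper_roi,
        # Spread between the two bounds, used as the risk proxy.
        "std_forecasted_roi": pstdev([lower_roi, upper_roi]),
    }


# Hypothetical prices chosen so the percentages mirror the 11417 sample row.
example = forecast_roi(100_000, 118_570, 70_190, 166_950)
```

With only two values, the population standard deviation is simply half the distance between the upper and lower ROI, so a wider confidence interval translates directly into a higher risk score.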
zipcode | mean_forecasted_roi | std_forecasted_roi |
---|---|---|
11429 | 22.230379 | 45.001385 |
11428 | 34.241021 | 47.287097 |
11427 | 30.558596 | 47.554807 |
11423 | 39.547861 | 48.105119 |
11417 | 18.564516 | 48.358635 |
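The two-stage selection (best ROI first, then lowest risk) might look like the sketch below. The function name `pick_zipcodes` and the dictionary layout are assumptions; the rows are the rounded values from the table above, and the smaller pool sizes in the example are only there to make the two stages visible.

```python
def pick_zipcodes(rows, roi_pool=15, final=5):
    """Take the `roi_pool` rows with the highest ROI, then the `final` least risky."""
    by_roi = sorted(rows, key=lambda r: r["mean_forecasted_roi"], reverse=True)
    pool = by_roi[:roi_pool]
    by_risk = sorted(pool, key=lambda r: r["std_forecasted_roi"])
    return [r["zipcode"] for r in by_risk[:final]]


rows = [
    {"zipcode": "11429", "mean_forecasted_roi": 22.23, "std_forecasted_roi": 45.00},
    {"zipcode": "11428", "mean_forecasted_roi": 34.24, "std_forecasted_roi": 47.29},
    {"zipcode": "11427", "mean_forecasted_roi": 30.56, "std_forecasted_roi": 47.55},
    {"zipcode": "11423", "mean_forecasted_roi": 39.55, "std_forecasted_roi": 48.11},
    {"zipcode": "11417", "mean_forecasted_roi": 18.56, "std_forecasted_roi": 48.36},
]

# Smaller pools than the analysis used, purely for illustration.
shortlist = pick_zipcodes(rows, roi_pool=3, final=2)
```

Ranking on ROI first and risk second means a low-risk zipcode with a weak forecast never reaches the shortlist, which matches the stated criteria.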
Best investment opportunities
All of them look similar. They should all be good investments: they are not expected to go under support level one, and they are likely to breach resistance two soon if the current trend persists. Details about support and resistance are in both the presentation and the notebook in this repo.
Run `fig_dash.py`, located at `./model/fig_dash.py`, for an interactive dashboard containing forecasts for all the zip codes.
Invest in following zip codes:
- 11369
- 11429
- 11420
- 11428
- 11426
Stay away from these, they are in a bubble:
- 11693
- 11415
Rule of thumb
- Go to the southeast part of Queens for good investment opportunities.
- Some of the houses are overvalued and await correction; be careful with those.
- For maximum return
  - Sell at the beginning of a year
  - Buy towards the end of a year
Although the modeling process is adequate, there are some caveats.
- This analysis does not consider the Time Value of Money, one of the major drivers of any financial decision-making process.
- Model generalization can be an issue. Individual models were not analyzed. All of the models were run in a loop and then searched for possible issues based on different metrics, e.g., RMSE and true versus predicted accuracy.
- In general, time series models are heavily contingent on the train-test split and on recent trends. All of the models were split with an 80-20 train-test ratio. Some of the models might have generalization issues as a result. Two such models were identified and dealt with, without a significant change in the decision criteria. There might be some unidentified ones.
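For reference, the 80-20 split and the error metrics mentioned in these caveats can be sketched as follows. The function names are hypothetical, and truncating the cut point with `int` is an assumption, although it is consistent with the sample summary: 304 monthly values (Jan 1996 to Apr 2021) give 243 training observations.

```python
from math import sqrt


def chronological_split(values, train_ratio=0.8):
    """Split a time-ordered sequence without shuffling: earliest part trains."""
    cut = int(len(values) * train_ratio)
    return values[:cut], values[cut:]


def error_metrics(actual, predicted):
    """RMSE, MSE and MAE between two equally long sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"rmse": sqrt(mse), "mse": mse, "mae": mae}


# Placeholder series with the same length as the monthly data.
train, test = chronological_split(list(range(304)))
metrics = error_metrics([3, 5], [1, 5])
```

Because the test block is always the most recent 20% of the series, a model's score depends strongly on how the market behaved in that window, which is the sensitivity the last caveat describes.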
- Add variables to the model, using a `SARIMAX` model
  - Interest rate
  - Economic indicators
  - Other qualitative indicators, e.g., school, public transport access.
- Try other models
  - RNN
  - Prophet
  - Use transfer learning
├── README.md # readme file
├── assets # image files and backups
│ ├── ...
├── data # data used for analysis
│ ├── lat_long.csv # location info
│ ├── ny_new_york_zip_codes_geo.min.json # GeoJSON file
│ ├── nyc-zip-codes.txt # zipcodes with neighbourhood information
│ └── zillow_raw_2021.csv # primary data source
├── imports_and_functions # local package
│ ├── __init__.py
│ ├── functions.py # custom functions
│ └── packages.py # imports used in the notebook
├── model
│ ├── all_models_output.joblib # saved results
│ ├── fig_dash.py # Dash dashboard
│ ├── ind_model # saved individual models by zipcode
│ │ ├── 11004.joblib
│ │ ├── 11104.joblib
│ │ ├── 11369.joblib
│ │ ├── 11411.joblib
│ │ ├── 11412.joblib
│ │ ├── 11413.joblib
│ │ ├── 11414.joblib
│ │ ├── 11417.joblib
│ │ ├── 11419.joblib
│ │ ├── 11420.joblib
│ │ ├── 11422.joblib
│ │ ├── 11423.joblib
│ │ ├── 11426.joblib
│ │ ├── 11427.joblib
│ │ ├── 11428.joblib
│ │ ├── 11429.joblib
│ │ ├── 11433.joblib
│ │ ├── 11434.joblib
│ │ ├── 11435.joblib
│ │ ├── 11436.joblib
│ │ ├── 11691.joblib
│ │ ├── 11692.joblib
│ │ ├── 11693.joblib
│ │ └── 11694.joblib
│ ├── results_upd.joblib # updated all_models_output
│ ├── roi.joblib # ROI information
│ ├── roi_upd.joblib # updated ROI information
│ ├── ts.joblib # cleaned and processed time series
│ └── viz.joblib
├── presentation.pdf # presentation file
├── presentation.pptx # presentation file
├── analysis.ipynb # Main notebook used
└── analysis_55_zipcodes.ipynb # additional models
For additional info, contact me via LinkedIn.