Deducing Investment Opportunity for a Real Estate Investment Company

Phase 4 project for the Flatiron Data Science Bootcamp.

Student pace: Full Time
Scheduled project review date/time: June 24, 2021, 05:00 PM [DST]
Instructor name: James Irving

OVERVIEW

New York City is among the most expensive and competitive housing markets in the USA. It was hit hard by COVID-19, with heavy job losses, and is among the most severely impacted areas of the country. As of mid-2021, New York has been recovering from the economic impact of the pandemic. Strong buyer demand has also changed the dynamics of a residential sales market that had been cooling for nearly three years.

NYC, however, is still a buyer's real estate market and buyers may have an opportunity to get some heavy discounts.

Many industry experts have been predicting strong property appreciation in New York starting in 2021, which should be a great year for property owners. Business sectors have been reopening in different ways and at different speeds as COVID-19 policies are relaxed. Current trends suggest that the New York housing market will be highly active in the peak home-buying season.

Home prices are still low compared to where they were last year, just before the pandemic hit New York City, and most buyers aren't paying sellers' asking prices. In April 2021, the statewide New York real estate market showed strong sales due to pent-up buyer demand, according to the most recent housing report released by the New York State Association of REALTORS®. Closed and pending sales remained strong in April 2021, marking the eighth consecutive month of year-over-year sales growth. Since 2012, NYC home values have appreciated by nearly 52% according to the Zillow Home Value Index.

This makes New York one of the best real estate markets to enter: house prices are relatively low, buyers have strong bargaining power, the inventory of homes for sale is large, and a projected uptrend in prices points to a higher return on investment.

Ref: Norada Real Estate, NY Post.

BUSINESS PROBLEM

XYZ, Inc. LLC is a fictional private equity investment company based in Queens, New York. They want to invest in the housing market for a relatively short term of three years. They want to identify and invest in properties with the highest return on investment potential that are close to their base of operations in Queens, because they want to cluster their investments geographically. For this analysis, all 55 zipcodes of Queens County in New York City, NY were considered.

This analysis recommends the top five zipcodes by return on investment potential, with supporting insights, to help the company's top management make an educated decision on where to invest.

Source: image generated by author using plotly, and online gif maker.

Methodology

The Zillow Home Value Index dataset is used (more information in the OBTAIN section).
The O.S.E.M.N. data science framework is adapted for this analysis.
Several analysis techniques were used:
- Conventional time series methods, ARIMA and SARIMAX from statsmodels, on all zipcodes (not including white noise or random walk models).
- Prophet, the forecasting procedure developed by Facebook, Inc., for a handful of zipcodes; this can be found in the APPENDIX.
Implementation of recurrent neural networks (RNN: LSTM and GRU) and transfer learning (combining SARIMAX and RNN) is a work in progress.

IMPORTS

Custom functions are used; they can be found in ./imports_and_functions/functions.py
Most of the imports and notebook formatting used in this analysis are in ./imports_and_functions/packages.py
Both are also available in the APPENDIX section.

OBTAIN

Main dataset:
- Zillow Home Value Index (ZHVI): A smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. It reflects the typical value for homes in the 35th to 65th percentile range. This data, obtained from Zillow Research, is used for the time series analysis and is separated by zipcode. A copy of that file renamed as zillow_raw_2021.csv can be found here. Explanation of methodology can be found here.
GeoJSON:
- The GeoJSON file used to generate the map is sourced from here, provided by Open Data Delaware. A copy of it can be found at ./data/ny_new_york_zip_codes_geo.min.json in this repository.
Zipcodes with Neighborhood information
- This file was obtained from here. A copy of this can be found at ./data/nyc-zip-codes.txt in this repository.

Zillow Dataset information

Column Name	Explanation	Range
RegionID	Unique region identifier	from 58001 to 753844
SizeRank	Ranked by population	from 0 to 35187
RegionName	Zipcode	30842 unique values
RegionType	Type of location	constant value of "Zip"
StateName	Name of State	51 unique values
State	Name of State	51 unique values
City	City name	15005 unique values
Metro	Metropolitan area	862 unique values
CountyName	Name of county	1758 unique values

Remaining columns	Dates	from Jan 31, 1996 to Apr 30, 2021

SCRUB & EXPLORE

Focusing only on 55 zip codes in Queens County, New York.
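
A minimal sketch of this filtering step, assuming the column names listed in the table above. The tiny in-memory frame, the county and state labels ("Queens County", "NY") and the helper name `queens_long` are illustrative assumptions, not the author's code or the real data.

```python
import pandas as pd

# Tiny stand-in for zillow_raw_2021.csv; all values are made up for illustration.
raw = pd.DataFrame(
    {
        "RegionID": [1, 2, 3],
        "SizeRank": [10, 20, 30],
        "RegionName": [11417, 11429, 10001],
        "RegionType": ["Zip", "Zip", "Zip"],
        "StateName": ["NY", "NY", "NY"],
        "State": ["NY", "NY", "NY"],
        "City": ["New York", "New York", "New York"],
        "Metro": ["New York-Newark-Jersey City"] * 3,
        "CountyName": ["Queens County", "Queens County", "New York County"],
        "2021-03-31": [600000.0, 550000.0, 900000.0],
        "2021-04-30": [605000.0, 556000.0, 905000.0],
    }
)

# Descriptive columns; every other column is assumed to be a date.
ID_COLS = [
    "RegionID", "SizeRank", "RegionName", "RegionType",
    "StateName", "State", "City", "Metro", "CountyName",
]


def queens_long(df, county="Queens County", state="NY"):
    """Keep one county and reshape the wide date columns into time series."""
    subset = df[(df["CountyName"] == county) & (df["State"] == state)]
    long = subset.melt(id_vars=ID_COLS, var_name="date", value_name="value")
    long["date"] = pd.to_datetime(long["date"])
    # One column per zipcode, indexed by date.
    return long.pivot(index="date", columns="RegionName", values="value").sort_index()


ts = queens_long(raw)
print(ts)
```

The wide-to-long reshape is one common way to get a date-indexed frame that time series tools can consume directly; the repository's own preprocessing may differ.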

EDA

Average home price by zipcode

Typical house prices in Queens range from $77k to just over $960k. The mean price is $375,207; the mean 25th percentile is $285,044 and the 75th percentile is $457,253. Zip code 11363 has the highest property value and 11692 the lowest.

Recent trend

House prices rose until 2008 and then fell because of the global financial crisis, which was triggered by the subprime mortgage crisis and led to a global recession. Although recovery started in 2010, prices did not fully recover until 2015-16. Recently the market has been booming once again, reaching new highs.

Three year ROI

ROI is negative for only a few of the zip codes:

11101
11436
11366

Highest ROI Zip code:

11104
11692
11693

This makes Queens County, NY a relatively safe region for residential real estate investment.

For modeling, the list of zipcodes is reduced to those with a return on investment of more than 10% over the past three years. There are 24 such zip codes: 11434, 11691, 11435, 11104, 11413, 11420, 11414, 11412, 11419, 11433, 11423, 11369, 11694, 11422, 11417, 11427, 11692, 11429, 11411, 11426, 11428, 11693, 11004, 11436.
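
The screening rule above can be sketched as follows. This assumes that "past three years" means the percent change between the latest monthly value and the value 36 months earlier; the function name `trailing_roi`, the column labels and all numbers are hypothetical.

```python
import numpy as np
import pandas as pd


def trailing_roi(ts, months=36):
    """Percent change between the last value and the value `months` steps earlier."""
    start = ts.iloc[-(months + 1)]
    end = ts.iloc[-1]
    return (end - start) / start * 100


# 37 month-end dates span exactly 36 monthly steps; values are made up.
dates = pd.date_range("2018-04-30", periods=37, freq=pd.offsets.MonthEnd())
ts = pd.DataFrame(
    {
        "zip_a": np.linspace(100.0, 115.0, 37),  # +15% over three years
        "zip_b": np.linspace(100.0, 105.0, 37),  # +5%
        "zip_c": np.linspace(100.0, 98.0, 37),   # -2%
    },
    index=dates,
)

roi = trailing_roi(ts)
selected = roi[roi > 10].index.tolist()  # keep only zipcodes above the 10% cutoff
print(roi.round(2))
print(selected)
```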

Map of zipcodes

This map shows the zip codes of Queens County, NY, based on mean typical house value. The bubbles are mostly the same size, which suggests that the zip codes share similar characteristics.

MODEL

Model on test Zipcode

grid searching using pmdarima

BEST MODEL

Grid search with pmdarima (formerly pyramid-arima) for the best p, d, q, P, D, Q and m to use in a SARIMA model under predefined conditions; it also shows model performance when predicting into the future.

Predefined parameters:

d and D are calculated in advance: d with ndiffs using 'adf' (Augmented Dickey–Fuller test for unit roots), and D with its seasonal counterpart nsdiffs using 'ocsb' (Osborn, Chui, Smith, and Birchenhall test for seasonal unit roots).
Parameters for the auto_arima model:
start_p = 0; the starting value of p, the order (or number of time lags) of the auto-regressive (“AR”) model,
d = d; the order of first-differencing,
start_q = 0; the starting order of the moving-average (“MA”) model,
max_p = 3; max value for p,
max_q = 3; max value for q,
start_P = 0; the starting order of the auto-regressive portion of the seasonal model,
D = D; the order of the seasonal differencing,
start_Q = 0; the starting order of the moving-average portion of the seasonal model,
max_P = 3; max value for P,
max_Q = 3; max value for Q,
m = 12; the period for seasonal differencing, i.e., the number of periods in each season,
seasonal = True; this data is seasonal,
stationary = False; the data is not stationary,
information_criterion = 'oob'; optimizes on out-of-bag sample validation with a scoring metric; other information criteria did not perform well,
out_of_sample_size = 12; number of steps held out for validation,
scoring = 'mse'; validation metric,
method = 'lbfgs'; limited-memory Broyden-Fletcher-Goldfarb-Shanno with optional box constraints. L-BFGS belongs to the family of quasi-Newton methods and approximates BFGS using a limited amount of computer memory.

All other parameters were left at their defaults.
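
The parameter list above maps onto a pmdarima call roughly as sketched below. The helper names (`estimate_differencing`, `auto_arima_kwargs`, `search_sarima`) are hypothetical, and the sketch assumes the seasonal order D comes from pmdarima's `nsdiffs`; the author's own wrapper lives in ./imports_and_functions/functions.py and may differ. pmdarima is imported lazily so that the settings can be inspected without it installed.

```python
def estimate_differencing(y, m=12):
    """Estimate d with the ADF test and D with the OCSB test (requires pmdarima)."""
    from pmdarima.arima import ndiffs, nsdiffs

    d = ndiffs(y, test="adf")
    D = nsdiffs(y, m=m, test="ocsb")
    return d, D


def auto_arima_kwargs(d, D):
    """Collect the search settings listed above; everything else stays at defaults."""
    return dict(
        start_p=0, d=d, start_q=0, max_p=3, max_q=3,
        start_P=0, D=D, start_Q=0, max_P=3, max_Q=3,
        m=12, seasonal=True, stationary=False,
        information_criterion="oob",  # score candidates on held-out samples
        out_of_sample_size=12,        # last 12 steps held out for validation
        scoring="mse",
        method="lbfgs",
    )


def search_sarima(y):
    """Run the search on one zipcode's series (requires pmdarima)."""
    import pmdarima as pm

    d, D = estimate_differencing(y)
    return pm.auto_arima(y, **auto_arima_kwargs(d, D))


print(auto_arima_kwargs(d=2, D=0))
```

Calling `search_sarima(series)` on a single zipcode's monthly values would return the fitted model, whose `summary()` output resembles the sample shown in the next section.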

SAMPLE OF THE PROCESS

Best model:  ARIMA(1,2,2)(2,0,0)[12] intercept
Total fit time: 54.273 seconds

===========================
Model Diagonostics of 11417
===========================

SARIMAX Results

Dep. Variable:	y	No. Observations:	243
Model:	SARIMAX(1, 2, 2)x(2, 0, [], 12)	Log Likelihood	-2108.641
Date:	Fri, 18 Jun 2021	AIC	4231.282
Time:	13:51:27	BIC	4255.676
Sample:	0	HQIC	4241.110
	- 243
Covariance Type:	opg

	coef	std err	z	P>|z|	[0.025	0.975]
intercept	13.6332	41.466	0.329	0.742	-67.639	94.905
ar.L1	0.7545	0.377	2.000	0.045	0.015	1.494
ma.L1	-0.7661	0.384	-1.995	0.046	-1.519	-0.013
ma.L2	-0.0172	0.015	-1.144	0.253	-0.047	0.012
ar.S.L12	-0.0190	0.011	-1.683	0.092	-0.041	0.003
ar.S.L24	0.0097	0.058	0.166	0.868	-0.104	0.124
sigma2	2.252e+06	1.23e+05	18.370	0.000	2.01e+06	2.49e+06

Ljung-Box (L1) (Q):	14.92	Jarque-Bera (JB):	355.62
Prob(Q):	0.00	Prob(JB):	0.00
Heteroskedasticity (H):	1.54	Skew:	-0.59
Prob(H) (two-sided):	0.05	Kurtosis:	8.83

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

=================================
Performance on test data of 11417
=================================
Root Mean Squared Error of test and prediction: 35232.19339655126
Mean Squared Error: 1241307451.5319903
Mean Absolute Error: 31029.564416600693

=================================
Forecast of 11417
=================================

	zipcode	mean_forecasted_roi	lower_forecasted_roi	upper_forecasted_roi	std_forecasted_roi
0	11417	18.57	-29.81	66.95	48.38

The model fits and predicts reasonably well, with some long-tailed residuals at both ends. It can capture the future trend, but with limited certainty. This is expected, as house prices are determined by a combination of factors that were not considered, e.g., loan interest rates, recent development and other external factors.

I am going to consider these parameters the best for this type of model. The model could be improved by passing some of those factors as exog to a SARIMAX model, but this increases model complexity and the data required, since the true future values of each exog variable, or a proxy, are needed to predict into the future.
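
The test-set metrics in the sample output can be computed along the following lines. This is a sketch under assumptions: the data is synthetic, the forecast is a naive stand-in rather than the SARIMA model, and the helper names are hypothetical. The 80-20 chronological split is taken from the CONCLUSION section; with 304 monthly values (Jan 1996 through Apr 2021) it yields 243 training points, consistent with the 243 observations in the summary above.

```python
import numpy as np


def chronological_split(values, train_ratio=0.8):
    """Split a series in time order: the first part trains, the rest tests."""
    cut = int(len(values) * train_ratio)
    return values[:cut], values[cut:]


def error_metrics(y_true, y_pred):
    """Return RMSE, MSE and MAE between observed and predicted values."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = float(np.mean(err ** 2))
    return {"rmse": mse ** 0.5, "mse": mse, "mae": float(np.mean(np.abs(err)))}


# Synthetic monthly series of the same length as the Zillow data.
series = np.linspace(200_000, 600_000, 304)
train, test = chronological_split(series)

# Stand-in forecast: repeat the last training value (NOT the SARIMA model).
naive_forecast = np.full(len(test), train[-1])
metrics = error_metrics(test, naive_forecast)
print(len(train), len(test), metrics)
```

RMSE is the square root of MSE, which also holds for the reported figures: 35232.19 squared is approximately 1241307451.5.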

All Zipcodes

This process is run in a loop for all the zipcodes; the results are saved and used in the next part of the analysis.

High return Zipcodes

Criteria for selecting best zipcode:

Return on investment after three years

Cost is assumed to be the last true value of the median price of the zipcode, i.e., the value on April 30, 2021, and revenue is assumed to be the mean forecasted value after three years, i.e., 36 steps into the future. The standard deviation of the return on investment at the upper and lower confidence bounds is then taken as a proxy for investment risk.
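
A small worked sketch of this calculation. The prices below are hypothetical, chosen so that the resulting ROI figures match the 11417 sample row (18.57, -29.81, 66.95). The risk proxy is assumed to be the population standard deviation (NumPy's default, ddof=0) of the lower and upper ROI, which for two numbers is half their distance; this reproduces the reported 48.38 but is not confirmed by the source.

```python
import numpy as np


def forecast_roi(cost, mean_fc, lower_fc, upper_fc):
    """ROI (%) at the mean forecast and both confidence bounds, plus the risk proxy."""

    def roi(revenue):
        return (revenue - cost) / cost * 100

    lower_roi, upper_roi = roi(lower_fc), roi(upper_fc)
    return {
        "mean_forecasted_roi": roi(mean_fc),
        "lower_forecasted_roi": lower_roi,
        "upper_forecasted_roi": upper_roi,
        # Spread between the two bounds, used as a proxy for risk.
        "std_forecasted_roi": float(np.std([lower_roi, upper_roi])),
    }


# Hypothetical prices: last observed value and the 36-step-ahead forecast with bounds.
result = forecast_roi(cost=100_000, mean_fc=118_570, lower_fc=70_190, upper_fc=166_950)
print({k: round(v, 2) for k, v in result.items()})
```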

The top five zipcodes are chosen by taking the 15 zipcodes with the best ROI and then selecting the five with the lowest risk, i.e., the risk proxy mentioned above.

	mean_forecasted_roi	std_forecasted_roi
zipcode
11429	22.230379	45.001385
11428	34.241021	47.287097
11427	30.558596	47.554807
11423	39.547861	48.105119
11417	18.564516	48.358635
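
The two-stage selection can be sketched with pandas as below. The column names follow the table above; the function name `pick_zipcodes`, the zipcode labels and the numbers are made up, and the example uses smaller cutoffs (3 and 2) so the toy table stays short.

```python
import pandas as pd


def pick_zipcodes(roi_table, n_roi=15, n_final=5):
    """Keep the n_roi highest-ROI rows, then the n_final of those with the lowest risk."""
    best_roi = roi_table.nlargest(n_roi, "mean_forecasted_roi")
    return best_roi.nsmallest(n_final, "std_forecasted_roi")


# Hypothetical forecast summary, one row per zipcode.
roi_table = pd.DataFrame(
    {
        "mean_forecasted_roi": [10.0, 40.0, 30.0, 20.0, 5.0],
        "std_forecasted_roi": [1.0, 50.0, 45.0, 48.0, 2.0],
    },
    index=pd.Index(["zip_a", "zip_b", "zip_c", "zip_d", "zip_e"], name="zipcode"),
)

picked = pick_zipcodes(roi_table, n_roi=3, n_final=2)
print(picked)
```

Note that a low-risk zipcode with a poor ROI (such as `zip_a` here) is dropped in the first stage and never reaches the risk ranking.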

Visual

INTERPRET

Best investment opportunities

All of them look similar. They should all be good investments: they are not expected to fall below support level one, and they are likely to breach resistance level two soon if the current trend persists. Details about support and resistance are in both the presentation and the notebook in this repo.

Run ./model/fig_dash.py for an interactive dashboard containing forecasts for all the zip codes.

RECOMMENDATION

Invest in the following zip codes:

11369
11429
11420
11428
11426

Stay away from these; they are in a bubble:

11693
11415

Rule of thumb

Go to the southeast part of Queens for good investment opportunities.
Some of the houses are overvalued and await correction; be careful with those.
For maximum return:
- Sell at the beginning of a year
- Buy towards the end of a year

CONCLUSION

Although the modeling process is adequate, there are some caveats.

This analysis does not consider the time value of money, one of the major drivers of any financial decision-making process.
Model generalization can be an issue. Individual models were not analyzed one by one. All of the models were run in a loop and then checked for possible issues based on different metrics, e.g., RMSE and true-versus-predicted accuracy.
In general, time series models are heavily contingent on the train-test split and on recent trends. All of the models used an 80-20 train-test split, so generalization issues may be present in some of them. Two such models were identified and dealt with, without a significant change in the decision criteria, but there might be some unidentified ones.
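
To illustrate the first caveat, the sketch below discounts a forecasted sale value back to the present before computing ROI. The 5% annual discount rate, the prices and the function name `discounted_roi` are hypothetical and not part of the analysis.

```python
def discounted_roi(cost, forecast_value, annual_rate, years=3):
    """ROI (%) after discounting the future value back to the purchase date."""
    present_value = forecast_value / (1 + annual_rate) ** years
    return (present_value - cost) / cost * 100


# Hypothetical purchase price and three-year forecast.
cost, forecast_value = 100_000, 118_570

nominal = discounted_roi(cost, forecast_value, annual_rate=0.0)
discounted = discounted_roi(cost, forecast_value, annual_rate=0.05)
print(round(nominal, 2), round(discounted, 2))
```

With a zero rate the function reduces to the undiscounted ROI used in this analysis; any positive rate lowers the figure, which could change how zipcodes compare against alternative investments.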

NEXT STEPS

Add exogenous variables to the model by using a SARIMAX model
- Interest rate
- Economic indicators
- Other qualitative indicators, e.g., school, public transport access.
Try other models
- RNN
- Prophet
- Use transfer learning

REPOSITORY STRUCTURE

├── README.md                                             # readme file
├── assets                                                # image files and backups
│   ├── ... 
├── data                                                  # data used for analysis
│   ├── lat_long.csv                                      # location info
│   ├── ny_new_york_zip_codes_geo.min.json                # GeoJSON file
│   ├── nyc-zip-codes.txt                                 # zipcodes with neighbourhood information
│   └── zillow_raw_2021.csv                               # primary data source
├── imports_and_functions                                 # local package
│   ├── __init__.py
│   ├── functions.py                                      # custom functions
│   └── packages.py                                       # imports used in the notebook
├── model
│   ├── all_models_output.joblib                          # saved results
│   ├── fig_dash.py                                       # Dash dashboard
│   ├── ind_model                                         # saved individual models by zipcode
│   │   ├── 11004.joblib
│   │   ├── 11104.joblib
│   │   ├── 11369.joblib
│   │   ├── 11411.joblib
│   │   ├── 11412.joblib
│   │   ├── 11413.joblib
│   │   ├── 11414.joblib
│   │   ├── 11417.joblib
│   │   ├── 11419.joblib
│   │   ├── 11420.joblib
│   │   ├── 11422.joblib
│   │   ├── 11423.joblib
│   │   ├── 11426.joblib
│   │   ├── 11427.joblib
│   │   ├── 11428.joblib
│   │   ├── 11429.joblib
│   │   ├── 11433.joblib
│   │   ├── 11434.joblib
│   │   ├── 11435.joblib
│   │   ├── 11436.joblib
│   │   ├── 11691.joblib
│   │   ├── 11692.joblib
│   │   ├── 11693.joblib
│   │   └── 11694.joblib
│   ├── results_upd.joblib                                # updated all_models_output
│   ├── roi.joblib                                        # ROI information
│   ├── roi_upd.joblib                                    # updated ROI information
│   ├── ts.joblib                                         # cleaned and processed time series
│   └── viz.joblib
├── presentation.pdf                                      # presentation file
├── presentation.pptx                                     # presentation file
├── analysis.ipynb                                         # Main notebook used
└── analysis_55_zipcodes.ipynb                             # additional models

For additional info, contact me via LinkedIn.
