
tamjid-ahsan/Investment-Opportunity-for-Real-Estate-Company

 
 


Deducing Investment Opportunity for a Real Estate Investment Company

By: Tamjid Ahsan


Completed as the Phase 4 project of the Flatiron Data Science Bootcamp.

  • Student pace: Full Time
  • Scheduled project review date/time: June 24, 2021, 05:00 PM [DST]
  • Instructor name: James Irving

OVERVIEW


New York City is among the most expensive and competitive housing markets in the USA. It was hit severely by COVID-19, with heavy job losses, and is among the most affected areas of the country. As of mid-2021, New York has been recovering from the economic impact of the pandemic. Strong buyer demand has also changed the dynamics of a residential sales market that had been cooling for nearly three years.

NYC, however, is still a buyer's real estate market and buyers may have an opportunity to get some heavy discounts.

Many industry experts have been predicting strong property appreciation in New York starting in 2021, and 2021 should be a great year for property owners. Different business sectors have been opening up in different ways and at differing speeds as COVID-19 policies relax. Current trends suggest that the New York housing market will be hyperactive in the peak home-buying season.

Home prices are still low compared to where they were last year, just before the pandemic hit New York City. Most buyers aren't paying sellers' asking prices. In April 2021, the New York real estate market (statewide) showed strong sales due to pent-up buyer demand, according to the most recent housing report released by the New York State Association of REALTORS®. Closed and pending sales remained strong in April of 2021, marking the eighth consecutive month of sales growth in year-over-year comparisons. Since 2012, the NYC home values have appreciated by nearly 52% as per Zillow Home Value Index.

This makes New York one of the best real estate markets to get into: house prices are relatively low, buyer power is high, there is a huge inventory of homes for sale to choose from, and a projected uptrend in price points to a higher return on investment.

Ref: Norada Real Estate, NY Post.


BUSINESS PROBLEM


XYZ, Inc. LLC is a (read: fictional) private equity investment company based in Queens, New York. They want to invest in the housing market for a relatively short term of three years. They want to isolate and invest in properties with the highest return on investment potential that are geographically close to their operation base in Queens, as they want to cluster their investments by location. For this analysis, all 55 zipcodes of Queens County, New York City, NY were considered.


This analysis will recommend the top five zipcodes by return on investment potential, with some insights, which will aid the top management of the company in making an educated decision on where to invest.


head!


Source: image generated by author using plotly, and online gif maker.


Methodology


  • The Zillow Home Value dataset is used (more info in the OBTAIN section).

  • The O.S.E.M.N. data science process framework is adapted for this analysis.

  • Conventional time series methods, ARIMA and SARIMAX from statsmodels, were used on all zipcodes.

  • White noise and random walk models are not included.

  • Prophet, the forecasting procedure implemented by Facebook, Inc., was used for a handful of zipcodes; it can be found in the APPENDIX.

  • Implementation of recurrent neural networks (RNN: LSTM and GRU) and transfer learning (combining SARIMAX and RNN) is a work in progress.


IMPORTS


  • Custom functions are used; they can be found in ./imports_and_functions/functions.py
  • Most of the imports and notebook formatting used in this analysis are in ./imports_and_functions/packages.py
  • Both are also available in the APPENDIX section.

OBTAIN

  • Main dataset:
    • Zillow Home Value Index (ZHVI): A smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. It reflects the typical value for homes in the 35th to 65th percentile range. This data, obtained from Zillow Research, is used for the time series analysis and is separated by zipcode. A copy of that file, renamed zillow_raw_2021.csv, can be found here. An explanation of the methodology can be found here.
  • GeoJson:
    • The GeoJSON file used to generate the map is sourced from here, provided by Open Data Delaware. A copy can be found at ./data/ny_new_york_zip_codes_geo.min.json in this repository.
  • Zipcodes with Neighborhood information
    • This file was obtained from here. A copy of this can be found at ./data/nyc-zip-codes.txt in this repository.

Zillow Dataset information

| Column Name | Explanation | Range |
|---|---|---|
| RegionID | Unique region identifier | from 58001 to 753844 |
| SizeRank | Ranked by population | from 0 to 35187 |
| RegionName | Zipcode | 30842 unique values |
| RegionType | Type of location | constant value of "Zip" |
| StateName | Name of state | 51 unique values |
| State | Name of state | 51 unique values |
| City | City name | 15005 unique values |
| Metro | Metropolitan area | 862 unique values |
| CountyName | Name of county | 1758 unique values |
| Rest of the columns | Dates | from Jan 31, 1996 to Apr 30, 2021 |

SCRUB & EXPLORE

Focusing only on 55 zip codes in Queens County, New York.
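
The scrub step can be sketched as follows: keep only the Queens County rows of the wide Zillow file and turn the date columns into one row per zipcode and month. This is a minimal sketch using only the standard library, not the code from the notebook. The rows in `RAW` are made-up toy values, the `YYYY-MM-DD` column names are an assumption about the file layout, and `queens_long` is a hypothetical helper name.

```python
import csv
import io

# Toy stand-in for zillow_raw_2021.csv; every value below is made up.
RAW = """RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,2021-03-31,2021-04-30
1,0,11417,Zip,NY,NY,New York,New York-Newark-Jersey City,Queens County,600000,605000
2,1,10001,Zip,NY,NY,New York,New York-Newark-Jersey City,New York County,900000,905000
3,2,19901,Zip,DE,DE,Dover,Dover,Kent County,200000,201000
"""

# Descriptive columns from the dataset table; everything else is treated as a date.
ID_COLUMNS = {
    "RegionID", "SizeRank", "RegionName", "RegionType",
    "StateName", "State", "City", "Metro", "CountyName",
}


def queens_long(csv_text):
    """Return (zipcode, date, value) tuples for Queens County, NY only."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        if record["State"] != "NY" or record["CountyName"] != "Queens County":
            continue
        for column, value in record.items():
            # Skip descriptive columns and months with no reported value.
            if column not in ID_COLUMNS and value != "":
                rows.append((record["RegionName"], column, float(value)))
    return rows


for row in queens_long(RAW):
    print(row)
```

Filtering on both state and county name avoids picking up counties with the same name in other states.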

EDA

Average home price by zipcode

png

Typical house prices in Queens range from $77k to just over $960k. The mean price is $375,207; the mean 25th percentile is $285,044 and the 75th percentile is $457,253. Zip code 11363 has the highest property value and 11692 has the lowest.

Recent trend

png

House prices increased until 2008 and then fell because of the global financial crisis, which was triggered by the subprime mortgage crisis and led to a global recession. Although recovery started in 2010, prices did not recover until 2015-16. Recently the market has been booming once again, reaching new highs.

Three year ROI

png

ROI is negative for only a few of the zip codes:

  • 11101
  • 11436
  • 11366

Highest ROI Zip code:

  • 11104
  • 11692
  • 11693

This makes Queens County, NY a relatively safe region for investing in the housing market.

For modeling, the list of zipcodes is reduced to those with a return on investment of more than 10% over the past three years. There are 24 such zip codes: 11434, 11691, 11435, 11104, 11413, 11420, 11414, 11412, 11419, 11433, 11423, 11369, 11694, 11422, 11417, 11427, 11692, 11429, 11411, 11426, 11428, 11693, 11004, 11436.
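
The 10% filter can be illustrated with a short sketch. It assumes the historical ROI is the percentage change between the latest value and the value a fixed number of months earlier; the exact definition used in the notebook is not stated here, so treat this as one plausible reading. `trailing_roi` and `high_roi_zipcodes` are hypothetical names, and the series are toy values.

```python
def trailing_roi(values, months=36):
    """Percentage change between the last value and the one `months` steps before it."""
    if len(values) <= months:
        raise ValueError("series is shorter than the look-back window")
    start, end = values[-(months + 1)], values[-1]
    return (end - start) / start * 100.0


def high_roi_zipcodes(series_by_zip, threshold=10.0, months=36):
    """Zipcodes whose trailing ROI exceeds `threshold` percent, sorted by zipcode."""
    return sorted(
        zipcode
        for zipcode, values in series_by_zip.items()
        if trailing_roi(values, months) > threshold
    )


# Toy monthly series with a 2-month look-back so the example stays short.
toy = {
    "zip_a": [100.0, 105.0, 115.0],  # +15%
    "zip_b": [100.0, 102.0, 104.0],  # +4%
}
print(high_roi_zipcodes(toy, threshold=10.0, months=2))
```

With real data the look-back would be 36 monthly steps, matching the three-year horizon of the business problem.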

Map of zipcodes

png

This is a visual representation of the zip codes based on the mean typical house value in Queens County, NY. The bubbles are mostly the same size, meaning that the zip codes share similar properties.

MODEL

Model on test Zipcode

grid searching using pmdarima

BEST MODEL


Grid search using pmdarima (formerly pyramid-arima) for the best p, d, q, P, D, Q, and m to use in a SARIMA model under predefined conditions; it also shows the model's performance when predicting into the future.


Predefined parameters:

  • d and D are calculated using ndiffs, with 'adf' (Augmented Dickey–Fuller test for unit roots) for d and 'ocsb' (Osborn, Chui, Smith, and Birchenhall test for seasonal unit roots) for D.
  • Parameters for the auto_arima model:
  • start_p = 0; the starting value of p, the order (or number of time lags) of the auto-regressive (“AR”) model,
  • d = d; the order of first-differencing,
  • start_q = 0; the starting value of q, the order of the moving-average (“MA”) model,
  • max_p = 3; max value for p,
  • max_q = 3; max value for q,
  • start_P = 0; the starting value of P, the order of the auto-regressive portion of the seasonal model,
  • D = D; the order of the seasonal differencing,
  • start_Q = 0; the starting value of Q, the order of the moving-average portion of the seasonal model,
  • max_P = 3; max value for P,
  • max_Q = 3; max value for Q,
  • m = 12; the period for seasonal differencing, i.e., the number of periods in each season,
  • seasonal = True; this data is seasonal,
  • stationary = False; the data is not stationary,
  • information_criterion = 'oob'; optimizes on out-of-bag sample validation with a scoring metric, as other information criteria did not perform well,
  • out_of_sample_size = 12; number of steps held out for validation,
  • scoring = 'mse'; validation metric,
  • method = 'lbfgs'; limited-memory Broyden-Fletcher-Goldfarb-Shanno with optional box constraints. L-BFGS belongs to the family of quasi-Newton methods and approximates BFGS using a limited amount of computer memory.

all other parameters were left at default.
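
The settings listed above map onto a pmdarima call roughly as in the sketch below. It is a reading of the list, not the code from the notebook: `AUTO_ARIMA_KWARGS` and `fit_sarima` are hypothetical names, and using `nsdiffs` for the seasonal order D is an assumption (the text mentions only ndiffs, while pmdarima exposes the OCSB test through `nsdiffs`).

```python
# Search settings taken from the parameter list; d and D are estimated per series.
AUTO_ARIMA_KWARGS = dict(
    start_p=0, start_q=0, max_p=3, max_q=3,
    start_P=0, start_Q=0, max_P=3, max_Q=3,
    m=12,                         # monthly data, yearly season
    seasonal=True,
    stationary=False,
    information_criterion="oob",  # score candidates on held-out samples
    out_of_sample_size=12,        # last 12 steps held out for validation
    scoring="mse",
    method="lbfgs",
)


def fit_sarima(train):
    """Estimate d and D with unit-root tests, then run the auto_arima search."""
    # Imported lazily so the settings above can be inspected without pmdarima installed.
    import pmdarima as pm

    d = pm.arima.ndiffs(train, test="adf")
    # Assumption: seasonal differencing order comes from nsdiffs with the OCSB test.
    D = pm.arima.nsdiffs(train, m=AUTO_ARIMA_KWARGS["m"], test="ocsb")
    return pm.auto_arima(train, d=d, D=D, **AUTO_ARIMA_KWARGS)
```

Calling `fit_sarima(train)` on the training portion of one zipcode's series returns the selected model; `model.summary()` then prints diagnostics of the kind shown in the sample below.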

SAMPLE OF THE PROCESS

Best model:  ARIMA(1,2,2)(2,0,0)[12] intercept
Total fit time: 54.273 seconds

===========================
Model Diagonostics of 11417
===========================
SARIMAX Results
Dep. Variable:    y                                No. Observations:  243
Model:            SARIMAX(1, 2, 2)x(2, 0, [], 12)  Log Likelihood     -2108.641
Date:             Fri, 18 Jun 2021                 AIC                4231.282
Time:             13:51:27                         BIC                4255.676
Sample:           0 - 243                          HQIC               4241.110
Covariance Type:  opg

            coef       std err   z       P>|z|  [0.025    0.975]
intercept   13.6332    41.466    0.329   0.742  -67.639   94.905
ar.L1       0.7545     0.377     2.000   0.045  0.015     1.494
ma.L1       -0.7661    0.384     -1.995  0.046  -1.519    -0.013
ma.L2       -0.0172    0.015     -1.144  0.253  -0.047    0.012
ar.S.L12    -0.0190    0.011     -1.683  0.092  -0.041    0.003
ar.S.L24    0.0097     0.058     0.166   0.868  -0.104    0.124
sigma2      2.252e+06  1.23e+05  18.370  0.000  2.01e+06  2.49e+06

Ljung-Box (L1) (Q):       14.92  Jarque-Bera (JB):  355.62
Prob(Q):                  0.00   Prob(JB):          0.00
Heteroskedasticity (H):   1.54   Skew:              -0.59
Prob(H) (two-sided):      0.05   Kurtosis:          8.83


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

png

=================================
Performance on test data of 11417
=================================
Root Mean Squared Error of test and prediction: 35232.19339655126
Mean Squared Error: 1241307451.5319903
Mean Absolute Error: 31029.564416600693
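
For reference, the three error figures above can be computed as in this minimal sketch. `forecast_errors` is a hypothetical helper, not the function from the repository, and the input values are toy numbers.

```python
import math


def forecast_errors(actual, predicted):
    """Return RMSE, MSE and MAE between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    residuals = [a - p for a, p in zip(actual, predicted)]
    mse = sum(r * r for r in residuals) / len(residuals)
    mae = sum(abs(r) for r in residuals) / len(residuals)
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae}


# Toy test values and predictions.
print(forecast_errors([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

RMSE is the square root of MSE, which is consistent with the two figures reported for 11417. Both RMSE and MAE are in the same units as the house values, so they can be read directly as dollar errors.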

png

png

=================================
Forecast of 11417
=================================

png

png

   zipcode  mean_forecasted_roi  lower_forecasted_roi  upper_forecasted_roi  std_forecasted_roi
0  11417    18.57                -29.81                66.95                 48.38

The model looks good in fitting and predicting, with some long-tailed residuals at both ends. It can capture the future, but with less certainty. This is expected, as house prices are determined by a combination of other factors that were not considered, e.g., loan interest rates, recent development, and other external factors.

I am going to consider these parameters the best ones for this type of model. This could be improved with a SARIMAX model that uses some of those factors as exog, but that increases model complexity and the data needed, since the true future values of each exog variable, or a proxy, are needed to predict into the future.

All Zipcodes

This process is run in a loop for all the zipcodes, and the results are saved and used for the next part of the analysis.

High return Zipcodes

Criteria for selecting best zipcode:

Return on investment after three years

ROI formula

Cost is assumed to be the last true value of the median price of the zipcode, i.e., the value on April 30, 2021. Revenue is assumed to be the mean forecasted value after three years, i.e., 36 steps into the future. The standard deviation of the return on investment at the upper and lower confidence levels is then taken as a proxy for the risk of the investment.
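
Read as a percentage, the calculation described above is ROI = (revenue - cost) / cost × 100. The sketch below applies it to the mean, lower, and upper forecasts and takes the standard deviation of the two bound ROIs as the risk proxy. The use of the population standard deviation is an assumption; it is consistent with the figures reported for 11417, where 48.38 is half the gap between -29.81 and 66.95. The function names and the dollar inputs are hypothetical.

```python
from statistics import pstdev


def roi_percent(cost, revenue):
    """Return on investment as a percentage of cost."""
    return (revenue - cost) / cost * 100.0


def forecast_roi(cost, mean_forecast, lower_forecast, upper_forecast):
    """ROI at the mean forecast plus a risk proxy from the confidence bounds."""
    lower_roi = roi_percent(cost, lower_forecast)
    upper_roi = roi_percent(cost, upper_forecast)
    return {
        "mean_forecasted_roi": roi_percent(cost, mean_forecast),
        "lower_forecasted_roi": lower_roi,
        "upper_forecasted_roi": upper_roi,
        # Assumption: population standard deviation of the two bound ROIs.
        "std_forecasted_roi": pstdev([lower_roi, upper_roi]),
    }


# Toy prices chosen so the ROIs match the percentages reported for 11417.
print(forecast_roi(100.0, 118.57, 70.19, 166.95))
```

Because the proxy is computed from only two values, it is simply half the width of the forecast interval expressed in ROI terms: a wider confidence band means a higher risk score.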


The top five zipcodes are chosen by taking the 15 zipcodes with the best ROI and then selecting the 5 with the lowest risk, i.e., the risk proxy mentioned above.

zipcode  mean_forecasted_roi  std_forecasted_roi
11429    22.230379            45.001385
11428    34.241021            47.287097
11427    30.558596            47.554807
11423    39.547861            48.105119
11417    18.564516            48.358635
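
The two-stage selection can be sketched as below: rank by forecasted ROI, keep the best few, then keep the least risky of those. `select_zipcodes` is a hypothetical helper. Only the five rows from the table above are available here, so the example shrinks the stages to "best 3 by ROI, then lowest 2 by risk" purely for illustration; the analysis itself uses 15 and 5.

```python
def select_zipcodes(rows, n_best_roi=15, n_lowest_risk=5):
    """rows: (zipcode, mean_forecasted_roi, std_forecasted_roi) tuples."""
    # Stage 1: highest forecasted ROI first.
    best_roi = sorted(rows, key=lambda row: row[1], reverse=True)[:n_best_roi]
    # Stage 2: among those, lowest risk proxy first.
    lowest_risk = sorted(best_roi, key=lambda row: row[2])[:n_lowest_risk]
    return [zipcode for zipcode, _, _ in lowest_risk]


# Rows copied from the table above.
table = [
    ("11429", 22.230379, 45.001385),
    ("11428", 34.241021, 47.287097),
    ("11427", 30.558596, 47.554807),
    ("11423", 39.547861, 48.105119),
    ("11417", 18.564516, 48.358635),
]
print(select_zipcodes(table, n_best_roi=3, n_lowest_risk=2))
```

Ranking on return first and risk second means a low-risk zipcode with a mediocre forecast never makes the shortlist, which suits an investor whose primary criterion is ROI.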

png

Visual

static image of report

INTERPRET


Best investment opportunities

png

png

png

png

png

All of them look similar. They should all be good investments: they are not expected to go under support level one, and they are likely to breach resistance two soon if the current trend persists. Details about support and resistance are in both the presentation and the notebook in this repo.

png

Run fig_dash.py, located at ./model/fig_dash.py, for an interactive dashboard containing forecasts for all the zip codes.

RECOMMENDATION

Invest in the following zip codes:

  • 11369
  • 11429
  • 11420
  • 11428
  • 11426

Stay away from these; they are in a bubble:

  • 11693
  • 11415

Rule of thumb

  • Go to the southeast part of Queens for good investment opportunities.
  • Some of the houses are overvalued and await correction; be careful with those.
  • For maximum return
    • Sell at the beginning of a year
    • Buy towards the end of a year

CONCLUSION

Although the modeling process is adequate, there are some caveats.

  • This analysis does not consider the time value of money, one of the major drivers of any financial decision-making process.
  • Model generalization can be an issue. Individual models were not analyzed. All of the models were run in a loop and then checked for possible issues based on different metrics, e.g., RMSE and true-versus-predicted accuracy.
  • In general, time series models are heavily contingent on the train-test split and on recent trends. All of the models were split with an 80-20 train-test ratio. Some of the models may have generalization issues as a result. Two such models were identified and dealt with, without significant change in the decision criteria, but there might be some unidentified ones.
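
To make the split caveat concrete, here is a minimal sketch of a chronological 80-20 split, where the most recent observations form the test set and nothing is shuffled. `chronological_split` is a hypothetical helper. The length of 304 assumes one observation per month from Jan 1996 to Apr 2021, which gives 243 training points and matches the number of observations in the sample model summary.

```python
def chronological_split(series, train_fraction=0.8):
    """Split an ordered series into (train, test) without shuffling."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


# Stand-in for 304 monthly observations (Jan 1996 to Apr 2021).
months = list(range(304))
train, test = chronological_split(months)
print(len(train), len(test))
```

Because the test window is always the latest stretch of data, a model's test error depends heavily on what the market did in those final months, which is the sensitivity the caveat above describes.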

NEXT STEPS

  • Add variables to the model, for use in a SARIMAX model

    • Interest rate
    • Economic indicators
    • Other qualitative indicators, e.g., school, public transport access.
  • Try other models

    • RNN
    • Prophet
    • Use transfer learning

REPOSITORY STRUCTURE

├── README.md                                             # readme file
├── assets                                                # image files and backups
│   ├── ... 
├── data                                                  # data used for analysis
│   ├── lat_long.csv                                      # location info
│   ├── ny_new_york_zip_codes_geo.min.json                # GeoJSON file
│   ├── nyc-zip-codes.txt                                 # zipcodes with neighbourhood information
│   └── zillow_raw_2021.csv                               # primary data source
├── imports_and_functions                                 # local package
│   ├── __init__.py
│   ├── functions.py                                      # custom functions
│   └── packages.py                                       # imports used in the notebook
├── model
│   ├── all_models_output.joblib                          # saved results
│   ├── fig_dash.py                                       # Dash dashboard
│   ├── ind_model                                         # saved individual models by zipcode
│   │   ├── 11004.joblib
│   │   ├── 11104.joblib
│   │   ├── 11369.joblib
│   │   ├── 11411.joblib
│   │   ├── 11412.joblib
│   │   ├── 11413.joblib
│   │   ├── 11414.joblib
│   │   ├── 11417.joblib
│   │   ├── 11419.joblib
│   │   ├── 11420.joblib
│   │   ├── 11422.joblib
│   │   ├── 11423.joblib
│   │   ├── 11426.joblib
│   │   ├── 11427.joblib
│   │   ├── 11428.joblib
│   │   ├── 11429.joblib
│   │   ├── 11433.joblib
│   │   ├── 11434.joblib
│   │   ├── 11435.joblib
│   │   ├── 11436.joblib
│   │   ├── 11691.joblib
│   │   ├── 11692.joblib
│   │   ├── 11693.joblib
│   │   └── 11694.joblib
│   ├── results_upd.joblib                                # updated all_models_output
│   ├── roi.joblib                                        # ROI information
│   ├── roi_upd.joblib                                    # updated ROI information
│   ├── ts.joblib                                         # cleaned and processed time series
│   └── viz.joblib
├── presentation.pdf                                      # presentation file
├── presentation.pptx                                     # presentation file
├── analysis.ipynb                                         # Main notebook used
└── analysis_55_zipcodes.ipynb                             # additional models

For additional info, contact me via LinkedIn.
