Machine Learning: Predicting housing prices using an advanced regression technique (Random Forest)
Tools: Python (NumPy, Pandas, Seaborn, Matplotlib, scikit-learn)
Data Source: Kaggle (training/test data)
CONTENT
- Exploratory Data Analysis
- Data Cleaning and Feature Selection
- Machine Learning Model Building / Training
- Prediction / Accuracy
EXPLORATORY DATA ANALYSIS
Before training and testing a machine learning model, it is important to understand the data to be used. This is the purpose of exploratory data analysis. The training dataset consists of 79 explanatory variables and 1 target variable (SalePrice) describing every aspect of residential homes in Ames, Iowa. Through careful examination and preprocessing, relevant features are selected and used to train a model that predicts the final selling price of a home.
I separated the variables into categorical and numerical variables so that each group could be analyzed with the appropriate statistical method. A total of 34 numerical and 43 categorical features were classified. Categorical variables were converted to numerical variables using label encoding for easier processing.
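The split and the encoding described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the four-row DataFrame is a made-up stand-in for the Kaggle training data, and using pandas category codes for label encoding is an assumption about how the encoding was done.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Kaggle training data.
df = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717],
    "OverallQual": [7, 6, 7, 7],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr", "Crawfor"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Split the explanatory columns by dtype.
features = df.drop(columns="SalePrice")
numerical_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Label encoding: each distinct category is mapped to an integer code
# (categories are ordered alphabetically here).
encoded = df.copy()
for col in categorical_cols:
    encoded[col] = encoded[col].astype("category").cat.codes

print(numerical_cols)    # numeric feature names
print(categorical_cols)  # categorical feature names
```

Label encoding imposes an arbitrary order on the categories. Tree-based models such as Random Forest tolerate this reasonably well, which is one reason it is a common choice in this setting.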
MISSING VALUES
Missing values were identified in the dataset. For the categorical features, missing values were handled by assigning 0 to null values. For the numerical variables, missing values were replaced with the mean of the corresponding column.
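A minimal sketch of this imputation is shown below. The column names and values are illustrative only, and shifting the category codes by one so that missing entries land on 0 is an assumption about how the "0 for null" rule was combined with label encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value per column.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 95.0],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
})
numerical_cols = ["LotFrontage"]
categorical_cols = ["GarageType"]

# Numerical columns: replace missing values with the column mean.
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].mean())

# Categorical columns: pandas encodes missing values as -1, so adding 1
# maps missing values to 0 and real categories to 1, 2, ...
for col in categorical_cols:
    codes = df[col].astype("category").cat.codes
    df[col] = codes + 1

print(df)
```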
STATISTICAL ANALYSIS AND FEATURE SELECTION
In order to select the relevant features for prediction, it is important to identify features that have a strong correlation with housing sale price. For the numerical variables, this was done using a Pearson correlation heatmap.
Using the correlation heatmap above to visualize the correlation between the numerical features and sale price, I observed 10 numerical variables with a correlation of at least 0.5 with housing sale price.
Features | Correlation |
---|---|
OverallQual | 0.790982 |
GrLivArea | 0.708624 |
GarageCars | 0.640409 |
GarageArea | 0.623431 |
TotalBsmtSF | 0.613581 |
1stFlrSF | 0.605852 |
FullBath | 0.560664 |
TotRmsAbvGrd | 0.533723 |
YearBuilt | 0.522897 |
YearRemodAdd | 0.507101 |
The overall material and finish quality of the house (OverallQual) has the highest correlation. This makes a lot of sense, because houses with higher-quality finishes cost more. The feature with the next highest correlation is the above-ground living area (GrLivArea). The scatter plot of OverallQual shows sale price increasing roughly linearly with quality.
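The correlation-based selection can be reproduced along these lines. The data here is synthetic (a quality score that drives price, plus an unrelated noise column named MiscVal for illustration), so the numbers will not match the table above; only the 0.5 threshold comes from the text.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on quality, MiscVal is pure noise.
rng = np.random.default_rng(0)
n = 200
quality = rng.integers(1, 11, size=n)
df = pd.DataFrame({
    "OverallQual": quality,
    "MiscVal": rng.normal(size=n),
    "SalePrice": 20000 * quality + rng.normal(scale=10000, size=n),
})

# Pearson correlation of every numerical feature with the target.
corr = df.corr(method="pearson")["SalePrice"].drop("SalePrice")

# Keep features whose correlation with SalePrice is at least 0.5.
selected = corr[corr >= 0.5].sort_values(ascending=False)
print(selected)
```

Passing the full matrix from `df.corr()` to `seaborn.heatmap` would draw the kind of heatmap referred to in this section.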
For the categorical variables, an ANOVA test was carried out to check whether sale price differs significantly between the categories of each variable.
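A one-way ANOVA for a single categorical feature might look like the sketch below. It uses `scipy.stats.f_oneway`, which is an assumption since SciPy is not in the tool list above, and the nine-row sample and the 0.05 significance level are illustrative choices.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical sample: three neighborhoods with clearly different prices.
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "SalePrice": [100000, 110000, 105000,
                  200000, 210000, 190000,
                  150000, 155000, 145000],
})

# One group of sale prices per category of the feature.
groups = [g["SalePrice"].to_numpy() for _, g in df.groupby("Neighborhood")]

# One-way ANOVA: tests whether the group means are all equal.
f_stat, p_value = f_oneway(*groups)

# A small p-value suggests the feature is related to sale price.
significant = p_value < 0.05
print(f_stat, p_value, significant)
```

Repeating this for every categorical column and ranking by p-value gives a simple way to shortlist categorical features.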
OUTLIERS
An outlier is a point in the dataset that lies far from the other observations. I used scatter plots to visualize the outliers. In the scatter plot of GrLivArea below, the outliers are the points with a very large GrLivArea but an unusually low sale price. This appears to be an anomaly.
Using the standard deviation method, I removed observations that fall more than 3 standard deviations from the mean of the feature variable. The table below shows the number of outlier and non-outlier observations for the numerical features.
Features | Outliers | Non Outliers |
---|---|---|
OverallQual | 2 | 1458 |
GrLivArea | 16 | 1444 |
GarageCars | 0 | 1460 |
GarageArea | 7 | 1453 |
TotalBsmtSF | 10 | 1450 |
1stFlrSF | 12 | 1448 |
FullBath | 0 | 1460 |
TotRmsAbvGrd | 12 | 1448 |
YearBuilt | 6 | 1454 |
YearRemodAdd | 0 | 1460 |
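The standard deviation method behind the table above can be sketched as follows. The helper name `outlier_mask` and the sample values are hypothetical; only the 3-standard-deviation rule comes from the text.

```python
import pandas as pd

# Hypothetical feature: 50 ordinary values plus one extreme value.
values = list(range(1000, 2000, 20)) + [10000]
df = pd.DataFrame({"GrLivArea": values})

def outlier_mask(series, n_std=3.0):
    """Flag values more than n_std standard deviations from the mean."""
    mean, std = series.mean(), series.std()
    return (series - mean).abs() > n_std * std

mask = outlier_mask(df["GrLivArea"])

# Counts corresponding to the "Outliers" and "Non Outliers" columns.
n_outliers = int(mask.sum())
n_non_outliers = int((~mask).sum())
print(n_outliers, n_non_outliers)

# Keep only the rows that are not flagged.
cleaned = df[~mask]
```

Note that extreme values inflate the standard deviation themselves, so this rule is conservative on small samples.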
MODEL TRAINING AND PREDICTION
After cleaning the dataset and selecting the relevant features, it is time to train the model using scikit-learn's Random Forest regressor.
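A minimal training and prediction sketch with `RandomForestRegressor` is given below. The data is synthetic, and the hyperparameters (`n_estimators=100`, `random_state=0`) are assumptions, since the settings actually used are not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the selected features and sale prices.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(300, 3))
y_train = (50000 + 20000 * X_train[:, 0] + 5000 * X_train[:, 1]
           + rng.normal(scale=2000, size=300))
X_test = rng.uniform(0, 10, size=(6, 3))

# Fit the forest on the training data.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict sale prices for unseen rows.
predictions = model.predict(X_test)
print(predictions.round())
```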
The table below shows the predicted sale price for the first 6 rows of the test data.
ID | SalePrice |
---|---|
1461 | 122770 |
1462 | 145219 |
1463 | 169063 |
1464 | 182893 |
1465 | 219604 |
1466 | 179108 |
Root Mean Square Error: 1354.56
Accuracy: 0.15982
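For reference, error metrics of this kind are typically computed by comparing predictions with known prices on held-out data. The sketch below uses three made-up price pairs and computes the RMSE together with the R² score; which metric the "Accuracy" figure above corresponds to is not stated, so R² is shown only as one common choice.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted sale prices for a validation set.
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 290000.0])

# Root mean square error, in the same units as the sale price.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Coefficient of determination: 1.0 would be a perfect fit.
r2 = r2_score(y_true, y_pred)

print(rmse, r2)
```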