PREDICTIVE ANALYSIS

Machine Learning: Predicting housing prices using advanced regression technique (Random Forest)

Tools: Python (Numpy, Pandas, Seaborn, Matplotlib, Scikit -learn)

Data Source: Kaggle (training/test data)

CONTENT

Exploratory Data Analysis
Data Cleaning and Feature Selection
Machine Learning Model Building / Training
Prediction / Accuracy

EXPLORATORY DATA ANALYSIS
Before training and testing a machine learning model, it is important to understand the data to be used. This is the purpose of exploratory data analysis. The training dataset consist of 79 explanatory variables and 1 prediction variable (Sale Price) describing every aspect of redsidential homes in Ames, Iowa. By careful examination and preprecessing, relevant features will be selected and used to train a model to predict the final selling price of a home.

I seperated the variables into categorical variable and numerical variables for accurate statistical analysis. A total of 34 numerical and 43 categorical features where classified. Categorical variables where converted to numerical variables using label encoding for easy processing.

MISSING VALUES
Missing values where identified in the dataset. For the categorical features, missing values where addressed by assigning 0 to Null values. However, missing values for numerical variables where replace with the mean value.

STATISTICAL ANALYSIS AND FEATURE SELECTION
In other to select the relevant features for prediction, it is important to identity features that have a strong correlation with housing sale price. For the numerical variables, this was done by using a pearson correlation heatmap.

Visualizing the correlation between the numerical features and sales price using the correlation heatmap above, I observed 10 numerical variables with a high correlation of at least 0.5 with housing sale price.

Features	Correlation
OverallQual	0.790982
GrLivArea	0.708624
GarageCars	0.640409
GarageArea	0.623431
TotalBsmtSF	0.613581
1stFlrSF	0.605852
FullBath	0.560664
TotRmsAbvGrd	0.533723
YearBuilt	0.522897
YearRemodAdd	0.507101

Overall material and finish of the house has the highest correlation. This makes alot of sense because houses with higher quality finishes will cost more. The next feature with high correlation is the above ground living area. The scattered plot over OverallQual shows a linear in increase in quality with sales price.

For the categorical variables, the ANOVA test was carried out to study statistical significance of the variables with housing sales price.

OUTLIERS
An outlier is a point in the dataset that is distant from all other observations. I used a scatter plot to visualize the outliers. From the scattered plot of GrLivArea below, we can note the outlier which represents a significant decrease in sale price with increase in GrLivArea. This appears to be an anomaly.

Using the standard deviation method, I removed outliers that fall outside 3 standard deviations of the feature variable.The table below shows the number of outlier and non outlier odservations for the numerical features

Features	Outliers	Non Outliers
OverallQual	2	1458
GrLivArea	16	1444
GarageCars	0	1460
GarageArea	7	1453
TotalBsmtSF	10	1450
1stFlrSF	12	1448
FullBath	0	1460
TotRmsAbvGrd	12	1448
YearBuilt	6	1454
YearRemodAdd	6	1460

MODEL TRAINING AND PREDICTION
After cleaning the dataset and selecting relevant features, its time to train the model using sk-learn (Random Forest) The table below shows predicted sales price for the first 6 rows using the test data.

ID	SalePrice
1461	122770
1462	145219
1463	169063
1464	182893
1465	219604
1466	179108

Root Mean Square Error: 1354.56

Accurary: 0.15982

Name		Name	Last commit message	Last commit date
Latest commit History 91 Commits
CODES		CODES
Image		Image
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CODES

CODES

Image

Image

README.md

README.md

Repository files navigation

PREDICTIVE ANALYSIS

About

Releases

Packages

Contributors 2

Languages

PrinceIgweze/Predictive-Analysis

Folders and files

Latest commit

History

Repository files navigation

PREDICTIVE ANALYSIS

About

Topics

Resources

Stars

Watchers

Forks

Languages