Regression Project

This project works through a regression predictive modeling case study in Python, covering each step of the applied machine learning process. Some steps of the process:

  • How to use data transforms to improve model performance.
  • How to use algorithm tuning to improve model performance.
  • How to use ensemble methods and tuning of ensemble methods to improve model performance (a minimal sketch of these ideas follows this list).
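
As a rough illustration of the first two points, here is a minimal sketch, assuming scikit-learn is installed, with placeholder arrays X and y standing in for real data. It chains a data transform and a model into one pipeline and tunes a hyperparameter with cross-validated grid search:

```python
# Minimal sketch: a data transform plus algorithm tuning in one pipeline.
# X and y are hypothetical placeholders for a numeric feature matrix and target.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(7)
X = rng.random((100, 13))   # placeholder features
y = rng.random(100)         # placeholder target

# Chaining the scaler and the model keeps the transform inside each CV fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Algorithm tuning: search over the number of neighbors.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9]}
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
grid = GridSearchCV(pipeline, param_grid, scoring="neg_mean_squared_error", cv=kfold)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```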

Problem Definition

For this project, the Boston House Price dataset was investigated. Each record in the dataset describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970.

The attributes are defined as follows (taken from the dataset documentation at http://lib.stat.cmu.edu/datasets/boston):

  1. CRIM: per capita crime rate by town
  2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS: proportion of non-retail business acres per town
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX: nitric oxides concentration (parts per 10 million)
  6. RM: average number of rooms per dwelling
  7. AGE: proportion of owner-occupied units built prior to 1940
  8. DIS: weighted distances to five Boston employment centers
  9. RAD: index of accessibility to radial highways
  10. TAX: full-value property-tax rate per $10,000
  11. PTRATIO: pupil-teacher ratio by town
  12. B: 1000(Bk − 0.63)² where Bk is the proportion of blacks by town
  13. LSTAT: % lower status of the population
  14. MEDV: Median value of owner-occupied homes in $1000s
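
A loading step consistent with the attribute list above might look like the sketch below; the file name housing.csv is an assumption, and the whitespace-delimited, headerless format matches the raw data file behind the link above:

```python
# Sketch of loading the dataset with the 14 attribute names listed above.
# 'housing.csv' is an assumed local file name; the raw data file is
# whitespace-delimited with no header row.
import pandas as pd

names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
         "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
dataset = pd.read_csv("housing.csv", sep=r"\s+", names=names)
print(dataset.shape)   # expected: (506, 14)
print(dataset.head())
```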

Summary of the Jupyter Notebook

  • Problem Definition (Boston house price data).
  • Loading the Dataset.
  • Analyze Data (some skewed distributions and correlated attributes).
  • Evaluate Algorithms (Linear Regression looked good).
  • Evaluate Algorithms with Standardization (KNN looked good).
  • Algorithm Tuning (K=3 for KNN was best).
  • Ensemble Methods (Bagging and Boosting, Gradient Boosting looked good).
  • Tuning Ensemble Methods (getting the most from Gradient Boosting).
  • Finalize Model (use all training data and confirm using validation dataset).
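
To make the later steps concrete, the condensed sketch below tunes a standardized KNN model, tunes a Gradient Boosting ensemble, and confirms the better of the two on a held-out validation set. This is an illustration rather than the notebook's exact code: the file name housing.csv, the train/validation split ratio, and the parameter grids are assumptions.

```python
# Condensed sketch of the later notebook steps: standardized KNN tuning,
# Gradient Boosting tuning, and a final check on a hold-out validation set.
# The split ratio and parameter grids are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
         "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
dataset = pd.read_csv("housing.csv", sep=r"\s+", names=names)  # see loading sketch above
X, y = dataset.drop(columns="MEDV").values, dataset["MEDV"].values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=7)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Standardized KNN with K tuned by grid search (the notebook found K=3 best).
knn = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])
knn_grid = GridSearchCV(knn, {"knn__n_neighbors": [1, 3, 5, 7, 9]},
                        scoring="neg_mean_squared_error", cv=kfold).fit(X_train, y_train)

# Gradient Boosting with the number of boosting stages tuned.
gbm_grid = GridSearchCV(GradientBoostingRegressor(random_state=7),
                        {"n_estimators": [50, 100, 200, 400]},
                        scoring="neg_mean_squared_error", cv=kfold).fit(X_train, y_train)

# Finalize: GridSearchCV refits the best configuration on all training data by
# default, so the better estimator can be confirmed directly on the validation set.
best = max((knn_grid, gbm_grid), key=lambda g: g.best_score_).best_estimator_
print("Validation MSE:", mean_squared_error(y_val, best.predict(X_val)))
```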