Imputation of missing values using ML models. #477

Open
vijayphugat opened this issue Sep 6, 2018 · 3 comments
Comments

@vijayphugat
Contributor

The current package imputes missing values using the mean and median.

I have identified an approach that applies machine learning models to impute the missing values:

Existing approach:
Impute missing values using Mean/Median
Drawbacks:

  1. It reduces the variability in the data.
  2. It does not preserve relationships between variables, such as correlations.
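
To make the two drawbacks concrete, here is a minimal sketch on made-up toy data (the values, the choice of missing positions, and the `pearson` helper are all assumptions for illustration, not part of the package). It fills three missing values with the mean of the observed ones and compares variance and correlation before and after.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Toy data (made up): y is exactly 2 * x, so the correlation is 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_true = [2.0 * v for v in x]

# Pretend the y values at these positions are missing.
missing = {1, 4, 6}
observed = [v for i, v in enumerate(y_true) if i not in missing]

# Mean imputation: every missing value gets the same number.
fill = mean(observed)
y_mean_imputed = [fill if i in missing else v for i, v in enumerate(y_true)]

var_before = pvariance(y_true)
var_after = pvariance(y_mean_imputed)
corr_before = pearson(x, y_true)
corr_after = pearson(x, y_mean_imputed)

print(f"variance:    {var_before:.2f} -> {var_after:.2f}")
print(f"correlation: {corr_before:.3f} -> {corr_after:.3f}")
```

On this toy data both numbers drop after mean imputation. How large the effect is on real data depends on how many values are missing and why.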

New approach:
Impute missing values as follows:

  • For numeric features --> linear regression model
  • For categorical features --> classification model

Advantages:

  1. It will preserve the original relationships between variables.
  2. It will maintain the original variability in the data.

So for datasets with a large number of missing values, this approach can improve the overall quality of the data fed to ML algorithms.
Thus the performance of an existing model can be improved using this imputation strategy.
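
A minimal sketch of the numeric case described above, assuming a single fully observed predictor column. The column names (`age`, `bp`), the toy rows, and the helpers `fit_line` and `impute_numeric` are hypothetical and not part of the package; the categorical case would follow the same pattern with a classifier in place of the regression.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def impute_numeric(rows, predictor, target):
    """Fill missing (None) target values using a line fitted on complete rows."""
    complete = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line(
        [r[predictor] for r in complete], [r[target] for r in complete]
    )
    return [
        {**r, target: slope * r[predictor] + intercept}
        if r[target] is None
        else dict(r)
        for r in rows
    ]


# Toy rows (made up): bp is missing for two patients.
rows = [
    {"age": 30, "bp": 110.0},
    {"age": 40, "bp": 120.0},
    {"age": 50, "bp": None},
    {"age": 60, "bp": 140.0},
    {"age": 70, "bp": None},
]

filled = impute_numeric(rows, "age", "bp")
for row in filled:
    print(row)
```

Unlike mean imputation, each filled value depends on that row's predictor, so the relationship between the two columns is carried over to the imputed rows.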

@SameerMahajan-GSLab
Contributor

@levithatcher @taylorlarsen and @mmastand do you have any input on this? Otherwise we will submit a PR for it soon.

@mmastand
Contributor

mmastand commented Sep 7, 2018

We have had good luck using the following in our other work:

  • Numeric:
    • Mean
    • Bagged trees imputation
    • Assign to an outlying number such as -999 and then use a tree-based method.
  • Categorical:
    • New category, missing
    • Bagged trees

What do you think about these methods? We'd be grateful if you wanted to do a PR!
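
The two simplest options in the list above (an outlying sentinel for numeric features and a new "missing" category for categorical ones) could be sketched as follows. The function names and toy values are hypothetical; only the `-999` sentinel and the `missing` label come from the comment.

```python
SENTINEL = -999            # outlying number suggested in the comment
MISSING_LABEL = "missing"  # new category suggested in the comment


def fill_numeric_with_sentinel(values, sentinel=SENTINEL):
    """Replace missing (None) numeric values with an outlying sentinel."""
    return [sentinel if v is None else v for v in values]


def fill_categorical_with_label(values, label=MISSING_LABEL):
    """Replace missing (None) categorical values with a dedicated category."""
    return [label if v is None else v for v in values]


# Toy columns (made up).
ages = [34, None, 51, None]
smoker = ["yes", None, "no", "no"]

ages_filled = fill_numeric_with_sentinel(ages)
smoker_filled = fill_categorical_with_label(smoker)

print(ages_filled)
print(smoker_filled)
```

The sentinel only makes sense together with a tree-based model, as the comment notes: a tree can split the sentinel off into its own branch, whereas a linear model would treat `-999` as a real, extreme measurement.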

@vijayphugat
Contributor Author

I want to open a PR for this, and I am currently using the techniques below:

  • For numeric variables: RandomForestRegressor
  • For categorical variables: RandomForestClassifier

Both of these methods work well on linear as well as non-linear data.
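
A rough sketch of how the two estimators could be used for imputation, assuming scikit-learn and NumPy are installed. The helper `impute_with_forest`, the synthetic data, and the parameter choices are illustrative assumptions; this is not the code from the eventual PR.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def impute_with_forest(X, y, missing_mask, categorical=False, random_state=0):
    """Fit on rows where y is observed, then predict y where it is missing.

    X must be fully observed; values of y under missing_mask are ignored.
    """
    model_cls = RandomForestClassifier if categorical else RandomForestRegressor
    model = model_cls(n_estimators=50, random_state=random_state)
    model.fit(X[~missing_mask], y[~missing_mask])
    filled = y.copy()
    filled[missing_mask] = model.predict(X[missing_mask])
    return filled


# Synthetic data (made up): a non-linear numeric target and a binary category.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y_num = X[:, 0] ** 2 + X[:, 1]
y_cat = np.where(X[:, 0] > 5, "high", "low")

# Mark every tenth row as missing in both targets.
mask = np.zeros(200, dtype=bool)
mask[::10] = True
y_num_obs = y_num.copy()
y_num_obs[mask] = np.nan
y_cat_obs = y_cat.copy()
y_cat_obs[mask] = "?"  # placeholder standing in for a missing label

num_filled = impute_with_forest(X, y_num_obs, mask)
cat_filled = impute_with_forest(X, y_cat_obs, mask, categorical=True)

print(num_filled[mask][:5])
print(cat_filled[mask][:5])
```

In a real dataset the predictor columns usually contain missing values too, so they would need a simple fill (or an iterative scheme) before a model like this can be fitted.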

mmastand added a commit that referenced this issue Nov 6, 2018
Imputation of missing values using ML models. (Enhancement and Bug fix opened in #477, #478)