
Understanding the engineered features in AutoGluon. #163

Open
vinay-k12 opened this issue Feb 28, 2023 · 6 comments


@vinay-k12

I was using interpretable models in AutoGluon. While the model training was easy, the challenge is in understanding the rules: they are created from engineered features, and we have no visibility into the feature engineering. For example, this rule was created when I was running on the LendingClub data.

[image: screenshot of a rule generated on the engineered features]

There is no such value as '11' in 'emp_title'. So how do we reverse-transform the value '11' back to the original data?

@mglowacki100

I'm not sure if autogluon creates those names, but if you're looking for a quick fix, you need to one-hot encode the categorical variables yourself:

import pandas as pd

def dummification(df, col):
  # One-hot encode one column and replace it in the dataframe.
  dfz = pd.get_dummies(df[col], prefix=col)
  df = df.drop(columns=[col])
  return pd.concat([df, dfz], axis=1)

...
train_data = train_data.drop(columns='education-num')  # 'education-num' is just 'education' ordinally encoded
categorical = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

for c in categorical:
  train_data = dummification(train_data, c)

With this:

predictor.print_interpretable_rules(model_name='RuleFit_3')
                                                                                                                                                                          rule  coef
                                                                                                                                                                  capital-gain  0.00
                                                                                                                      capital-gain <= 6571.5 and education_ Prof-school <= 0.5 -0.47
                                                                                                                 capital-gain <= 7073.5 and occupation_ Exec-managerial <= 0.5 -0.44
                                                                                               fnlwgt <= 260314.5 and capital-gain <= 7073.5 and education_ Prof-school <= 0.5 -0.19
                                                                                  capital-gain <= 7268.5 and education_ Bachelors <= 0.5 and occupation_ Prof-specialty <= 0.5 -0.36
                                               capital-gain <= 6571.5 and education_ Bachelors <= 0.5 and occupation_ Prof-specialty <= 0.5 and workclass_ Self-emp-inc <= 0.5 -0.85
                                                                                                                                        age <= 42.5 and capital-gain <= 7073.5 -0.14
                                                                                                                                     age <= 38.5 and education_ Masters <= 0.5 -0.38
                                                                       capital-gain <= 7073.5 and marital-status_ Married-civ-spouse <= 0.5 and workclass_ Self-emp-inc <= 0.5 -0.37
                                                                                             age > 27.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 38.5  0.86
                                                                                 age <= 62.5 and age > 27.5 and marital-status_ Married-civ-spouse > 0.5 and race_ White > 0.5  0.47
                                                               age > 29.5 and education_ HS-grad <= 0.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 33.5  0.03
age > 33.5 and education_ 11th <= 0.5 and capital-gain <= 4782.0 and marital-status_ Married-civ-spouse > 0.5 and occupation_ Farming-fishing <= 0.5 and hours-per-week > 37.5  0.07
                                                                                             age > 42.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 28.5  0.25
              age <= 52.0 and age > 27.5 and fnlwgt > 134350.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 32.5 and occupation_ Machine-op-inspct <= 0.5  0.64
     fnlwgt > 104201.0 and capital-gain <= 7268.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 35.5 and workclass_ ? <= 0.5 and workclass_ Private <= 0.5  0.17

where, for one-hot-encoded categorical columns, > 0.5 means True and <= 0.5 means False.
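
If you want those rules in plain English, the one-hot names can be mechanically rewritten into equality tests. A small hypothetical helper (it relies only on the 'column_ value' naming convention visible in the output above):

def humanize_rule(rule: str) -> str:
  # Rewrite one-hot thresholds such as "education_ Prof-school <= 0.5"
  # as equality tests such as "education != Prof-school".
  parts = []
  for cond in rule.split(' and '):
    if '_ ' in cond and ('<= 0.5' in cond or '> 0.5' in cond):
      col, rest = cond.split('_ ', 1)
      value = rest.rsplit(' ', 2)[0]  # drop the comparator and the 0.5 threshold
      op = '!=' if '<= 0.5' in cond else '=='
      parts.append(f'{col.strip()} {op} {value}')
    else:
      parts.append(cond.strip())  # numeric condition, keep unchanged
  return ' and '.join(parts)

print(humanize_rule('age > 42.5 and marital-status_ Married-civ-spouse > 0.5'))
# -> age > 42.5 and marital-status == Married-civ-spouse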

@csinva (Owner) commented Mar 2, 2023

Thanks @mglowacki100! I agree, I think one-hot encoding is the best way to go for now.

That feature engineering is performed by autogluon, not imodels. There isn't currently support for inverse-transforming back to the original features, but we will try to add it soon!

@vinay-k12 (Author)

I thought of that, but figured it would hugely increase training time. Anyway, I'll run it on a limited set of features.

@mglowacki100

Hi @csinva, I see you're an autogluon contributor, so two additional things regarding the interpretable models:

  • GreedyTree and `HierarchicalShrinkageTree` display feature_1, feature_2, ... (there is a warning: "X has feature names but ... was fitted without feature names"). I'm not 100% sure, but it seems to me that feature_1 is the first column in the training dataframe, feature_2 the second, and so on (see the renaming sketch below this list).
  • BoostedRules doesn't display rules.
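
If that guess about the naming is right, a quick check is to build the rename map yourself. A rough, unverified sketch (it assumes feature_1 really is the first training column, which is only a guess):

# Map the generic names back to the training dataframe's columns (1-based guess).
name_map = {f'feature_{i + 1}': col for i, col in enumerate(train_data.columns)}

rule = 'feature_1 <= 0.5 and feature_12 > 42.5'  # example shape of the printed output
# Replace longer names first so 'feature_1' doesn't clobber 'feature_12'.
for generic in sorted(name_map, key=len, reverse=True):
  rule = rule.replace(generic, name_map[generic])
print(rule)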

@Innixma commented May 25, 2023

This is a tricky situation: I don't think it is possible for categorical feature rules to display meaningful information in a low-split-count model without one-hot encoding them, since we use label encoding, where a tree-model split is nearly impossible to interpret. However, you probably pay a large performance and accuracy penalty by one-hot encoding.
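
For illustration, that label encoding behaves like pandas category codes. A minimal sketch (hypothetical values) of why a split like 'emp_title <= 11' is so hard to read back:

import pandas as pd

s = pd.Series(['teacher', 'nurse', 'engineer'], dtype='category')
print(s.cat.codes.tolist())    # [2, 1, 0] -- integer codes, ordered alphabetically
print(list(s.cat.categories))  # ['engineer', 'nurse', 'teacher'] -- code -> category
# A split such as "emp_title <= 11" lumps together the first twelve codes,
# i.e. the first twelve job titles in alphabetical order -- rarely meaningful.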

@csinva, in autogluon/autogluon#2981 I am moving the interpretable logic into its own class called InterpretableTabularPredictor, where I disable models like the weighted ensemble and post-hoc calibration that would corrupt the interpretable aspects of the models. One option would be to implement a custom feature generator that includes a one-hot-encoding stage for all categoricals. I'm unsure whether this would be a satisfying solution, so I'd like to hear your thoughts.
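
As a starting point for that option, here is a rough, unverified sketch of such a generator. The hook names follow AutoGluon's custom feature generator tutorial; exact signatures and the type-metadata handling may differ between versions:

import pandas as pd
from autogluon.features.generators import AbstractFeatureGenerator

class OneHotFeatureGenerator(AbstractFeatureGenerator):
  # Hypothetical generator: one-hot encode everything it is given so that
  # downstream rule models keep readable 'column_value' feature names.
  _dummy_columns = None

  def _fit_transform(self, X: pd.DataFrame, **kwargs):
    X_out = self._transform(X)
    self._dummy_columns = list(X_out.columns)
    return X_out, dict()

  def _transform(self, X: pd.DataFrame) -> pd.DataFrame:
    X_out = pd.get_dummies(X)
    if self._dummy_columns is not None:
      # Align inference-time columns with those seen during fit.
      X_out = X_out.reindex(columns=self._dummy_columns, fill_value=0)
    return X_out

  @staticmethod
  def get_default_infer_features_in_args() -> dict:
    return dict()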

@csinva (Owner) commented May 25, 2023

Thanks, I think one-hot encoding categorical variables is a decent solution, as it should at least preserve interpretability.
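
Putting the thread together, a minimal end-to-end sketch of the agreed workaround (assumptions: the Adult census data from the AutoGluon tutorials, label column 'class', and an AutoGluon version that supports the interpretable presets):

import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
train_data = train_data.drop(columns=['education-num'])  # ordinal duplicate of 'education'

# One-hot encode categoricals up front so rule conditions keep readable names.
categorical = ['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'native-country']
train_data = pd.get_dummies(train_data, columns=categorical)

predictor = TabularPredictor(label='class').fit(train_data, presets='interpretable')
predictor.print_interpretable_rules()  # rules now reference one-hot column names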
