
Understanding the engineered features in AutoGluon. #163

Open
vinay-k12 opened this issue Feb 28, 2023 · 6 comments


@vinay-k12

I was using interpretable models in AutoGluon. While the model training was easy, the challenge is in understanding the rules: they are created from engineered features, and we have no visibility into the feature engineering. For example, this rule was created when I was running on the LendingClub data.

[image: screenshot of a rule generated on the engineered features]

There is no such value as '11' in 'emp_title'. So how do we reverse-transform the value '11' back to the original data?

@mglowacki100

I'm not sure if autogluon creates those names, but if you're looking for a quick fix, you need to one-hot encode the categorical variables yourself:

import pandas as pd

def dummification(df, col):
  # One-hot encode one column and replace it in the dataframe.
  dfz = pd.get_dummies(df[col], prefix=col)
  df = df.drop(columns=[col])
  return pd.concat([df, dfz], axis=1)

...
train_data = train_data.drop(columns='education-num')  # 'education-num' is just 'education' ordinally encoded
categorical = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']

for c in categorical:
  train_data = dummification(train_data, c)

With this:

predictor.print_interpretable_rules(model_name='RuleFit_3')
                                                                                                                                                                          rule  coef
                                                                                                                                                                  capital-gain  0.00
                                                                                                                      capital-gain <= 6571.5 and education_ Prof-school <= 0.5 -0.47
                                                                                                                 capital-gain <= 7073.5 and occupation_ Exec-managerial <= 0.5 -0.44
                                                                                               fnlwgt <= 260314.5 and capital-gain <= 7073.5 and education_ Prof-school <= 0.5 -0.19
                                                                                  capital-gain <= 7268.5 and education_ Bachelors <= 0.5 and occupation_ Prof-specialty <= 0.5 -0.36
                                               capital-gain <= 6571.5 and education_ Bachelors <= 0.5 and occupation_ Prof-specialty <= 0.5 and workclass_ Self-emp-inc <= 0.5 -0.85
                                                                                                                                        age <= 42.5 and capital-gain <= 7073.5 -0.14
                                                                                                                                     age <= 38.5 and education_ Masters <= 0.5 -0.38
                                                                       capital-gain <= 7073.5 and marital-status_ Married-civ-spouse <= 0.5 and workclass_ Self-emp-inc <= 0.5 -0.37
                                                                                             age > 27.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 38.5  0.86
                                                                                 age <= 62.5 and age > 27.5 and marital-status_ Married-civ-spouse > 0.5 and race_ White > 0.5  0.47
                                                               age > 29.5 and education_ HS-grad <= 0.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 33.5  0.03
age > 33.5 and education_ 11th <= 0.5 and capital-gain <= 4782.0 and marital-status_ Married-civ-spouse > 0.5 and occupation_ Farming-fishing <= 0.5 and hours-per-week > 37.5  0.07
                                                                                             age > 42.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 28.5  0.25
              age <= 52.0 and age > 27.5 and fnlwgt > 134350.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 32.5 and occupation_ Machine-op-inspct <= 0.5  0.64
     fnlwgt > 104201.0 and capital-gain <= 7268.5 and marital-status_ Married-civ-spouse > 0.5 and hours-per-week > 35.5 and workclass_ ? <= 0.5 and workclass_ Private <= 0.5  0.17

where, for one-hot-encoded categorical columns, > 0.5 means True and <= 0.5 means False.
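
If you want those rules in plain English, the one-hot names can be mechanically rewritten into equality tests. A small hypothetical helper (it relies only on the 'column_ value' naming convention visible in the output above):

def humanize_rule(rule: str) -> str:
  # Rewrite one-hot thresholds such as "education_ Prof-school <= 0.5"
  # as equality tests such as "education != Prof-school".
  parts = []
  for cond in rule.split(' and '):
    if '_ ' in cond and ('<= 0.5' in cond or '> 0.5' in cond):
      col, rest = cond.split('_ ', 1)
      value = rest.rsplit(' ', 2)[0]  # drop the comparator and the 0.5 threshold
      op = '!=' if '<= 0.5' in cond else '=='
      parts.append(f'{col.strip()} {op} {value}')
    else:
      parts.append(cond.strip())  # numeric condition, keep unchanged
  return ' and '.join(parts)

print(humanize_rule('age > 42.5 and marital-status_ Married-civ-spouse > 0.5'))
# -> age > 42.5 and marital-status == Married-civ-spouse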

@csinva (Owner) commented Mar 2, 2023

Thanks @mglowacki100! I agree, I think one-hot encoding is the best way to go for now.

That feature engineering is performed by autogluon, not imodels. There isn't currently support for inverse-transforming back to the original features, but we will try to add it soon!

@vinay-k12 (Author)

I thought of that, but figured it would hugely increase training time. Anyway, I'll run it on a limited set of features.

@mglowacki100

Hi @csinva, I see you're an autogluon contributor, so two additional things regarding the interpretable models:

  • GreedyTree and `HierarchicalShrinkageTree` display feature_1, feature_2, ... (there is a warning: "X has feature names but ... was fitted without feature names"). I'm not 100% sure, but it seems to me that feature_1 is the first column in the training dataframe, feature_2 the second, and so on (see the renaming sketch below this list).
  • BoostedRules doesn't display rules.
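
If that guess about the naming is right, a quick check is to build the rename map yourself. A rough, unverified sketch (it assumes feature_1 really is the first training column, which is only a guess):

# Map the generic names back to the training dataframe's columns (1-based guess).
name_map = {f'feature_{i + 1}': col for i, col in enumerate(train_data.columns)}

rule = 'feature_1 <= 0.5 and feature_12 > 42.5'  # example shape of the printed output
# Replace longer names first so 'feature_1' doesn't clobber 'feature_12'.
for generic in sorted(name_map, key=len, reverse=True):
  rule = rule.replace(generic, name_map[generic])
print(rule)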

@Innixma commented May 25, 2023

This is a tricky situation: I don't think it is possible for categorical feature rules to display meaningful information in a low-split-count model without one-hot encoding them, since we use label encoding, where a tree-model split is nearly impossible to interpret. However, you probably pay a large performance and accuracy penalty by one-hot encoding.
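
For illustration, that label encoding behaves like pandas category codes. A minimal sketch (hypothetical values) of why a split like 'emp_title <= 11' is so hard to read back:

import pandas as pd

s = pd.Series(['teacher', 'nurse', 'engineer'], dtype='category')
print(s.cat.codes.tolist())    # [2, 1, 0] -- integer codes, ordered alphabetically
print(list(s.cat.categories))  # ['engineer', 'nurse', 'teacher'] -- code -> category
# A split such as "emp_title <= 11" lumps together the first twelve codes,
# i.e. the first twelve job titles in alphabetical order -- rarely meaningful.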

@csinva, in autogluon/autogluon#2981 I am moving the interpretable logic into its own class called InterpretableTabularPredictor, where I disable models like the weighted ensemble and post-hoc calibration that would corrupt the interpretable aspects of the models. One option would be to implement a custom feature generator that includes a one-hot-encoding stage for all categoricals. I'm unsure whether this would be a satisfying solution, so I'd like to hear your thoughts.
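
As a starting point for that option, here is a rough, unverified sketch of such a generator. The hook names follow AutoGluon's custom feature generator tutorial; exact signatures and the type-metadata handling may differ between versions:

import pandas as pd
from autogluon.features.generators import AbstractFeatureGenerator

class OneHotFeatureGenerator(AbstractFeatureGenerator):
  # Hypothetical generator: one-hot encode everything it is given so that
  # downstream rule models keep readable 'column_value' feature names.
  _dummy_columns = None

  def _fit_transform(self, X: pd.DataFrame, **kwargs):
    X_out = self._transform(X)
    self._dummy_columns = list(X_out.columns)
    return X_out, dict()

  def _transform(self, X: pd.DataFrame) -> pd.DataFrame:
    X_out = pd.get_dummies(X)
    if self._dummy_columns is not None:
      # Align inference-time columns with those seen during fit.
      X_out = X_out.reindex(columns=self._dummy_columns, fill_value=0)
    return X_out

  @staticmethod
  def get_default_infer_features_in_args() -> dict:
    return dict()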

@csinva (Owner) commented May 25, 2023

Thanks, I think one-hot encoding categorical variables is a decent solution, as it should at least preserve interpretability.
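
Putting the thread together, a minimal end-to-end sketch of the agreed workaround (assumptions: the Adult census data from the AutoGluon tutorials, label column 'class', and an AutoGluon version that supports the interpretable presets):

import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
train_data = train_data.drop(columns=['education-num'])  # ordinal duplicate of 'education'

# One-hot encode categoricals up front so rule conditions keep readable names.
categorical = ['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race', 'sex', 'native-country']
train_data = pd.get_dummies(train_data, columns=categorical)

predictor = TabularPredictor(label='class').fit(train_data, presets='interpretable')
predictor.print_interpretable_rules()  # rules now reference one-hot column names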
