
Automatically drop features tagged as "primary key" #1862

Closed
dsherry opened this issue Feb 18, 2021 · 6 comments · Fixed by #2138
Labels: new feature (Features which don't yet exist.)


dsherry commented Feb 18, 2021

There are plans to add a "primary key" semantic tag to woodwork, to indicate a column which can be used for joining multiple tables, in featuretools or elsewhere. Once that is added, we should update evalml to add a DropColumn component to remove column(s) tagged with "primary key".

dsherry added the new feature label Feb 18, 2021

dsherry commented Mar 19, 2021

@angela97lin and others: heads up, this isn't part of #1730, but if we add this before the data check actions stuff is in place, we should file another issue to track porting the implementation to use data check actions.

jeremyliweishih commented

My plan moving forward is to add a Drop Columns Transformer in _get_preprocessing_components when an index column exists, and then add those columns to self.pipeline_parameters. However, #2015 showcases a bug where we cannot specify list-valued hyperparameters in pipeline_parameters. If we go with this solution, #2015 will be blocking as well.
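For concreteness, a rough sketch of what that plan could look like from the caller's side (the data and the "id" column are made up, and this assumes the AutoMLSearch pipeline_parameters hook from that era; note that "columns" is exactly the list-valued parameter #2015 blocks):

```python
import pandas as pd
from evalml.automl import AutoMLSearch

X = pd.DataFrame({"id": [0, 1, 2, 3], "feat": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([0, 1, 0, 1])

# Hypothetical: route the index column(s) to the Drop Columns Transformer
# via pipeline_parameters. "columns" is the list-valued parameter that
# #2015 currently prevents us from specifying.
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    pipeline_parameters={"Drop Columns Transformer": {"columns": ["id"]}},
)
```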


jeremyliweishih commented Mar 26, 2021

@dsherry
After discussion, the EvalML asks are as follows:

  • Primary key will be provided as an input column. Additionally, for pandas, it will be set as the DF index.
  • Don’t pass “primary key” columns to the estimators.
  • Don’t run data checks on the index feature.
  • Don’t drop index for prediction explanations.

Important point: if a WW DT has a column with the index semantic tag, then after converting the DT to a DF, the column will remain in the DataFrame AND the DataFrame index will be set to the same values as the column that was specified as the index. This means the column will remain in the DataFrame, and it is unlikely that our transformers or estimators will be able to automatically filter it out. The options below only apply to WW DT inputs, since a pandas index is not an input column and will not become one even when converting a pandas DF to a WW DT.
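To make the described state concrete, a plain-pandas illustration (column names made up):

```python
import pandas as pd

df = pd.DataFrame({"id": [10, 11, 12], "feature": [0.1, 0.2, 0.3]})

# The state described above: the index column stays as a regular column AND
# also becomes the DataFrame index, so downstream components still see it.
df = df.set_index("id", drop=False)
print(list(df.columns))  # ['id', 'feature'] -- "id" was not filtered out
```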

Option 1: Add a drop column component at the beginning of a pipeline and an add column component at the end of the pipeline in AutoML. Will need to edit AutoML logic (creating pipelines etc.) to accommodate this. (A rough sketch follows the pros and cons below.)

  • Pros: Separates index handling from other components, index behavior only done when we want it, applies to prediction explanations
  • Cons: Does not apply to datachecks, will only apply to AutoML pipelines
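A rough sketch of the bookkeeping option 1 implies (helper names hypothetical):

```python
import pandas as pd

def drop_index_columns(X: pd.DataFrame, index_cols):
    """First pipeline step: remove the index columns, keeping a copy."""
    saved = X[list(index_cols)].copy()
    return X.drop(columns=list(index_cols)), saved

def add_index_columns(X: pd.DataFrame, saved: pd.DataFrame) -> pd.DataFrame:
    """Last pipeline step: re-attach the saved index columns."""
    return pd.concat([X, saved], axis=1)
```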

Option 2: Add and drop index columns in each individual component. This can be done in transform(), and to generalize it we can add a wrapper around transform() to ComponentBaseMeta. (A sketch follows the pros and cons below.)

  • Pros: EvalML pipelines will automatically handle index columns, applies to prediction explanations
  • Cons: Extra compute per component but difference should be slight, does not apply to datachecks
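A minimal sketch of the metaclass idea from option 2 (IndexHandlingMeta and the index_columns attribute are hypothetical stand-ins for what would actually live on ComponentBaseMeta):

```python
import pandas as pd

class IndexHandlingMeta(type):
    """Hypothetical metaclass: wraps a component's transform() so index
    columns are dropped on the way in and re-attached on the way out."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        if "transform" in namespace:
            cls.transform = mcs._handle_index(namespace["transform"])
        return cls

    @staticmethod
    def _handle_index(transform):
        def wrapped(self, X, y=None):
            # "index_columns" is an assumed attribute naming the tagged columns.
            index_cols = list(getattr(self, "index_columns", []))
            saved = X[index_cols]
            output = transform(self, X.drop(columns=index_cols), y)
            return pd.concat([output, saved], axis=1)

        return wrapped
```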

Option 3 (for datachecks): Add and drop index columns in each datacheck's validate() method; to generalize, we can add a new DataCheckMeta class with a wrapper around validate().

Overall we need to choose a combination of 1 + 3 or 2 + 3. I am a fan of 2 + 3, as option 2 turns index-column handling from an AutoML feature into a general EvalML feature (through components), and the use of wrappers and metaclasses will reduce the code changes needed now and spare future developers from writing boilerplate.


dsherry commented Mar 26, 2021

@jeremyliweishih thanks for posting this. Your summary of the requirements is excellent.

I agree with your point about DT/DF: we must filter index features out before we convert from ww to DF. It helps that under the new ww accessor model, that "conversion" is no longer necessary! #1965

I would split this problem into two parts. The first is pipeline evaluation and model understanding, where the requirements are "Don’t pass 'primary key' columns to the estimators" and "Don’t drop index for prediction explanations." Your options 1 and 2 apply to pipeline evaluation / understanding. The second is the data checks, where the requirement is "Don’t run data checks on the index feature." Option 3 applies to the data checks.

I have some ideas for the pipeline evaluation / understanding solution, ordered by my guess at their cost:
Option 4: add a drop component to the beginning of the pipeline which drops all index features. That's all. Simpler variant of option 1. Pred expls can access the index feature using the dataframe index. (A sketch follows option 6 below.)
Option 5: ensure that every component we support selects the specific feature types it supports before running. Components will therefore ignore index-tagged features, because those don't count as the numeric, categorical, natural language or datetime features our components currently operate on. Pred expls can access the index feature using the dataframe index.
Option 6: define separate pipeline paths for each feature type (#1728 ), numeric/categorical/natlang/datetime/index. Don't add any components to the index path, indicating that information is not passed to the rest of the DAG. Pred expls can access the index feature using the dataframe index.
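As a quick illustration of option 4, using the class-based pipeline definitions evalml has today (the component names are real; the pipeline class and the "id" column are made up for the example):

```python
from evalml.pipelines import BinaryClassificationPipeline

class DropIndexPipeline(BinaryClassificationPipeline):
    """Sketch of option 4: the only change is the prepended drop component."""
    component_graph = [
        "Drop Columns Transformer",  # drops all index-tagged features up front
        "Imputer",
        "Random Forest Classifier",
    ]

# "id" stands in for whatever column(s) carry the index / primary key tag.
pipeline = DropIndexPipeline(
    parameters={"Drop Columns Transformer": {"columns": ["id"]}}
)
```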

We want to do option 6 in the long-term. But in the short-term, I'm a fan of 4 or 5.

And for datachecks (unordered):
Option 7: add code to the DataChecks.validate collection method which first drops all index features, then calls each data check's validate method. (A sketch follows option 8 below.)
Option 8: have every data check select the particular feature types it is defined for before running the check. No data checks should include index, just numeric/categorical/natlang/datetime. The specification of which types the data check is allowed for could happen in the impl, or it could be a class attribute each data check is asked to set, defaulting to numeric/categorical/natlang/datetime.
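A minimal sketch of option 7 (the signature and the result aggregation are simplified stand-ins; the real DataChecks.validate would merge each check's warnings and errors):

```python
import pandas as pd

def validate(data_checks, X: pd.DataFrame, y=None, index_cols=()):
    """Sketch of option 7: drop index features once, up front, then
    delegate to each individual data check's validate()."""
    X_no_index = X.drop(columns=list(index_cols))
    results = []
    for check in data_checks:
        results.append(check.validate(X_no_index, y))
    return results
```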

7 is the easiest to implement and feels like the right short-term option. 8 may be interesting though; if we can think of other use-cases for limiting by feature type, that would be my choice.

Thoughts?

jeremyliweishih commented

Thanks for clarifying @dsherry!

For the pipeline evaluation / understanding solution, I'm a fan of option 4, as it addresses our requirements without adding more complexity to our pipeline and component APIs (especially since this will be a short-term solution). Likewise, I'm a fan of option 7 for datachecks, for the same reason. Currently, 4 of our 10 datachecks (e.g. IDColumnsDataCheck, OutliersDataCheck, TargetLeakageDataCheck) do some sort of data type selection or checking, so option 8 does have potential. How about I move forward with option 7 and file option 8 as a future improvement we can consider?


dsherry commented Mar 30, 2021

@jeremyliweishih yep that sounds good to me! @tyler3991 FYI
