
Automatically drop features tagged as "primary key" #1862

Closed
dsherry opened this issue Feb 18, 2021 · 6 comments · Fixed by #2138
Labels: new feature (Features which don't yet exist.)


dsherry commented Feb 18, 2021

There are plans to add a "primary key" semantic tag to woodwork, to indicate a column which can be used for joining multiple tables, in featuretools or elsewhere. Once that is added, we should update evalml to add a DropColumn component to remove column(s) tagged with "primary key".

dsherry added the new feature label Feb 18, 2021

dsherry commented Mar 19, 2021

@angela97lin and others: heads up, this isn't part of #1730, but if we add this before the data check actions stuff is in place, we should file another issue to track porting the implementation to use data check actions.

jeremyliweishih commented

My plan moving forward is to add a Drop Columns Transformer in _get_preprocessing_components when an index column exists, and then add those columns to self.pipeline_parameters. However, #2015 showcases a bug where we cannot specify list-valued hyperparameters in pipeline_parameters. If we go with this solution, #2015 will be blocking as well.
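For concreteness, a rough sketch of what that plan could look like from the caller's side (the data and the "id" column are made up, and this assumes the AutoMLSearch pipeline_parameters hook from that era; note that "columns" is exactly the list-valued parameter #2015 blocks):

```python
import pandas as pd
from evalml.automl import AutoMLSearch

X = pd.DataFrame({"id": [0, 1, 2, 3], "feat": [1.0, 2.0, 3.0, 4.0]})
y = pd.Series([0, 1, 0, 1])

# Hypothetical: route the index column(s) to the Drop Columns Transformer
# via pipeline_parameters. "columns" is the list-valued parameter that
# #2015 currently prevents us from specifying.
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    pipeline_parameters={"Drop Columns Transformer": {"columns": ["id"]}},
)
```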


jeremyliweishih commented Mar 26, 2021

@dsherry
After discussion, the EvalML asks are as follows:

  • Primary key will be provided as an input column. Additionally, for pandas, it will be set as the DF index.
  • Don’t pass “primary key” columns to the estimators.
  • Don’t run data checks on the index feature.
  • Don’t drop index for prediction explanations.

Important point: if a WW DT has a column with the index semantic tag, then after converting the DT to a DF, the column will remain in the DataFrame AND the DataFrame index will be set to the same values as the column that was specified as the index. This means the column will remain in the DataFrame, and it is unlikely that our transformers or estimators will be able to automatically filter it out. The options below only apply to WW DT inputs, since a pandas index is not an input column and will not become one even when converting a pandas DF to a WW DT.
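To make the described state concrete, a plain-pandas illustration (column names made up):

```python
import pandas as pd

df = pd.DataFrame({"id": [10, 11, 12], "feature": [0.1, 0.2, 0.3]})

# The state described above: the index column stays as a regular column AND
# also becomes the DataFrame index, so downstream components still see it.
df = df.set_index("id", drop=False)
print(list(df.columns))  # ['id', 'feature'] -- "id" was not filtered out
```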

Option 1: Add a drop column component at the beginning of a pipeline and an add column component at the end of the pipeline in AutoML. Will need to edit AutoML logic (creating pipelines etc.) to accommodate this. (A rough sketch follows the pros and cons below.)

  • Pros: Separates index handling from other components, index behavior only done when we want it, applies to prediction explanations
  • Cons: Does not apply to datachecks, will only apply to AutoML pipelines
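A rough sketch of the bookkeeping option 1 implies (helper names hypothetical):

```python
import pandas as pd

def drop_index_columns(X: pd.DataFrame, index_cols):
    """First pipeline step: remove the index columns, keeping a copy."""
    saved = X[list(index_cols)].copy()
    return X.drop(columns=list(index_cols)), saved

def add_index_columns(X: pd.DataFrame, saved: pd.DataFrame) -> pd.DataFrame:
    """Last pipeline step: re-attach the saved index columns."""
    return pd.concat([X, saved], axis=1)
```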

Option 2: Add and drop index columns in each individual component. This can be done in transform(), and to generalize it we can add a wrapper around transform() to ComponentBaseMeta. (A sketch follows the pros and cons below.)

  • Pros: EvalML pipelines will automatically handle index columns, applies to prediction explanations
  • Cons: Extra compute per component but difference should be slight, does not apply to datachecks
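A minimal sketch of the metaclass idea from option 2 (IndexHandlingMeta and the index_columns attribute are hypothetical stand-ins for what would actually live on ComponentBaseMeta):

```python
import pandas as pd

class IndexHandlingMeta(type):
    """Hypothetical metaclass: wraps a component's transform() so index
    columns are dropped on the way in and re-attached on the way out."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        if "transform" in namespace:
            cls.transform = mcs._handle_index(namespace["transform"])
        return cls

    @staticmethod
    def _handle_index(transform):
        def wrapped(self, X, y=None):
            # "index_columns" is an assumed attribute naming the tagged columns.
            index_cols = list(getattr(self, "index_columns", []))
            saved = X[index_cols]
            output = transform(self, X.drop(columns=index_cols), y)
            return pd.concat([output, saved], axis=1)

        return wrapped
```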

Option 3 (for datachecks): Add and drop index columns in each datacheck's validate() method; to generalize, we can add a new DataCheckMeta class with a wrapper around validate().

Overall we need to choose a combination of 1 + 3 or 2 + 3. I am a fan of 2 + 3, as option 2 turns index-column handling from an AutoML feature into a general EvalML feature (through components), and the use of wrappers and metaclasses will reduce the code changes needed now and spare future developers from writing boilerplate.


dsherry commented Mar 26, 2021

@jeremyliweishih thanks for posting this. Your summary of the requirements is excellent.

I agree with your point about DT/DF: we must filter index features out before we convert from ww to DF. It helps that under the new ww accessor model, that "conversion" is no longer necessary! #1965

I would split this problem into two parts. The first is pipeline evaluation and model understanding, where the requirements are "Don’t pass 'primary key' columns to the estimators" and "Don’t drop index for prediction explanations." Your options 1 and 2 apply to pipeline evaluation / understanding. The second is the data checks, where the requirement is "Don’t run data checks on the index feature." Option 3 applies to the data checks.

I have some ideas for the pipeline evaluation / understanding solution, ordered by my guess at their cost:
Option 4: add a drop component to the beginning of the pipeline which drops all index features. That's all. Simpler variant of option 1. Pred expls can access the index feature using the dataframe index. (A sketch follows option 6 below.)
Option 5: ensure that every component we support selects the specific feature types it supports before running. Components will therefore ignore index-tagged features, because those don't count as the numeric, categorical, natural language or datetime features our components currently operate on. Pred expls can access the index feature using the dataframe index.
Option 6: define separate pipeline paths for each feature type (#1728 ), numeric/categorical/natlang/datetime/index. Don't add any components to the index path, indicating that information is not passed to the rest of the DAG. Pred expls can access the index feature using the dataframe index.
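As a quick illustration of option 4, using the class-based pipeline definitions evalml has today (the component names are real; the pipeline class and the "id" column are made up for the example):

```python
from evalml.pipelines import BinaryClassificationPipeline

class DropIndexPipeline(BinaryClassificationPipeline):
    """Sketch of option 4: the only change is the prepended drop component."""
    component_graph = [
        "Drop Columns Transformer",  # drops all index-tagged features up front
        "Imputer",
        "Random Forest Classifier",
    ]

# "id" stands in for whatever column(s) carry the index / primary key tag.
pipeline = DropIndexPipeline(
    parameters={"Drop Columns Transformer": {"columns": ["id"]}}
)
```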

We want to do option 6 in the long-term. But in the short-term, I'm a fan of 4 or 5.

And for datachecks (unordered):
Option 7: add code to the DataChecks.validate collection method which first drops all index features, then calls each data check's validate method. (A sketch follows option 8 below.)
Option 8: have every data check select the particular feature types it is defined for before running the check. No data checks should include index, just numeric/categorical/natlang/datetime. The specification of which types the data check is allowed for could happen in the impl, or it could be a class attribute each data check is asked to set, defaulting to numeric/categorical/natlang/datetime.
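A minimal sketch of option 7 (the signature and the result aggregation are simplified stand-ins; the real DataChecks.validate would merge each check's warnings and errors):

```python
import pandas as pd

def validate(data_checks, X: pd.DataFrame, y=None, index_cols=()):
    """Sketch of option 7: drop index features once, up front, then
    delegate to each individual data check's validate()."""
    X_no_index = X.drop(columns=list(index_cols))
    results = []
    for check in data_checks:
        results.append(check.validate(X_no_index, y))
    return results
```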

7 is the easiest to implement and feels like the right short-term option. 8 may be interesting though; if we can think of other use-cases for limiting by feature type, that would be my choice.

Thoughts?

jeremyliweishih commented

Thanks for clarifying @dsherry!

For the pipeline evaluation / understanding solution, I'm a fan of option 4, as it addresses our requirements without adding more complexity to our pipeline and component APIs (especially since this will be a short-term solution). Likewise, I'm a fan of option 7 for datachecks, for the same reason. Currently, 4 of our 10 datachecks (e.g. IDColumnsDataCheck, OutliersDataCheck, TargetLeakageDataCheck) do some sort of data type selection or checking, so option 8 does have potential. How about I move forward with option 7 and file option 8 as a future improvement we can consider?


dsherry commented Mar 30, 2021

@jeremyliweishih yep that sounds good to me! @tyler3991 FYI
