Automatically drop features tagged as "primary key" #1862
Comments
@angela97lin and others: heads up, this isn't part of #1730, but if we add this before the data check actions stuff is in place, we should file another issue to track porting the implementation to use data check actions.
My plan moving forward is to add a …
@dsherry
Important point: if a WW DT has a column with the …

Option 1: Add a drop column component at the beginning of a pipeline and an add column component at the end of the pipeline in AutoML. This will need edits to the AutoML logic (creating pipelines, etc.) to accommodate it.

Option 2: Add and drop index columns in each individual component. Can be done in …

Option 3 (for data checks): Add and drop index columns in each data check's …

Overall we need to choose a combination of 1 + 3 or 2 + 3. I am a fan of 2 + 3: option 2 moves handling of index columns from an AutoML-only feature into a general evalml feature (through components), and using wrappers and metaclasses reduces the current code change and the demand on future developers to add boilerplate.
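The wrapper idea behind option 2 could be sketched roughly as below. This is a hypothetical illustration, not evalml's actual API: plain dicts stand in for a woodwork DataTable, and `drop_index_columns`, `DoubleFeatures`, and the tag plumbing are all invented names.

```python
import functools

INDEX_TAG = "index"  # stand-in for woodwork's "index" semantic tag


def drop_index_columns(transform):
    """Hypothetical decorator: strip index-tagged columns before a
    component's transform runs, then re-attach them afterwards."""
    @functools.wraps(transform)
    def wrapper(self, data, tags):
        # data: dict of column name -> list of values (stand-in for a DataFrame)
        # tags: dict of column name -> set of semantic tags
        index_cols = {c: v for c, v in data.items() if INDEX_TAG in tags.get(c, set())}
        features = {c: v for c, v in data.items() if c not in index_cols}
        out = transform(self, features, tags)
        out.update(index_cols)  # re-attach index columns untouched
        return out
    return wrapper


class DoubleFeatures:
    """Toy component: doubles every value in every feature column."""
    @drop_index_columns
    def transform(self, data, tags):
        return {c: [v * 2 for v in vals] for c, vals in data.items()}
```

A component author only adds the decorator; the index columns never reach the component body, which is the boilerplate reduction the comment is after. For example, `DoubleFeatures().transform({"id": [1, 2], "x": [3, 4]}, {"id": {"index"}})` doubles `x` while leaving `id` intact.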
@jeremyliweishih thanks for posting this. Your summary of the requirements is excellent. I agree with your point about DT/DF: we must filter index features out before we convert from ww to DF. It helps that under the new ww accessor model, that "conversion" is no longer necessary! #1965

I would split this problem into two parts. The first is pipeline evaluation and model understanding, where the requirements are "don't pass 'primary key' columns to the estimators" and "don't drop the index for prediction explanations." Your options 1 and 2 apply here. The second is the data checks, where the requirement is "don't run data checks on the index feature." Option 3 applies to the data checks.

I have some ideas for the pipeline evaluation / understanding solution, ordered by my guess at their cost: … We want to do option 6 in the long term, but in the short term I'm a fan of 4 or 5.

And for data checks (unordered): … 7 is the easiest to implement and feels like the right short-term option. 8 may be interesting though; if we can think of other use-cases for limiting by feature type, that would be my choice. Thoughts?
Thanks for clarifying @dsherry! For the pipeline evaluation / understanding solution I'm a fan of option 4, since it addresses our requirements without adding more complexity to our pipeline and component APIs (especially since this will be a short-term solution). Likewise, I'm a fan of option 7 for data checks for the same reason. Currently, 4/10 of our data checks (IDColumnsDataCheck, OutliersDataCheck, TargetLeakageDataCheck) do some sort of data type selection or checking, so option 8 does have potential. How about I move forward with option 7 and file option 8 as a future improvement we can consider?
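For the data-check side (option 7, skipping the index feature), the filtering could look roughly like this. Everything here is illustrative rather than evalml's real API: `exclude_tagged` and `HighlyNullDataCheck` are invented stand-ins, and plain dicts substitute for a woodwork DataTable and its semantic tags.

```python
def exclude_tagged(data, tags, excluded_tag="index"):
    """Return only the columns whose semantic tags do not include excluded_tag."""
    return {c: v for c, v in data.items() if excluded_tag not in tags.get(c, set())}


class HighlyNullDataCheck:
    """Toy data check: flags columns where more than half the values are None."""

    def validate(self, data, tags):
        data = exclude_tagged(data, tags)  # index columns never get checked
        warnings = []
        for col, values in data.items():
            null_frac = sum(v is None for v in values) / len(values)
            if null_frac > 0.5:
                warnings.append(f"Column '{col}' is {null_frac:.0%} null")
        return warnings
```

With this shape, an all-null index column produces no warning because it is filtered out before validation runs, which is the behavior the thread is asking for.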
@jeremyliweishih yep that sounds good to me! @tyler3991 FYI
There are plans to add a "primary key" semantic tag to woodwork, to indicate a column which can be used for joining multiple tables, in featuretools or elsewhere. Once that is added, we should update evalml to add a DropColumn component to remove column(s) tagged with "primary key".
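Once woodwork ships the tag, the proposed component might select columns by semantic tag along these lines. This is a minimal sketch under assumptions: `columns_to_drop` and `DropTaggedColumns` are hypothetical names, and a plain dict of column name to tag set stands in for woodwork's semantic tags.

```python
def columns_to_drop(tags, drop_tag="primary key"):
    """Return the names of columns carrying the given semantic tag.
    (Hypothetical helper; the real component would read a woodwork
    DataTable's semantic tags instead of a plain dict.)"""
    return [c for c, t in tags.items() if drop_tag in t]


class DropTaggedColumns:
    """Minimal stand-in for the proposed DropColumn component."""

    def __init__(self, drop_tag="primary key"):
        self.drop_tag = drop_tag

    def transform(self, data, tags):
        # Drop every column tagged with self.drop_tag; keep the rest as-is.
        drop = set(columns_to_drop(tags, self.drop_tag))
        return {c: v for c, v in data.items() if c not in drop}
```

For example, with `tags = {"customer_id": {"primary key"}, "age": {"numeric"}}`, transforming `{"customer_id": [1, 2], "age": [30, 40]}` keeps only the `age` column, so the join key never reaches an estimator.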