Update pipeline and components to return Woodwork data structures #1406

angela97lin · 2020-11-04T17:23:50Z

#1393 updated pipelines to accept Woodwork data structures, and #1288 addresses updating pipelines and components to accept Woodwork data structures as input. However, the output for methods like transform and predict are still pandas DataFrames, which is odd. This issue tracks updating our methods to return Woodwork data structures.

The text was updated successfully, but these errors were encountered:

angela97lin · 2021-01-04T16:20:55Z

Punting this for now, given Woodwork is finalizing plans on big updates. If Woodwork becomes an extension of pandas, we may not want or need to do this.

dsherry · 2021-01-14T23:43:30Z

@angela97lin and I checked in, and discussed a few implementation options:

Have component graph evaluation pass pandas to each component. To indicate ww types to components, either add new fields to fit etc., or stick with the text featurizer pattern of using init parameters to indicate relevant columns. Disadvantage: ugly from API perspective, this is why we created woodwork in the first place.
During component graph evaluation, pass woodwork to each component. Have each component return pandas. Disadvantage: a potential limitation is that components cannot change woodwork type of the input features or of newly generated features, except through changing the pandas dtype. We don't have any components which rely on this however.
During component graph evaluation, pass woodwork to each component. Have each component return woodwork. Challenge: most components must convert to pandas internally to do transformations like adding features, deleting features or modifying features. After those transformations, we have to make sure the original woodwork types get into the new returned woodwork datatable, otherwise user-overridden settings will get lost, as they are today.

Status: @angela97lin is currently pursuing option 3 in #1668

Plan: we'll continue that strategy, keeping an eye out for reduced runtime due to multiple ww datatable instantiations. And we'll consider if there are any feature requests we should make to woodwork to make this easier. We'll also keep an eye out for any compelling options we may have missed so far.

@chukarsten @gsheni

chukarsten · 2021-01-14T23:50:10Z

It seems like the third option is the best, cleanest option. Hopefully the performance isn't impacted, but conceptually it seems sound. Thanks for bringing it to my attention...trying to wrap my head around all of the things.

angela97lin · 2021-01-15T01:08:02Z

Hacking on this and thinking some more:

The end goal is that we need some way to keep track of the original logical types that the user wants. This could be information held by the component graph, or passed along to each component which is responsible for then setting those types back after transforming some data. Currently pursuing 3, and adding the information to component graph since that's the easiest to test (rather than updating every component)... but on the component level, it wouldn't make sense.

Say a user specifies a Woodwork DataTable and explicitly converts a categorical col to natural language. The user passes that to a component. We need to convert to pandas to pass to external libraries, and we'd like to return a Woodwork object. If we simply call a Woodwork constructor, it would only take the inferred type (categorical), which is odd? So we should keep track of the original specified natural language type and convert before handing back to user.

Interesting to note is the standard scaler: it could take int columns and convert them to floats. If we then try to set the col back to the original type (int), we'll get yelled at for trying to convert floats to int when that's not safe. 😬

angela97lin · 2021-01-15T19:42:30Z

Update: had a quick discussion with @dsherry and @chukarsten. I'm currently implementing #3, but having the component graph handle keeping track of the original user types and updating that information as it gets passed from component to component. This works okay and gets us to a place where AutoML / pipelines work, but after #1668 is merged, we should tackle handling this on the component level and removing this code from component graph.

My next to-dos: fix index tests from updating branch from main, clean up comments and file issues that can be addressed not-related to this PR (just general cleanup code). Once code is more clean, look for redundancies and profile to see where this huge time difference is coming from / what we can do about it.

angela97lin self-assigned this Nov 4, 2020

angela97lin mentioned this issue Dec 3, 2020

Integrate Woodwork DataTables into EvalML #1229

Closed

dsherry mentioned this issue Jan 7, 2021

Custom woodwork types aren't preserved in components #1662

Closed

angela97lin mentioned this issue Jan 8, 2021

Update components and pipelines to return Woodwork data structures #1668

Merged

dsherry mentioned this issue Jan 8, 2021

Allow Objectives to take in list inputs #1663

Merged

angela97lin closed this as completed in #1668 Jan 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update pipeline and components to return Woodwork data structures #1406

Update pipeline and components to return Woodwork data structures #1406

angela97lin commented Nov 4, 2020

angela97lin commented Jan 4, 2021

dsherry commented Jan 14, 2021

chukarsten commented Jan 14, 2021

angela97lin commented Jan 15, 2021

angela97lin commented Jan 15, 2021

Update pipeline and components to return Woodwork data structures #1406

Update pipeline and components to return Woodwork data structures #1406

Comments

angela97lin commented Nov 4, 2020

angela97lin commented Jan 4, 2021

dsherry commented Jan 14, 2021

chukarsten commented Jan 14, 2021

angela97lin commented Jan 15, 2021

angela97lin commented Jan 15, 2021