Change _schema_is_equal check to _schema_is_compatible and use training schema for predict data #4133

Open
tamargrey opened this issue Apr 10, 2023 · 1 comment

Comments

@tamargrey
Contributor

Currently, in ComponentGraph._transform_features, when the graph is not being fit (that is, at transform or predict time), we check whether X's woodwork schema is equal to ComponentGraph.input_types. If the types do not match, we raise a PipelineError. We do this because having different types at train vs. predict can cause unpredictable and confusing errors in our components.

However, this way of checking for and handling unequal schemas can be problematic. First, the error message, "Input X data types are different from the input types the pipeline was fitted on.", isn't very detailed, and the accompanying details are just Woodwork.TableSchema.types, which doesn't include information like feature origins or the woodwork metadata, so the error is difficult to debug. Second, checking for schema equality is too restrictive. There are cases where slightly different woodwork types are inferred for the data, but the data is still inherently compatible with the original types, so we shouldn't need to raise an error. For example, null values may cause data that was originally Integer to be inferred as IntegerNullable, or a column that was Categorical may be inferred as Unknown when the dataset at predict is much smaller.
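To make the first problem concrete, here is a minimal sketch of a more descriptive schema comparison. It models a schema as a plain dict of column name to logical type name, which is a simplification of a real woodwork schema, and `describe_schema_diff` is a hypothetical helper, not an existing evalml or woodwork function.

```python
def describe_schema_diff(train_types, predict_types):
    """Return human-readable lines describing how two schemas differ.

    Both arguments are plain dicts mapping column name -> logical type name,
    a simplified stand-in for a woodwork schema (assumption for illustration).
    """
    lines = []
    # Columns seen at fit but absent at predict.
    for col in sorted(train_types.keys() - predict_types.keys()):
        lines.append(f"column '{col}' is missing at predict")
    # Columns seen at predict but absent at fit.
    for col in sorted(predict_types.keys() - train_types.keys()):
        lines.append(f"column '{col}' was not present at fit")
    # Shared columns whose logical type changed.
    for col in sorted(train_types.keys() & predict_types.keys()):
        if train_types[col] != predict_types[col]:
            lines.append(
                f"column '{col}': {train_types[col]} at fit, "
                f"{predict_types[col]} at predict"
            )
    return lines


train = {"age": "Integer", "state": "Categorical", "income": "Double"}
predict = {"age": "IntegerNullable", "state": "Unknown", "income": "Double"}

diff = describe_schema_diff(train, predict)
for line in diff:
    print(line)
```

A strict equality check would reject `predict` here, even though both changed columns are the kind of drift described above. A per-column report at least shows which columns triggered the failure.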

We should change this logic to be more permissive of these types of changes as long as the data is still compatible with the original types and improve the description of what is different between the schemas.

To do this, we can:

  1. Check for woodwork schema equality and warn if there are any differences, whether in the columns present, the logical types, the woodwork metadata, or anything else. This requires a better way to describe the difference between woodwork schemas.
  2. If the schemas are not equal, attempt to initialize X with ComponentGraph.input_types via X.ww.init(schema=self._input_types). As long as the new data is compatible with the original schema, this will work. If some columns have been lost or the logical types are incompatible with the data, a woodwork error will be raised, which we can catch to raise our own error if we'd like.
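The two steps could be sketched roughly as follows. This is a pure-Python model rather than real evalml code: `init_with_schema` stands in for X.ww.init(schema=...), `PipelineError` is a local stand-in class, and the `VALIDATORS` table and `check_predict_schema` are hypothetical names invented for illustration.

```python
import warnings


class PipelineError(Exception):
    """Local stand-in for the pipeline error type (assumption)."""


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# Hypothetical validators: does every value fit the logical type?
VALIDATORS = {
    "Integer": lambda vals: all(_is_int(v) for v in vals),
    "IntegerNullable": lambda vals: all(v is None or _is_int(v) for v in vals),
    "Categorical": lambda vals: all(v is None or isinstance(v, str) for v in vals),
    "Unknown": lambda vals: True,
}


def init_with_schema(data, schema):
    """Stand-in for X.ww.init(schema=...): raise ValueError if data does not fit."""
    for col, logical_type in schema.items():
        if col not in data:
            raise ValueError(f"column '{col}' is missing")
        if not VALIDATORS[logical_type](data[col]):
            raise ValueError(f"column '{col}' is not valid for {logical_type}")
    return dict(schema)


def check_predict_schema(data, inferred_types, input_types):
    """Step 1: warn on any difference. Step 2: fall back to the training schema."""
    if inferred_types == input_types:
        return inferred_types
    warnings.warn("Input X schema differs from the schema the pipeline was fitted on.")
    try:
        return init_with_schema(data, input_types)
    except ValueError as err:
        raise PipelineError(
            f"Input X is incompatible with the training schema: {err}"
        ) from err


input_types = {"age": "Integer", "state": "Categorical"}
small = {"age": [31, 45], "state": ["NY", "CA"]}
inferred = {"age": "Integer", "state": "Unknown"}  # smaller dataset at predict

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = check_predict_schema(small, inferred, input_types)

print(result == input_types, len(caught))
```

In this model, the Categorical column that was inferred as Unknown is accepted with a warning, while data that is missing a training column raises the custom error instead.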

Note 1: There is some logic at this step that relates to the dfs transformer: if it is present in the graph, we only check the equality of the non-engineered features. This logic will still be needed, and improving the description of the difference between two woodwork schemas will make bugs around it easier to understand (for example, if we don't maintain feature origins, causing the schemas to contain different sets of columns, we will be able to see it).
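As an illustration of Note 1, the sketch below compares only the non-engineered columns. It assumes each column carries an origin of "base" or "engineered" and models the schema as a nested dict. The helper name and the dict layout are hypothetical and are not how evalml implements this check.

```python
def non_engineered_types(schema):
    """Keep only columns whose origin is not 'engineered'."""
    return {
        col: info["logical_type"]
        for col, info in schema.items()
        if info.get("origin") != "engineered"
    }


fit_schema = {
    "amount": {"logical_type": "Double", "origin": "base"},
}
predict_schema = {
    "amount": {"logical_type": "Double", "origin": "base"},
    "MEAN(amount)": {"logical_type": "Double", "origin": "engineered"},
}
# Same columns, but the feature origin was not maintained on the engineered one.
lost_origin_schema = {
    "amount": {"logical_type": "Double", "origin": "base"},
    "MEAN(amount)": {"logical_type": "Double", "origin": None},
}

fit_types = non_engineered_types(fit_schema)
print(non_engineered_types(predict_schema) == fit_types)

# With the origin lost, the engineered column leaks into the comparison.
extra = non_engineered_types(lost_origin_schema).keys() - fit_types.keys()
print(sorted(extra))
```

When the origin is preserved the comparison passes. When it is lost, the extra column name appears in the difference, which is the kind of detail a better schema description would expose.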

Note 2: We should complete #4077 as part of this implementation. It will require that we override the logical types from self._input_types with the nullable types if they're present in X but not in the corresponding column of self._input_types.
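One way to picture the override described in Note 2 is sketched below. It assumes schemas are plain dicts of column name to logical type name. The `NULLABLE_COUNTERPART` table and `merge_nullable_types` are hypothetical names for illustration only, and the pairs listed are woodwork's nullable logical types.

```python
# Non-nullable logical type -> its nullable counterpart.
NULLABLE_COUNTERPART = {
    "Integer": "IntegerNullable",
    "Boolean": "BooleanNullable",
    "Age": "AgeNullable",
}


def merge_nullable_types(input_types, predict_types):
    """Start from the training types, but keep a nullable type found in X.

    A column is overridden only when X has the nullable counterpart of the
    training type; every other column keeps its training type.
    """
    merged = dict(input_types)
    for col, train_type in input_types.items():
        nullable = NULLABLE_COUNTERPART.get(train_type)
        if nullable is not None and predict_types.get(col) == nullable:
            merged[col] = nullable
    return merged


input_types = {"age": "Integer", "active": "Boolean", "income": "Double"}
predict_types = {"age": "IntegerNullable", "active": "Boolean", "income": "Double"}

merged = merge_nullable_types(input_types, predict_types)
print(merged)
```

Only `age` changes here, because it is the only column where X carries the nullable version of the training type.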

@tamargrey
Contributor Author

Opened alteryx/woodwork#1670, which will be necessary to properly display the differences in woodwork typing info to users.
