(Issue already discussed in this Slack thread; logging it here for easier tracking.)
Currently, the names used for saving artifacts in a `step` are determined by the type annotation in the function definition.
This works great when a step is only used once in a pipeline, but not so much when the same step needs to be called multiple times with different inputs, and the resulting artifacts need to later be used in a different pipeline. When that happens, the output artifacts get saved with the same name and a bumped version number, which makes it really hard to track the specific one needed later down the road.
Typical example of this: training some preprocessor that later needs to be used three different times for transforming train, validation, and test data, and I end up with three versions of an object called `transformed_data` and need to keep track of which is which.
Returning outputs from pipelines quickly gets out of control and is very hard to maintain when there are lots of artifacts that might potentially be reused later.
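To make the collision concrete, here is a minimal sketch that does not use the real library. `ToyArtifactStore` is a hypothetical in-memory stand-in that only mimics the behavior described above: saving under an existing name bumps the version.

```python
from collections import defaultdict
from typing import Any, Dict, List


class ToyArtifactStore:
    """Hypothetical stand-in for an artifact store (not the real API)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Any]] = defaultdict(list)

    def save(self, name: str, value: Any) -> int:
        """Store a value under a name and return its 1-based version number."""
        self._versions[name].append(value)
        return len(self._versions[name])

    def version_count(self, name: str) -> int:
        """Return how many versions exist under a name."""
        return len(self._versions.get(name, []))


store = ToyArtifactStore()

# The same step runs three times; every output is saved under the annotated name.
saved_versions = [
    store.save("transformed_data", split)
    for split in ("train", "validation", "test")
]
```

After this runs, all three outputs live under `transformed_data` and differ only by version number, so the caller has to remember which version corresponds to which split.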
Suggested solution: Similar to how a step name (usually the function name) can be overridden at run time by the `id` parameter when calling the step, introduce an optional parameter to step calls that overrides the saved names of the outputs. It could look something like `output_names: Optional[Dict[str, str]]`, where the dict must contain the names defined in the function type annotations as keys and the desired saved names as values. If that parameter is not passed, things should behave as they currently do; when it is passed, the produced artifacts should be saved with the passed names.
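The name resolution proposed above could be sketched roughly as follows. This is only an illustration of the requested behavior, not existing library code: `resolve_output_names` is a hypothetical helper, and the strict key check is one reading of "the dict must contain the names defined in the function type annotations as keys".

```python
from typing import Dict, List, Optional


def resolve_output_names(
    annotated_names: List[str],
    output_names: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Map each annotated output name to the name it should be saved under.

    Hypothetical helper: without an override, every output keeps its
    annotated name (the current behavior described in this issue).
    """
    if output_names is None:
        return {name: name for name in annotated_names}

    # Assumption: the override must cover exactly the annotated names.
    missing = set(annotated_names) - set(output_names)
    unknown = set(output_names) - set(annotated_names)
    if missing or unknown:
        raise ValueError(
            f"output_names mismatch: missing={sorted(missing)}, "
            f"unknown={sorted(unknown)}"
        )
    return {name: output_names[name] for name in annotated_names}


# Default call: names are unchanged.
default_names = resolve_output_names(["transformed_data"])

# Overridden call, e.g. for the train split.
train_names = resolve_output_names(
    ["transformed_data"], {"transformed_data": "transformed_train"}
)
```

With this in place, the three calls from the example could save `transformed_train`, `transformed_validation`, and `transformed_test` instead of three versions of `transformed_data`.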
Reproduction steps
No response
Relevant log output
No response
Code of Conduct
I agree to follow this project's Code of Conduct
Contact Details [Optional]
No response
System Information
N/A