I want a way to supply pipeline stage constructor parameters with future-like placeholders, so that actual values are determined by prior stages at application time.
I would assign to the parameter the name of the future `ApplicationContext` key that will hold the value once it is calculated, wrapped in a unique class.
The constructor will hold on to this object and, at application time, pull the value of the right key from either the fit context or the application context (depending on how I set it: I might want the value to be fixed on pipeline fit, or set dynamically on each application, even on transforms when the pipeline is already fitted), and use it for the transformation. The default should probably be the fit context?
Here's an example:
```python
import numpy as np
import pandas as pd
import pdpipe as pdp


def scaling_decider(X: pd.DataFrame) -> str:
    """Determines which type of scaling to apply by examining all numerical columns."""
    numX = X.select_dtypes(include=np.number)
    for col in numX.columns:
        # this is nonsense logic, just an example
        if np.std(numX[col]) > 2 * np.mean(numX[col]):
            return 'StandardScaler'
    return 'MinMaxScaler'


pipeline = pdp.PdPipeline(stages=[
    pdp.ColDrop(pdp.cq.StartWith('n_')),
    pdp.ApplicationContextEnricher(scaling_type=scaling_decider),
    pdp.Scale(
        # fit=False means it will take it from the application context, not the fit context
        scaler=pdp.contextual('scaling_type', fit=False),
        joint=True,
    ),
])
```
## Design
This has to be implemented at the `PdPipeline` base class. Since the base class can't hijack constructor arguments, I think the contract with extending classes should be:

When implementing a class extending `PdPipeline`, if you want to enjoy support for contextual constructor parameters, you MUST delay any initialization of inner state objects to `fit_transform`, so that the fit/application context is available on initialization (it is NOT available at pipeline stage construction and initialization, after all).

`pdp.contextual` is a factory function that returns contextual parameter placeholder objects. Calling code shouldn't really care about them, as it should never interact with the resulting objects directly. I think.

`PdPipeline` can auto-magically make sure that any attribute of a `PdPipeline` instance that is assigned a `pdp.contextual` object in the constructor (e.g. `self.k = k`, where the `k` constructor argument was provided as `k=pdp.contextual('pca_k')`) will be hot-swapped with a concrete value by the time we wish to use it in `fit_transform` or `transform` (for example, when we call `self.pca_ = PCA(k=self.k)`). It can also do so for any such object contained in any iterable or dict-like attribute (so if I have `self._pca_kwargs = {...}` in my constructor, I can safely call `self.pca_ = PCA(**self._pca_kwargs)` in `fit_transform()`).
## Implementation thoughts
To make this efficient, since this means inspecting sub-class instance attributes on pipeline transformations, I have a few thoughts:
The `contextuals` module should have a global variable such as `CONTEXTUALS_ARE_ON = False`. The `pdp.contextual` factory function then sets it with `global CONTEXTUALS_ARE_ON; CONTEXTUALS_ARE_ON = True` when called. We then condition the whole inspection-heavy logic on this indicator variable, so that if our user never called `pdp.contextual` during the current kernel, runtime is saved.
I first thought `pdp.contextual` could somehow register `_ContextualParam` objects in a registry we could use to find what needs to be swapped, but actually this wouldn't help, as they can't know which attribute of which pipeline stage they were assigned to.
We thus have to scan sub-class attributes, but we can do so if and only if `pdp.contextual` was called, and right after pipeline stage initialization. Moreover, we can create a literal list of all attribute names we should ignore, stored in `pdpipe.core` as a global, e.g. `_IGNORE_PDPSTAGE_ATT = ['_desc', '_name']`, etc.: everything we know isn't an attribute the sub-class declared. Then, we can check any attribute that isn't one of these. This can be done in `pdpipe.PdPipelineStage.__init__()`, since it's called (by contract; we can demand that from extending subclasses) at the end of the `__init__()` method of subclasses. When we find that such an attribute holds a `pdp.contextual` placeholder, we register it in the `pdp.contextuals` module, in some global dict (or something more sophisticated), keyed by the attribute name. We can also register the containing stage object.
Then, in a stage's `fit_transform` and `transform` methods, if the current stage object is registered for contextual hot-swapping, we find the concrete contextual value of any attribute registered for this stage (in either the `self.fit_context` object or the `self.application_context` object that the pipeline injects into all stages during applications of the pipeline) and hot-swap it literally. This will look something like `setattr(self, 'k', self.fit_context['pca_k'])`, since we're in `pdpipe.PdPipelineStage.fit_transform()` and the `self` object is an instance of the subclass requiring the hot swap (in this case, `pdp.Decompose`).