I want a way to supply pipeline stage constructor parameters with future-like placeholders, so that actual values are determined by prior stages at application time.
I would assign to the parameter the name of the future `ApplicationContext` key that will hold the value once it is calculated, wrapped in a unique class.
The constructor will hold on to this object and, at application time, pull the value of the right key from either the fit context or the application context (depending on how I set it: I might want the value to be fixed on pipeline fit, or set dynamically on each application, even on transforms when the pipeline is already fitted), and use it for the transformation. The default should probably be the fit context?
Here's an example:
```python
import numpy as np
import pandas as pd
import pdpipe as pdp


def scaling_decider(X: pd.DataFrame) -> str:
    """Determines which type of scaling to apply by examining all numerical columns."""
    numX = X.select_dtypes(include=np.number)
    for col in numX.columns:
        # this is nonsense logic, just an example
        if np.std(numX[col]) > 2 * np.mean(numX[col]):
            return 'StandardScaler'
    return 'MinMaxScaler'


pipeline = pdp.PdPipeline(stages=[
    pdp.ColDrop(pdp.cq.StartWith('n_')),
    pdp.ApplicationContextEnricher(scaling_type=scaling_decider),
    pdp.Scale(
        # fit=False means it will take it from the application context, not the fit context
        scaler=pdp.contextual('scaling_type', fit=False),
        joint=True,
    ),
])
```
## Design
This has to be implemented at the `PdPipeline` base class. Since the base class can't hijack constructor arguments, I think the contract with extending classes should be:

When implementing a class extending `PdPipeline`, if you want to enjoy support for contextual constructor parameters, you MUST delay any initialization of inner state objects to `fit_transform`, so that the fit/application context is available on initialization (it is NOT available at pipeline stage construction and initialization, after all).

`pdp.contextual` is a factory function that returns contextual parameter placeholder objects. Calling code shouldn't really care about them, as it should never interact with the resulting objects directly. I think.

`PdPipeline` can auto-magically make sure that any attribute of a `PdPipeline` instance that is assigned a `pdp.contextual` object in the constructor (e.g. `self.k = k`, where the `k` constructor argument was provided as `k=pdp.contextual('pca_k')`) will be hot-swapped with a concrete value by the time we wish to use it in `fit_transform` or `transform` (for example, when we call `self.pca_ = PCA(k=self.k)`). It can also do so for any such object contained in any iterable or dict-like attribute (so if I have `self._pca_kwargs = {...}` in my constructor, I can safely call `self.pca_ = PCA(**self._pca_kwargs)` in `fit_transform()`).
## Implementation thoughts
To make this efficient, since this means inspecting sub-class instance attributes on pipeline transformations, I have a few thoughts:
The `contextuals` module should have a global variable such as `CONTEXTUALS_ARE_ON = False`. The `pdp.contextual` factory function then sets it with `global CONTEXTUALS_ARE_ON; CONTEXTUALS_ARE_ON = True` when called. We then condition the whole inspection-heavy logic on this indicator variable, so that if our user never called `pdp.contextual` during the current kernel, runtime is saved.
I first thought `pdp.contextual` could somehow register `_ContextualParam` objects in a registry we could use to find what needs to be swapped, but actually this wouldn't help, as they can't know which attribute of which pipeline stage they were assigned to.
We thus have to scan sub-class attributes, but we can do so if and only if `pdp.contextual` was called, and right after pipeline stage initialization. Moreover, we can create a literal list of all attribute names we should ignore, stored in `pdpipe.core` as a global, e.g. `_IGNORE_PDPSTAGE_ATT = ['_desc', '_name']`, etc.: everything we know isn't an attribute the sub-class declared. Then, we can check any attribute that isn't one of these. This can be done in `pdpipe.PdPipelineStage.__init__()`, since it's called (by contract; we can demand that from extending subclasses) at the end of the `__init__()` method of subclasses. When we find that such an attribute holds a `pdp.contextual` placeholder, we register it in the `pdp.contextuals` module, in some global dict (or something more sophisticated), keyed by the attribute name. We can also register the containing stage object.
Then, in a stage's `fit_transform` and `transform` methods, if the current stage object is registered for contextual hot-swapping, we find the concrete contextual value of any attribute registered for this stage (in either the `self.fit_context` object or the `self.application_context` object that the pipeline injects into all stages during applications of the pipeline) and hot-swap it literally. This will look something like `setattr(self, 'k', self.fit_context['pca_k'])`, since we're in `pdpipe.PdPipelineStage.fit_transform()` and the `self` object is an instance of the subclass requiring the hot swap (in this case, `pdp.Decompose`).