Feature Request: Contextual params using application context #105

Open
shaypal5 opened this issue Jul 3, 2022 · 0 comments

shaypal5 commented Jul 3, 2022

I want a way to supply pipeline stage constructor parameters with future-like placeholders, so that the actual values are determined by prior stages at application time.

I would assign to the parameter the name of the future ApplicationContext key that will hold the value once it is calculated, wrapped by a unique class.

The constructor will hold on to this object, and at application time it will pull the value of the right key from either the fit context or the application context (depending on how I set it: I might want the value to be determined once, on pipeline fit, or to be re-determined dynamically on each application, even on transforms of an already-fitted pipeline), and use it for the transformation. The default should probably be the fit context.

Here's an example:

import numpy as np
import pandas as pd
import pdpipe as pdp

def scaling_decider(X: pd.DataFrame) -> str:
    """Determines which type of scaling to apply by examining all numerical columns."""
    numX = X.select_dtypes(include=np.number)
    for col in numX.columns:
        # this is nonsense logic, just an example
        if np.std(numX[col]) > 2 * np.mean(numX[col]):
            return 'StandardScaler'
    return 'MinMaxScaler'

pipeline = pdp.PdPipeline(stages=[
    pdp.ColDrop(pdp.cq.StartWith('n_')),
    pdp.ApplicationContextEnricher(scaling_type=scaling_decider),
    pdp.Scale(
        # fit=False means the value is taken from the application context, not the fit context
        scaler=pdp.contextual('scaling_type', fit=False),
        joint=True,
    ),
])

Design

This has to be implemented at the PdPipelineStage base class. Since the base class can't hijack constructor arguments, I think the contract with extending classes should be:

  1. When implementing a class extending PdPipelineStage, if you want to enjoy support for contextual constructor parameters, you MUST delay any initialization of inner state objects to fit_transform, so that the fit/application context is available on initialization (it is NOT available at pipeline stage construction and initialization, after all).

  2. pdp.contextual is a factory function that returns contextual parameter placeholder objects. Code using it shouldn't really care about it, as it should never interact with the resulting objects directly. I think.

  3. PdPipelineStage can auto-magically make sure that any attribute of a stage instance that is assigned a pdp.contextual object in the constructor (e.g. self.k = k, where the k constructor argument was provided as k=pdp.contextual('pca_k')) will be hot-swapped with a concrete value by the time we wish to use it in fit_transform or transform (for example, when we call self.pca_ = PCA(k=self.k)). It can also do so for any such object that is contained in any iterable or dict-like attribute (so if I have self._pca_kwargs = {...} in my constructor, I can safely call self.pca_ = PCA(**self._pca_kwargs) in fit_transform()). A minimal sketch of the placeholder object and factory follows this list.
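
To make items 2 and 3 concrete, here is a minimal sketch of what the placeholder object, the factory, and the resolution step could look like. None of this is existing pdpipe code; _ContextualParam, contextual and resolve are hypothetical names used for illustration only.

class _ContextualParam:
    """Marks a constructor argument whose concrete value lives in a pipeline context."""

    def __init__(self, key: str, fit: bool = True):
        self.key = key  # name of the context key that will hold the concrete value
        self.fit = fit  # True: resolve from the fit context; False: the application context


def contextual(key: str, fit: bool = True) -> _ContextualParam:
    """Factory returning a contextual parameter placeholder (exposed as pdp.contextual)."""
    return _ContextualParam(key, fit=fit)


def resolve(value, fit_context: dict, application_context: dict):
    """Swaps a placeholder for its concrete context value; passes anything else through."""
    if isinstance(value, _ContextualParam):
        source = fit_context if value.fit else application_context
        return source[value.key]
    return value


# In the example above, the Scale stage would effectively end up doing:
# scaler = resolve(self._scaler, self.fit_context, self.application_context)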

Implementation thoughts

To make this efficient, since this means inspecting sub-class instance attributes on pipeline transformations, I have a few thoughts:

  1. The contextuals module should have a global variable such as CONTEXTUALS_ARE_ON = False. Then, when called, the pdp.contextual factory function runs global CONTEXTUALS_ARE_ON; CONTEXTUALS_ARE_ON = True. We then condition the whole inspection-heavy logic on this indicator variable, so that if our user never called pdp.contextual during the current kernel, runtime is saved.

  2. I first thought pdp.contextual could somehow register _ContextualParam objects in a registry we could use to find what needed to be swapped, but actually this wouldn't help, as they won't know which attribute of which pipeline stage they were assigned to.

  3. We thus have to scan sub-class attributes, but we can do so only if pdp.contextual was called, and only right after pipeline stage initialization. Moreover, we can create a literal list of all attribute names we should ignore, stored in pdpipe.core as a global, e.g. _IGNORE_PDPSTAGE_ATT = ['_desc', '_name'], etc., listing everything we know isn't an attribute the sub-class declared. Then, we can check any attribute that isn't one of these. This can be done in pdpipe.PdPipelineStage.__init__(), since it's called (by contract; we can demand that from extending subclasses) at the end of the __init__() method of subclasses. When we find that such an attribute holds a pdp.contextual, we register it at the pdp.contextuals module, in some global dict (or something more sophisticated), keyed by the attribute name. We can also register the containing stage object. A sketch of this detection step follows this list.
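
A rough sketch of items 1 and 3, reusing the hypothetical _ContextualParam class from above; CONTEXTUALS_ARE_ON, _CONTEXTUAL_REGISTRY and _register_contextuals are proposed names, not existing pdpipe symbols, and the factory here just extends the earlier one with the toggle flip.

CONTEXTUALS_ARE_ON = False
_CONTEXTUAL_REGISTRY = {}  # id(stage) -> {attribute name: _ContextualParam}

# base-class attribute names to skip when scanning (the real list would be longer)
_IGNORE_PDPSTAGE_ATT = ['_desc', '_name']


def contextual(key: str, fit: bool = True) -> _ContextualParam:
    global CONTEXTUALS_ARE_ON
    CONTEXTUALS_ARE_ON = True  # turn on the inspection-heavy path for this kernel
    return _ContextualParam(key, fit=fit)


def _register_contextuals(stage) -> None:
    """Meant to run at the end of PdPipelineStage.__init__()."""
    if not CONTEXTUALS_ARE_ON:
        return  # the user never called pdp.contextual; skip the scan entirely
    found = {
        name: val
        for name, val in vars(stage).items()
        if name not in _IGNORE_PDPSTAGE_ATT and isinstance(val, _ContextualParam)
    }
    if found:
        _CONTEXTUAL_REGISTRY[id(stage)] = found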

Then, in a stage's fit_transform and transform methods, if the current stage object is registered for contextual hot-swapping, we find the concrete contextual value of any attribute registered for this stage (in either the self.fit_context object or the self.application_context object that the pipeline injects into all stages during applications of the pipeline) and hot-swap it literally. This will look something like setattr(self, 'k', self.fit_context['pca_k']), since we're at pdpipe.PdPipelineStage.fit_transform(), and the self object is an instance of the subclass requiring the hot swap (in this case, pdp.Decompose).
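
Continuing the same hypothetical module, the hot-swap step could look roughly like this; _hot_swap_contextuals is an invented helper meant to be called from PdPipelineStage.fit_transform() and transform(), not existing pdpipe code.

def _hot_swap_contextuals(stage) -> None:
    """Replaces registered placeholder attributes with concrete context values."""
    registered = _CONTEXTUAL_REGISTRY.get(id(stage), {})
    for attname, param in registered.items():
        context = stage.fit_context if param.fit else stage.application_context
        # e.g. setattr(stage, 'k', stage.fit_context['pca_k']) for a Decompose stage
        setattr(stage, attname, context[param.key])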

@shaypal5 shaypal5 self-assigned this Jul 3, 2022
@shaypal5 shaypal5 changed the title Feature Request: Future params using application context Feature Request: Contextual params using application context Jul 10, 2022