Improve FunctionTransformer diagram representation #29032

Open
timvink opened this issue May 16, 2024 · 8 comments
Comments

@timvink
Contributor

timvink commented May 16, 2024

Describe the workflow you want to enable

Currently, using multiple FunctionTransformers in a pipeline leads to an uninformative view:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame([[1,2,3], [4,5,6]], columns=['one','two','three']) # sample data
def a(df): return df+1 # 1st transformer
def b(df): return df*10 # 2nd transformer

make_pipeline(FunctionTransformer(a), FunctionTransformer(b))

image

I would like to see the name of the function being used in the visual blocks.

Describe your proposed solution

I would like to see something like this:

image

(or perhaps Function(<name of function>) or <name of function>() or FunctionTransformer_<name of function>)

A sample implementation might look like this:

from sklearn.preprocessing import FunctionTransformer
from sklearn.utils._estimator_html_repr import _VisualBlock
from functools import partial

class PrettyFunctionTransformer(FunctionTransformer):
    def _sk_visual_block_(self):
        return _VisualBlock(
            "single",
            self,
            names=self.func.func.__name__ if isinstance(self.func, partial) else self.func.__name__,
            name_details=str(self),
        )

Describe alternatives you've considered, if relevant

No response

Additional context

No response

@timvink timvink added Needs Triage Issue requires triage New Feature labels May 16, 2024
@glemaitre glemaitre removed the Needs Triage Issue requires triage label May 16, 2024
@glemaitre
Member

Some of your suggestions are available when clicking on the transformers:

image

I'm not sure we should special-case FunctionTransformer, since it is a very generic transformer, and pull this information out to display it at the first level.

@timvink
Contributor Author

timvink commented May 17, 2024

The use case I envision is defining a scikit-learn pipeline for feature engineering. Feature engineering should be done on the training data, but also at inference time. If you add it to the model pipeline, you get 1) easier deployment (no preprocessing) and 2) safer pipelines, as feature engineering would be applied on each split separately. If you use the memory argument to cache the feature engineering pipeline, you also don't get the downside of repeating the same computations.
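As a rough sketch of the caching idea mentioned above, assuming a toy dataset and two made-up feature functions (`add_one` and `times_ten` are illustrative names, not from the issue), the `memory` argument of `make_pipeline` can point at a directory where fitted transformers and their outputs are cached:

```python
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def add_one(X):
    # Illustrative feature-engineering step (hypothetical)
    return X + 1


def times_ten(X):
    # Second illustrative step (hypothetical)
    return X * 10


# Toy data, only for demonstration
X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
y = np.array([1.0, 2.0, 3.0])

# Transformer steps are cached on disk in this directory;
# the final estimator is not cached.
cache_dir = tempfile.mkdtemp()

pipe = make_pipeline(
    FunctionTransformer(add_one),
    FunctionTransformer(times_ten),
    LinearRegression(),
    memory=cache_dir,
)
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Refitting the same pipeline on the same data should then reuse the cached transformer results instead of recomputing them.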

These two pipelines are identical, but the visualization on the right is much clearer:

@glemaitre
Member

Convinced. We need to work out the details, but this is definitely better. I still think we should show in some way that this is a FunctionTransformer.

@Charlie-XIAO
Contributor

Lazier way: image
Just @timvink's implementation plus including the class name, wrapping in an inline-block and setting white-space: pre-wrap. Directly fits into the framework.

Harder way: image
Maybe looks a bit better? But this requires altering the structure a bit. In particular, adding a parameter caption to the visual blocks (default None) and rendering it in the HTML.

@glemaitre
Member

I think that I like the harder way better (unfortunately :)).

@timvink
Contributor Author

timvink commented May 31, 2024

I also like 'the harder way' better.

Two further possible improvements:

  1. switch the titles: 'FunctionTransformer' should be the caption and the function names the titles. This way the repetition is in the small font and the transformer func name in the big
  2. show the partial function name. I don't think there is added value in showing that a function is a partial without showing the original function name. It's the same function but with different defaults.. we can just use the .func.__name__. We use partials a lot as we create pipelines from configuration files (using hydra instantiate)

@Charlie-XIAO
Contributor

I think that I like the harder way better (unfortunately :)).

It's actually "fortunately" for me, as I also like the harder way but was afraid people wouldn't think it's worth the complexity 🤣

I also like 'the harder way' better.

Thanks for confirmation.

  1. switch the titles: 'FunctionTransformer' should be the caption and the function names the titles. This way the repetition is in the small font and the transformer func name in the big

This is what I initially did, but then I found the info icon tooltip actually shows "documentation of {name}" which in that case would be "documentation of func name" which I think is improper. I will definitely consider this if I can find an easy way to tweak the info icon tooltip text individually.

  2. show the partial function name. I don't think there is added value in showing that a function is a partial without showing the original function name. It's the same function but with different defaults.. we can just use the .func.__name__. We use partials a lot as we create pipelines from configuration files (using hydra instantiate)

I'm hesitant about this one. I do agree that partial(...) does not provide (sufficient) useful information, but it's hard to cover all corner cases given that partial is not the only other way to construct a function. E.g. np.vectorize would need func.ufunc.__name__. What about partial of partial, partial of partial of partial, vectorize of partial, etc.?

@timvink
Contributor Author

timvink commented Jun 1, 2024

I found the info icon tooltip actually shows "documentation of {name}" which in that case would be "documentation of func name" which I think is improper

Checking the Developer API for HTML representation, it seems we could override the _doc_link_template attribute and the _doc_link_url_param_generator method for FunctionTransformer.

What about partial of partial, partial of partial of partial, vectorize of partial, etc.?

We can implement a recursive function for those edge cases, like so:

sample implementation for `get_function_name`
import numpy as np
import functools

def get_function_name(func):
    """
    Retrieves the name of a function, supporting `np.vectorize` and `functools.partial`, 
    including nested variations.
    """
    # Check if the function has a `__name__` attribute directly
    if hasattr(func, '__name__'):
        return func.__name__
    
    # Check for functools.partial
    if isinstance(func, functools.partial):
        return get_function_name(func.func)
    
    # Check for np.vectorize
    if isinstance(func, np.vectorize):
        return get_function_name(func.pyfunc)
    
    # Check if the function has a `__wrapped__` attribute (for other decorators)
    if hasattr(func, '__wrapped__'):
        return get_function_name(func.__wrapped__)
    
    # If all else fails, return a placeholder name or indication
    return "<unknown_function>"

# Example Usage:
def example_function(x):
    return x

partial_func = functools.partial(example_function, x=2)
vectorized_func = np.vectorize(example_function)
partial_vectorized_func = functools.partial(vectorized_func, x=2)

print(get_function_name(example_function))           # Output: example_function
print(get_function_name(partial_func))               # Output: example_function
print(get_function_name(vectorized_func))            # Output: example_function
print(get_function_name(partial_vectorized_func))    # Output: example_function
print(get_function_name(lambda x: x))    # Output: <lambda>

That will deal with the vast majority of functions and it has a fallback.
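As a complementary, stdlib-only sketch of the same idea: the variant below unwraps iteratively with a depth limit (so a self-referencing wrapper cannot loop forever) and falls back to the class name for callable objects instead of a fixed placeholder. `resolve_callable_name` and `max_depth` are hypothetical names, and the `pyfunc` lookup is a duck-typed assumption about `np.vectorize`-like wrappers, not scikit-learn API.

```python
import functools


def resolve_callable_name(func, max_depth=10):
    """Best-effort display name for a callable (hypothetical helper)."""
    for _ in range(max_depth):
        # Peel off functools.partial layers first
        if isinstance(func, functools.partial):
            func = func.func
            continue
        # Plain functions, builtins and lambdas expose __name__
        name = getattr(func, "__name__", None)
        if name is not None:
            return name
        # Duck-typed wrappers: `pyfunc` (np.vectorize-like, an assumption)
        # or `__wrapped__` (set by functools.wraps-style decorators)
        inner = getattr(func, "pyfunc", None)
        if inner is None:
            inner = getattr(func, "__wrapped__", None)
        if inner is None:
            break
        func = inner
    # Fallback: use the class name of the callable object
    return type(func).__name__


def scale(x, factor=1):
    return x * factor


class Doubler:
    def __call__(self, x):
        return 2 * x


nested = functools.partial(functools.partial(scale, factor=2), factor=3)

print(resolve_callable_name(nested))     # scale
print(resolve_callable_name(Doubler()))  # Doubler
print(resolve_callable_name(len))        # len
```

Falling back to the class name keeps the diagram label informative for callable instances, at the cost of not distinguishing two instances of the same class.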
