
Add fit_transform? #18

Open
ablaom opened this issue Mar 2, 2023 · 7 comments
ablaom (Member) commented Mar 2, 2023

See discussion at #16

CameronBieganek commented Oct 30, 2023

I think for transformers it would make sense to require exactly two methods:

  • fit_transform
  • transform

No need for a separate fit implementation for transformers. When an ML pipeline is first fit, all of the transformers in the pipeline have to do a fit and a transformation, so I don't think fit needs to be separate. And of course using fit_transform allows for optimizations, as @davidbp mentioned.

The signature of fit_transform would probably look like this:

fitted_transformer, Xout = fit_transform(transformer, Xin)
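For concreteness, a hypothetical standardizer under this two-method contract might look like the sketch below (Standardizer, FittedStandardizer and the field names are invented for illustration, not part of any proposal):

using Statistics

struct Standardizer end                  # hyperparameter-free "strategy"

struct FittedStandardizer                # the fitted transformer
    mu::Matrix{Float64}
    sigma::Matrix{Float64}
end

function fit_transform(::Standardizer, Xin::AbstractMatrix)
    mu = mean(Xin; dims=1)
    sigma = std(Xin; dims=1)
    # the training data is standardized in the same pass, which is
    # where fit_transform can save work over fit-then-transform
    return FittedStandardizer(mu, sigma), (Xin .- mu) ./ sigma
end

transform(fitted::FittedStandardizer, Xnew::AbstractMatrix) =
    (Xnew .- fitted.mu) ./ fitted.sigma

# usage, matching the signature above:
# fitted_transformer, Xout = fit_transform(Standardizer(), Xin)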

ablaom (Member, Author) commented Oct 31, 2023

Thanks for that suggestion.

> No need for a separate fit implementation for transformers.

I dunno. It seems like a non-trivial complication to the API. No separate fit means some special-casing in model composition. In MLJ we expect every model to have a fit and this is pretty central to the learning network stuff. (If you can stomach it, see our paper here). Conceptually, predict and transform are treated very similarly - they're just functions depending on a learned parameter that you generate with fit.

If fit_transform is just sugar for transform(fit(...)) then I don't think it's justified in a basic interface. Every name we add to the namespace should work hard to justify its existence. Can you think of a use case where the optimisations gained are significant? I can imagine one avoids some data conversions (like DataFrame -> matrix -> DataFrame), but with the right "data front end" (which I'm still thinking about) this issue would have a workaround.
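Spelled out with the fit/transform signatures used elsewhere in this thread, the fallback in question would be the one-liner below (a sketch, not a proposal):

fit_transform(strategy, X) = transform(fit(strategy, X), X)  # generic "sugar" fallback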

[I know some have argued for just one "operation" (predict, say) for added simplification, but this goes too far, in my view. In the current LearnAPI proposal, predict is distinguished by the fact that the output is a (proxy for a) "target", a general notion we make reasonably precise in the docs, and we enable dispatch on the type of proxy. transform need not have this interpretation, but can have an "inverse". As we see from scikit-learn and MLJ, allowing algorithms to implement more than one operation (predict / transform / inverse_transform) is both natural and useful.]

ablaom (Member, Author) commented Oct 31, 2023

Okay, here's a variation on your idea that doesn't require adding to the namespace. Each transformer implements one transform and one fit:

Case 1: static (non-generalizing) transformers

fit(strategy, X) -> model # storing `transformed_X` and any inspectable byproducts of algorithm
transform(model) -> model.transformed_X

with a convenience fallback

transform(strategy, X) = transform(fit(strategy, X))
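As a concrete sketch of Case 1 (all names invented for illustration): a static transformer that drops constant columns does its one-shot work in fit, keeping the dropped column indices as an inspectable byproduct.

struct DropConstantColumns end           # the "strategy"

struct DropConstantColumnsModel{T}       # the "model" returned by fit
    transformed_X::T
    dropped::Vector{Int}                 # inspectable byproduct
end

function fit(::DropConstantColumns, X::AbstractMatrix)
    dropped = [j for j in axes(X, 2) if allequal(view(X, :, j))]
    kept = setdiff(axes(X, 2), dropped)
    return DropConstantColumnsModel(X[:, kept], dropped)
end

transform(model::DropConstantColumnsModel) = model.transformed_X

# the convenience fallback from above:
transform(strategy::DropConstantColumns, X) = transform(fit(strategy, X))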

Case 2: generalizing transformers:

fit(strategy, X) -> model # storing `transformed_X` and `learned_parameters` and any inspectable byproducts of algorithm
transform(model, Xnew) -> transformed_Xnew # uses `model.learned_parameters`

with a convenience fallback

transform(strategy, X) = fit(strategy, X).transformed_X

CameronBieganek commented Oct 31, 2023
I'm not sure we'd want to keep a reference to an intermediate transformed data set in a trained transformer. That would prevent the garbage collector from freeing that memory as long as the pipeline is still around.

It also feels conceptually a little muddy, but that's just a feeling that I haven't been able to put into more concrete terms yet. :)

ablaom (Member, Author) commented Oct 31, 2023

Case 2: generalizing transformers:

fit(strategy, X) -> model # storing `learned_parameters` and any inspectable byproducts of algorithm
transform(model, Xnew) -> transformed_Xnew # uses `model.learned_parameters`

with a convenience fallback

transform(strategy, X) = transform(fit(strategy, X), X)
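Under this revised pattern, a standardizer sketch (same invented names as before, now storing only the learned parameters) would be:

using Statistics

struct Standardize end                   # the "strategy"

struct StandardizeModel                  # stores only `learned_parameters`
    mu::Matrix{Float64}
    sigma::Matrix{Float64}
end

fit(::Standardize, X::AbstractMatrix) =
    StandardizeModel(mean(X; dims=1), std(X; dims=1))

transform(model::StandardizeModel, Xnew::AbstractMatrix) =
    (Xnew .- model.mu) ./ model.sigma

# the convenience fallback, fitting and transforming the same data:
transform(strategy::Standardize, X) = transform(fit(strategy, X), X)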

CameronBieganek commented Oct 31, 2023

Hmm, that form doesn't allow for optimizations, but, as you said, maybe there aren't really that many fit-then-transform cases that get a large benefit from optimizations.

ablaom (Member, Author) commented May 19, 2024

In #30, an implementation can explicitly overload transform(strategy, data) to provide a one-shot method with no issues. I think providing a universal fallback is a bad idea, as it could lead to type instabilities and to confusion in debugging ("hidden knowledge").
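For instance, continuing the invented standardizer from above (Statistics imported as before), an implementation that wants a one-shot method opts in explicitly, here with an optimized single-pass version in place of any generic fallback:

# explicit opt-in: a one-shot method for this strategy only; strategies
# that don't define this simply throw a MethodError, with no hidden magic
function transform(strategy::Standardize, X)
    mu = mean(X; dims=1)
    sigma = std(X; dims=1)
    return (X .- mu) ./ sigma            # no intermediate model is kept
end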
