Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider removing distinction between Static and non-Static APIs #1049

Open
NickCrews opened this issue Jun 8, 2022 · 2 comments
Open

Consider removing distinction between Static and non-Static APIs #1049

NickCrews opened this issue Jun 8, 2022 · 2 comments

Comments

@NickCrews
Copy link
Contributor

From #1045 (comment)

Why do we have the distinction between Static and non-Static classes? Is it to prevent re-training an already trained model? I don't think this needs to be enforced in the API. For example, in scikitlearn you can call fit() over and over again, and the previous learned settings are thrown out each time. Seems like we could enforce a similar contract.

@fgregg do you remember the history of the design choice was made to keep them separate? It must have been an explicit choice for some reason, and I'm thinking I'm just missing it.

Benefits of merging that I see would be

  • simpler code / class heirarchy (medium significance)
  • simpler API for users (low/medium significance)
  • ability to add more training data and continue training between sessions (medium significance?)
@fgregg
Copy link
Contributor

fgregg commented Jun 11, 2022

there's a few issues related to pickling either kind of object, and then some issues related to the differences between the objects.

Common issues

  • the indexes are not really pickle-able. they have very recursive data structures or other properties that mean they can't be serialized by pickle. so right off the bat, the best you can do is something like.
pickle.dump('obj.pickle', obj)
new_obj = pickle.load('obj.pickle')
new_obj.index(data)

Issues related to the distinct classes


thinking through this a bit more, i wonder if an even more radical change might be interesting.

we make the Active Learner class a first class object in the API. we move all the methods related to learning to that class,
and then the train method returns a Dedupe class (similar to the current staticdedupe class) that only has methods for processing records.

ActiveLearner.train() -> Dedupe

This new Dedupe class could be directly pickleable.

There's a nice separation of concerns here, and it would also make it very natural to write a different Learner class that took in static training data, which is harder than it ought to be right now to fit into our current API.


Another way of slicing this up could be to split all the methods related to active learning into an ActiveLearner class, and than have a Dedupe class that had a fit_transform method but didn't keep a hold of any of the data.

Dedupe.fit_transform(**ActiveLearner.training_data)

that could be nice too. unfortunately the data that would need to be passed to the fit_transform method is complex. it's the labeled example data, but also a set of predicate weights.

@NickCrews
Copy link
Contributor Author

I like the idea of splitting up the ActiveLearner portion and the Dedupe/RecordLinkage/Gazeteer model portion.

Not very familiar, but it looks like this is what https://github.com/scikit-activeml/scikit-activeml does. Instead of ActiveLearner.train() -> Dedupe, it could be al = ActiveLearner(Dedupe()); al.train(); al.model.pickle()

did you mean fit_predict() instead of fit_transform()?

Does it make sense to think of this as three different layers?

  1. Model (does predictions, has predict())
  2. Learner (fits the model to specific data, has fit() and maybe partial_fit())
  3. ActiveLearner (decides which data to actually pass to Learner)

sklearn combines parts 1 and 2, and doesn't even touch 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants