Consider removing distinction between `Static` and non-`Static` APIs #1049

NickCrews · 2022-06-08T21:51:44Z

Why do we have the distinction between Static and non-Static classes? Is it to prevent re-training an already trained model? I don't think this needs to be enforced in the API. For example, in scikitlearn you can call fit() over and over again, and the previous learned settings are thrown out each time. Seems like we could enforce a similar contract.

@fgregg do you remember the history of the design choice was made to keep them separate? It must have been an explicit choice for some reason, and I'm thinking I'm just missing it.

Benefits of merging that I see would be

simpler code / class heirarchy (medium significance)
simpler API for users (low/medium significance)
ability to add more training data and continue training between sessions (medium significance?)

fgregg · 2022-06-11T14:12:15Z

there's a few issues related to pickling either kind of object, and then some issues related to the differences between the objects.

Common issues

the indexes are not really pickle-able. they have very recursive data structures or other properties that mean they can't be serialized by pickle. so right off the bat, the best you can do is something like.

pickle.dump('obj.pickle', obj)
new_obj = pickle.load('obj.pickle')
new_obj.index(data)

Issues related to the distinct classes

the active learner apparatus is a very big object. aside from the indexes associated with the learner, it is could all be serialized but it would still be very large. this memory pressure is probably part of what's going on with scoring pairs is much slower after training then after loading settings file. #977

thinking through this a bit more, i wonder if an even more radical change might be interesting.

we make the Active Learner class a first class object in the API. we move all the methods related to learning to that class,
and then the train method returns a Dedupe class (similar to the current staticdedupe class) that only has methods for processing records.

ActiveLearner.train() -> Dedupe

This new Dedupe class could be directly pickleable.

There's a nice separation of concerns here, and it would also make it very natural to write a different Learner class that took in static training data, which is harder than it ought to be right now to fit into our current API.

Another way of slicing this up could be to split all the methods related to active learning into an ActiveLearner class, and than have a Dedupe class that had a fit_transform method but didn't keep a hold of any of the data.

Dedupe.fit_transform(**ActiveLearner.training_data)

that could be nice too. unfortunately the data that would need to be passed to the fit_transform method is complex. it's the labeled example data, but also a set of predicate weights.

NickCrews · 2022-06-11T18:34:55Z

I like the idea of splitting up the ActiveLearner portion and the Dedupe/RecordLinkage/Gazeteer model portion.

Not very familiar, but it looks like this is what https://github.com/scikit-activeml/scikit-activeml does. Instead of ActiveLearner.train() -> Dedupe, it could be al = ActiveLearner(Dedupe()); al.train(); al.model.pickle()

did you mean fit_predict() instead of fit_transform()?

Does it make sense to think of this as three different layers?

Model (does predictions, has predict())
Learner (fits the model to specific data, has fit() and maybe partial_fit())
ActiveLearner (decides which data to actually pass to Learner)

sklearn combines parts 1 and 2, and doesn't even touch 3.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider removing distinction between `Static` and non-`Static` APIs #1049

Consider removing distinction between `Static` and non-`Static` APIs #1049

NickCrews commented Jun 8, 2022

fgregg commented Jun 11, 2022 •

edited

NickCrews commented Jun 11, 2022

Consider removing distinction between Static and non-Static APIs #1049

Consider removing distinction between Static and non-Static APIs #1049

Comments

NickCrews commented Jun 8, 2022

fgregg commented Jun 11, 2022 • edited

Common issues

Issues related to the distinct classes

NickCrews commented Jun 11, 2022

Consider removing distinction between `Static` and non-`Static` APIs #1049

Consider removing distinction between `Static` and non-`Static` APIs #1049

fgregg commented Jun 11, 2022 •

edited