
Split up Datamodel into predicates, rename to Featurizer #1088

Open
NickCrews opened this issue Sep 1, 2022 · 5 comments
@NickCrews
Contributor

NickCrews commented Sep 1, 2022

Branching off from thoughts at #1079 (comment)

Datamodel currently has two responsibilities:

  • holding candidate predicates
  • calculating feature vectors between pairs

We should split this up. The predicate holder could, I think, just be a plain Python set of predicates. The feature generator could be called Featurizer or something similar (not Comparator, since that sounds too much like it generates a 0-1 score, which is the Classifier's job).

We could still generate these two things from variable definitions, but they should be stored separately. This would help, for instance, in labeler.py, where we pass around a Datamodel willy-nilly when really all we care about is either the predicates (for the BlockLearner) or the feature vectors (for the RLRLearner). We should only pass around the banana, not the entire gorilla holding the banana.
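To make the proposal concrete, here is a minimal sketch of what the split could look like. The names (`Featurizer`, the `featurize` method, the comparator/field layout) are hypothetical, not existing dedupe API; the point is only that variable definitions yield two independent objects instead of one Datamodel.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical stand-ins for dedupe's internal types.
Record = dict
Pair = Tuple[Record, Record]
Predicate = Callable[[Record], Iterable[str]]


class Featurizer:
    """Turns record pairs into feature vectors. No scoring happens here."""

    def __init__(
        self,
        comparators: Sequence[Callable[[object, object], float]],
        fields: Sequence[str],
    ):
        self.comparators = comparators
        self.fields = fields

    def featurize(self, pairs: Sequence[Pair]) -> list[list[float]]:
        # One feature per (comparator, field) for each pair of records.
        return [
            [cmp(a[field], b[field]) for cmp, field in zip(self.comparators, self.fields)]
            for a, b in pairs
        ]


# Under this design, variable definitions would produce two separate things:
#   predicates: set[Predicate]  -> handed to the BlockLearner
#   featurizer: Featurizer      -> handed to the RLRLearner
```

The BlockLearner never sees the featurizer and the RLRLearner never sees the predicates, which is the whole point of the split.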

By doing this, it would also allow us to better abstract the two separate steps of

  • Generating feature vectors between pairs (Featurizer)
  • Using these feature vectors to calculate the 0-1 similarity score (Classifier)

For instance, I would like to be able to modify one of these steps, and currently it is a bit difficult to get in the middle and change what's happening. To be precise, I want to add hard rules, such as "if both first name and address match, then mark the similarity as 1". I can add a custom variable that helps with this, but it doesn't guarantee the pair gets a score of 1.
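With the steps separated, a hard rule like the one above could live in a small post-processing function applied after the classifier. This is a sketch with made-up field names and helper, not anything in dedupe today:

```python
from typing import Sequence, Tuple

Record = dict
Pair = Tuple[Record, Record]


def apply_hard_rules(pairs: Sequence[Pair], scores: Sequence[float]) -> list[float]:
    """Force a score of 1.0 when first name and address both match exactly.

    Hypothetical post-classification step; "first_name" and "address"
    are illustrative field names.
    """
    adjusted = []
    for (a, b), score in zip(pairs, scores):
        if a["first_name"] == b["first_name"] and a["address"] == b["address"]:
            adjusted.append(1.0)  # hard rule overrides the classifier
        else:
            adjusted.append(score)
    return adjusted
```

This guarantees the score of 1, which a custom variable feeding the classifier cannot.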

@fgregg
Contributor

fgregg commented Sep 2, 2022

i think that's a nice idea, nick! i would love to see that.

@NickCrews
Contributor Author

NickCrews commented Sep 16, 2022

If we remove the DataModel class, it's going to be difficult to remain backwards compatible with the settings file loading in api.py, e.g. self.data_model = pickle.load(settings_file). I think we would need to keep some skeleton of the DataModel class around in its current location, and add converters to the new architecture. Doable, but gross.

What do you think @fgregg, would this change need to be backwards compatible?

@NickCrews
Contributor Author

NickCrews commented Sep 23, 2022

Thinking about this more, I think it might make sense to combine the Featurizer and Classifier into one object, FeatureScorer, which is a concrete implementation of a Scorer protocol. Dedupe and the higher classes should only operate on the
Scorer protocol, and not know about how under the hood the pairs are getting featurized and classified:

from typing import Protocol

class Scorer(Protocol):
    def score(self, pairs: Pairs) -> Scores:
        ...

class FeatureScorer(Scorer):
    def __init__(self, featurizer, classifier):
        self.featurizer = featurizer
        self.classifier = classifier

    def score(self, pairs):
        # or features = self.featurizer.featurize(pairs) --
        # not sure if it's worth defining an entire protocol
        # for featurizing
        features = self.featurizer(pairs)
        return self.classifier(features)

By doing this, it allows someone (i.e. me :) ) to plug in their own scorer that operates in a different way. For instance, I have some hardcoded rules I want to enforce ("if first names don't match, then give a score of 0"), and this logic is impossible with the featurize/classify pipeline we have now.
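A custom scorer enforcing that rule might wrap any other Scorer and short-circuit it. A sketch, assuming the Scorer protocol above and an illustrative "first_name" field (RuleThenModelScorer is a made-up name):

```python
from typing import Protocol, Sequence, Tuple

Record = dict
Pair = Tuple[Record, Record]


class Scorer(Protocol):
    def score(self, pairs: Sequence[Pair]) -> Sequence[float]:
        ...


class RuleThenModelScorer:
    """Delegates to an inner Scorer, but a hard rule takes precedence."""

    def __init__(self, inner: Scorer):
        self.inner = inner

    def score(self, pairs: Sequence[Pair]) -> list[float]:
        base = self.inner.score(pairs)
        # Hard rule: mismatched first names always score 0, regardless
        # of what the model says.
        return [
            0.0 if a["first_name"] != b["first_name"] else s
            for (a, b), s in zip(pairs, base)
        ]
```

Because Dedupe would only depend on the Scorer protocol, this drops in anywhere a FeatureScorer would.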

@fgregg
Contributor

fgregg commented Sep 24, 2022

i think it's right that the featurizer and classifier are always going to be very coupled. so that makes sense!

@fgregg
Contributor

fgregg commented Sep 24, 2022

one thing i have been thinking about, increasingly, is blocking as a valuable feature for scoring (#1103), so that's a place where you might want to loosen the coupling.
