
Split up Datamodel into predicates, rename to Featurizer #1088

Open
NickCrews opened this issue Sep 1, 2022 · 5 comments
@NickCrews
Contributor

NickCrews commented Sep 1, 2022

Branching off from thoughts at #1079 (comment)

Datamodel currently has two responsibilities:

  • holding candidate predicates
  • calculating feature vectors between pairs

We should split this up. The predicate holder could, I think, just be a plain Python set of predicates. The feature generator could be called Featurizer or something similar (not Comparator, since that sounds too much like it generates a 0-1 score, which is the Classifier's job).

We could still generate these two things from variable definitions, but they should be stored separately. This would help, for instance, in labeler.py, where we pass around a Datamodel willy-nilly when really all we care about is either the predicates (for the BlockLearner) or the feature vectors (for the RLRLearner). We should only pass around the banana, not the entire gorilla holding the banana.
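To make the proposal concrete, here is a minimal sketch of what the split could look like. The names (`Featurizer`, the `featurize` method, the comparator/field layout) are hypothetical, not existing dedupe API; the point is only that variable definitions yield two independent objects instead of one Datamodel.

```python
from typing import Callable, Iterable, Sequence, Tuple

# Hypothetical stand-ins for dedupe's internal types.
Record = dict
Pair = Tuple[Record, Record]
Predicate = Callable[[Record], Iterable[str]]


class Featurizer:
    """Turns record pairs into feature vectors. No scoring happens here."""

    def __init__(
        self,
        comparators: Sequence[Callable[[object, object], float]],
        fields: Sequence[str],
    ):
        self.comparators = comparators
        self.fields = fields

    def featurize(self, pairs: Sequence[Pair]) -> list[list[float]]:
        # One feature per (comparator, field) for each pair of records.
        return [
            [cmp(a[field], b[field]) for cmp, field in zip(self.comparators, self.fields)]
            for a, b in pairs
        ]


# Under this design, variable definitions would produce two separate things:
#   predicates: set[Predicate]  -> handed to the BlockLearner
#   featurizer: Featurizer      -> handed to the RLRLearner
```

The BlockLearner never sees the featurizer and the RLRLearner never sees the predicates, which is the whole point of the split.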

By doing this, it would also allow us to better abstract the two separate steps of

  • Generating feature vectors between pairs (Featurizer)
  • Using these feature vectors to calculate the 0-1 similarity score (Classifier)

For instance, I would like to be able to modify one of these steps, and currently it is a bit difficult to get in the middle and change what's happening. To be precise, I want to add hard rules, such as "if both first name and address match, then mark the similarity as 1". I can add a custom variable that helps with this, but it doesn't guarantee the pair gets a score of 1.
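With the steps separated, a hard rule like the one above could live in a small post-processing function applied after the classifier. This is a sketch with made-up field names and helper, not anything in dedupe today:

```python
from typing import Sequence, Tuple

Record = dict
Pair = Tuple[Record, Record]


def apply_hard_rules(pairs: Sequence[Pair], scores: Sequence[float]) -> list[float]:
    """Force a score of 1.0 when first name and address both match exactly.

    Hypothetical post-classification step; "first_name" and "address"
    are illustrative field names.
    """
    adjusted = []
    for (a, b), score in zip(pairs, scores):
        if a["first_name"] == b["first_name"] and a["address"] == b["address"]:
            adjusted.append(1.0)  # hard rule overrides the classifier
        else:
            adjusted.append(score)
    return adjusted
```

This guarantees the score of 1, which a custom variable feeding the classifier cannot.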

@fgregg
Contributor

fgregg commented Sep 2, 2022

i think that's a nice idea, nick! i would love to see that.

@NickCrews
Contributor Author

NickCrews commented Sep 16, 2022

If we remove the DataModel class, it's going to be difficult to remain backwards compatible with the settings file loading in api.py, e.g. self.data_model = pickle.load(settings_file). I think we would need to keep some skeleton of the DataModel class around in its current location, and add converters to the new architecture. Doable, but gross.

What do you think @fgregg, would this change need to be backwards compatible?

@NickCrews
Contributor Author

NickCrews commented Sep 23, 2022

Thinking about this more, I think it might make sense to combine the Featurizer and Classifier into one object, FeatureScorer, which is a concrete implementation of a Scorer protocol. Dedupe and the higher classes should only operate on the
Scorer protocol, and not know about how under the hood the pairs are getting featurized and classified:

from typing import Protocol

class Scorer(Protocol):
    def score(self, pairs: Pairs) -> Scores:
        ...

class FeatureScorer(Scorer):
    def __init__(self, featurizer, classifier):
        self.featurizer = featurizer
        self.classifier = classifier

    def score(self, pairs):
        # or features = self.featurizer.featurize(pairs) --
        # not sure if it's worth defining an entire protocol
        # for featurizing
        features = self.featurizer(pairs)
        return self.classifier(features)

By doing this, it allows someone (i.e. me :) ) to plug in their own scorer that operates in a different way. For instance, I have some hardcoded rules I want to enforce ("if first names don't match, then give a score of 0"), and this logic is impossible with the featurize/classify pipeline we have now.
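A custom scorer enforcing that rule might wrap any other Scorer and short-circuit it. A sketch, assuming the Scorer protocol above and an illustrative "first_name" field (RuleThenModelScorer is a made-up name):

```python
from typing import Protocol, Sequence, Tuple

Record = dict
Pair = Tuple[Record, Record]


class Scorer(Protocol):
    def score(self, pairs: Sequence[Pair]) -> Sequence[float]:
        ...


class RuleThenModelScorer:
    """Delegates to an inner Scorer, but a hard rule takes precedence."""

    def __init__(self, inner: Scorer):
        self.inner = inner

    def score(self, pairs: Sequence[Pair]) -> list[float]:
        base = self.inner.score(pairs)
        # Hard rule: mismatched first names always score 0, regardless
        # of what the model says.
        return [
            0.0 if a["first_name"] != b["first_name"] else s
            for (a, b), s in zip(pairs, base)
        ]
```

Because Dedupe would only depend on the Scorer protocol, this drops in anywhere a FeatureScorer would.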

@fgregg
Contributor

fgregg commented Sep 24, 2022

i think it's right that the featurizer and classifier are always going to be very coupled. so that makes sense!

@fgregg
Contributor

fgregg commented Sep 24, 2022

one thing i have been thinking about, increasingly, is blocking as a valuable feature for scoring (#1103), so that's a place where you might want to loosen the coupling.
