Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prep DataModel for removal #1102

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Sep 23, 2022

Split off from #1098

This makes it easier to create Variable instances from variable definitions, because Variables are responsible for both predicates and distance functions. So, if we split DataModel into these two separate parts, we're going to need to do something like

# In eg Dedupe or RecordLink 
vars = typify_variables(var_defs)
predicates = [v.predicates() for v in vars]
featurizer = Featurizer(vars)

so it would be important that we can create our variable instances once, and then pass those around to the other parts.

It isn't used publicly anywhere,
and it adds more API surface area to worry about. This is still
backwards compatible, as self._len
is still saved/restored
byt pickle the same way, now we just access
it directly instead of going via __len__().
We iterate through the input more than once, so an Iterable
is insufficient.
Before, we never checked for this,
so when we called list.index() we
just got back the first instance.

I swapped us over to a lookup table because that makes more
sense, but I wanted to also add this check because if someone had
duplicate names, this would quitely change the behavior underneath
someone: before it gave the first index, now it gives the last.
These variables
1. you can't currently create them because they inherit from Variable,
not FieldVariable, so they don't appear in the
datamodel.VARIABLE_CLASSES
lookup table
2. You SHOULDN'T be able to instantiate these variables directly
from a variable definition
Before the flow was:
1. create all the primary variables EXCEPT for the interactions
2. Expand those.
3. go back through and NOW create the InteractionVariables

I found this confusing. We parse the var definitions twice in separate
places, and InteractionVariables
are a special case in the first pass.

The other point of this is to make it so that there
is one list of Variable instances
which is the single source of truth. Once we do this
then we can turn all the other private instance variables of DataModel
into @functools.cached_property's,
which will make it much easier to
ensure pickle compatibility in
the future, because
only one variable needs to get
saved and restored
all_variables.append(variable_object)

if only_custom:
no_blocking_variables = all(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing from #1098 (comment)

I think it would be better if this was more duck-typed though, such as
no_blocking_variables = not any(len(v.predicates) for v in variables).

That would be a slight change though, since Variable.predicates is filled with an ExistsPredicate for any var def with the has missing flag. Perhaps this is OK? But I don't think so.

I think really there needs to be a more official API where variables declare the predicates they export.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code that add the ExistsPredicate should only really be on FieldPredicates as it depends on there being a field.

https://github.com/dedupeio/dedupe/blob/main/dedupe/variables/base.py#L37

if we move that code, then we could do pretty much what you are saying.

no_blocking_variables = not any(len(v.predicates) for v in variables)

we might also want to have a check like

only_exists = all(isinstance(pred, ExistsPredicate) for v in variables for pred in v.predicates)
if only_exists:
    warning.warn('You are only using the Exists Predicate which is usually pretty bad')

@coveralls
Copy link

Coverage Status

Coverage increased (+0.03%) to 73.315% when pulling 5482c21 on NickCrews:datamodel-prep-for-removal into 522e7b2 on dedupeio:main.

all_variables += interactions(variable_definitions, self.primary_variables)
variables = typify_variables(variable_definitions)
non_interactions: list[FieldVariable] = [
v for v in variables if not isinstance(v, InteractionType) # type: ignore[misc]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another case where I think a Variable should have a more official API for if they export if they are an Interaction or not. Maybe should think of a better name for this too. Atomic vs derived? leaf vs nonleaf? base vs derived?

no_blocking_variables = all(
isinstance(v, (CustomType, InteractionType)) for v in variables
)
if no_blocking_variables:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this check should happen in DataModel.init, and not in this utility function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems of a piece with the "Incorrect variable specification: variable" error?

var_d = {var.name: var for var in primary_variables}

def _expanded_interactions(variables: list[Variable]) -> list[InteractionType]:
field_vars = {var.name: var for var in variables if isinstance(var, FieldVariable)}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this should feels like it needs a more official API, or should be responsibility of the InteractionType

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can think of a few designs here, but this code is actually doing two things for us, both of which are kind of important.

  1. isolating the interaction features for us, so we can put them in the right place in the sequence of features.
  2. actually resolving the feature indices.

seems like we kind of need to do both.

all_variables.append(variable_object)

if only_custom:
no_blocking_variables = all(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code that add the ExistsPredicate should only really be on FieldPredicates as it depends on there being a field.

https://github.com/dedupeio/dedupe/blob/main/dedupe/variables/base.py#L37

if we move that code, then we could do pretty much what you are saying.

no_blocking_variables = not any(len(v.predicates) for v in variables)

we might also want to have a check like

only_exists = all(isinstance(pred, ExistsPredicate) for v in variables for pred in v.predicates)
if only_exists:
    warning.warn('You are only using the Exists Predicate which is usually pretty bad')

no_blocking_variables = all(
isinstance(v, (CustomType, InteractionType)) for v in variables
)
if no_blocking_variables:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems of a piece with the "Incorrect variable specification: variable" error?

result.extend(variable.higher_vars)
else:
result.append(variable)
return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we added an iter method to Variable so that we could just do something like.

features = []
for variable in variables:
   for feature in variable:
       features.append(feature)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea in general, but I would want to name it to differentiate between predicates and features, since Variables supply both. Perhaps Variable.features?

var_d = {var.name: var for var in primary_variables}

def _expanded_interactions(variables: list[Variable]) -> list[InteractionType]:
field_vars = {var.name: var for var in variables if isinstance(var, FieldVariable)}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can think of a few designs here, but this code is actually doing two things for us, both of which are kind of important.

  1. isolating the interaction features for us, so we can put them in the right place in the sequence of features.
  2. actually resolving the feature indices.

seems like we kind of need to do both.

@@ -58,21 +58,16 @@ def all_subclasses(


class DerivedType(Variable):
type = "Derived"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong but my understanding is:

This would only be used if someone passes in a variable definition like {"type": "Derived"}, and then we would create an instance of DerivedType, but that doesn't make sense because DerivedType is an abstract base class and shouldn't ever be created directly.

def __init__(self, definition: VariableDefinition):
self.name = "(%s: %s)" % (str(definition["name"]), str(definition["type"]))
super(DerivedType, self).__init__(definition)


class MissingDataType(Variable):
type = "MissingData"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did you remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason as for DerivedType

@fgregg
Copy link
Contributor

fgregg commented Sep 24, 2022

thanks for working through this @NickCrews. this is some of the first code i wrote for project and there's a lot i didn't know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants