notifications from resolver CF updates #56

Open · jerch opened this issue Jul 19, 2020 · 2 comments · May be fixed by #58
Labels: enhancement (New feature or request)

jerch (Collaborator) commented Jul 19, 2020

Coming from the discussion in #49: we should establish a way to get notified when CFs are updated by the resolver. There are several ways to achieve this, mainly custom signals and/or method hooks/overloads.

To get reliable behavior that any follow-up code can work with, several things have to be considered before we can place a hook or trigger a custom signal:

  • general resolver layout
  • atomicity of resolver updates
  • needed data in signal follow-up code
  • where and how to declare the signal tracking

general resolver layout

The resolver currently works as a DFS, resolving and updating CFs in the dependency graph on descent. There is no explicit backtracking / ascending work done atm. The logic is as follows (pseudocode):

def update_dependent(input_queryset, update_fields):
    # resolve dependent CFs (queryset construction)
    targets = _queryset_for_update(input_queryset, update_fields)
    # walk nodes on the current dependency tree level
    for target_queryset, computed_fields in targets:
        bulk_updater(target_queryset, computed_fields)

def bulk_updater(target_queryset, computed_fields):
    # queryset-local work: build local MRO, apply select/prefetch lookups
    # walk and recalc all fields in MRO and records (building a [fields x records] matrix)
    ...
    # save in batches with `bulk_update` (Django's bulk_update(objs, fields))
    target_queryset.bulk_update(changed_records, computed_fields)
    if target_queryset:    # stop recursion on empty queryset (needed for cyclic deps)
        # descend to the next level in the dependency tree
        update_dependent(target_queryset, computed_fields)

atomicity of resolver updates

As shown in the resolver layout above, the atomic update data type is a queryset filtered by dependency constraints, updating certain computed fields on that model. This could be an entry point for a custom signal / method hook. But there are several issues with doing it at that level:

  • we are still in the middle of the update cascade, so CFs are only partially synced
  • follow-up code triggered from such a signal would have to be aware of the dependency tree itself and of the current position in the tree, which is really bad and hard to deal with

Imho hooking into DFS runs is a bad idea; it should not be the official way offered by the API. It can still be done by overloading update_dependent or bulk_updater yourself (if you know what you are doing).

From an outer perspective, resolver updates should be atomic for a full update tree, not for its single nodes. Since the updates are spread across several models, this is somewhat hard to achieve. (Directly linked to this is the question of concurrent access and whether the sync state of computed fields can be trusted; also see #55.)

To get atomicity for whole-tree updates working with custom signals/method hooks, we basically could call into them at the end of the top-level update_dependent invocation. The work data would have to be collected somehow along the way (now with backtracking). A sketch of this idea follows.
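A minimal sketch of that idea, assuming the pseudocode layout above; `_local`, `collected`, and `resolver_update_done` are hypothetical names, not existing API:

import threading

_local = threading.local()

def update_dependent(input_queryset, update_fields):
    # detect whether we are the top-level invocation of the DFS
    is_top_level = not getattr(_local, 'in_resolver_run', False)
    if is_top_level:
        _local.in_resolver_run = True
        _local.collected = {}    # container filled along the way by bulk_updater calls
    try:
        targets = _queryset_for_update(input_queryset, update_fields)
        for target_queryset, computed_fields in targets:
            bulk_updater(target_queryset, computed_fields)
    finally:
        if is_top_level:
            _local.in_resolver_run = False
            # fire the signal / hook once, with data for the whole update tree
            resolver_update_done(_local.collected)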

needed data in signal follow-up code

With a signal on the whole-tree update level, this question gets ugly, since the backtracking would have to carry the updated data along. My suggestion, to not waste too much memory: every node in the update tree simply places entries with pks into a container. The entries could look like this:

{
  model1: {
    'comp1': set(pks_affected),
    'comp2': set(other_pks),
  },
  model2: {
    'compX': set(some_pks)
  }
}

The signal could finally receive the container as an argument and work with it.
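A minimal sketch of how nodes could aggregate into such a container during backtracking (the helper name merge_into_container is made up for illustration):

def merge_into_container(container, model, computed_fields, pks):
    # container: {model: {cf_name: set_of_pks}}, built once per top-level run
    entry = container.setdefault(model, {})
    for cf_name in computed_fields:
        entry.setdefault(cf_name, set()).update(pks)

# every bulk_updater call would contribute its updated pks, e.g.:
# merge_into_container(container, SomeModel, ['comp1', 'comp2'], changed_pks)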

where and how to declare the signal tracking

Some ideas regarding this:

  • data tracking should be off by default (to avoid wasting memory on pointless data collection)
  • place a keyword argument on @computed like resolver_signal=True to indicate that its updates should be collected during resolver runs
  • the resolver has one resolver-wide default signal, which would trigger if data was collected (see the sketch below)

If we run into memory issues for quite big updates (really huge pk lists), we might have to find a plan B on how to aggregate the updated data.
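From user code, that could look roughly like this; post_resolver_update is only a placeholder name for the resolver-wide default signal, and resolver_signal=True is the proposed (not yet existing) kwarg on @computed:

from django.dispatch import Signal

# placeholder for the resolver-wide default signal,
# fired once at the end of a top-level resolver run
post_resolver_update = Signal()

def on_resolver_update(sender, data, **kwargs):
    # data: the collected container, e.g. {model: {'compX': set_of_pks}},
    # holding only CFs declared with the proposed resolver_signal=True
    for model, per_field in data.items():
        for cf_name, pks in per_field.items():
            ...    # e.g. invalidate caches, push notifications

post_resolver_update.connect(on_resolver_update)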

@mobiware, @olivierdalang Up for discussion.

mobiware (Contributor) commented

@jerch I feel the proposed structure would require a bit of work if we want to know which model instances have been changed.

I was thinking of a structure more like:

{
    model1: {
        pk1: ["changed_field", "other_changed_field"],
        pk2: ["recomputed_field"]
    },
    model2: {
        pkN: ["changed_field", "computed_field"]
    }
}

i.e. the 1st-level structure is a dict with models as keys, and the 2nd-level structure has model instance PKs as keys and lists of changed fields as values.
That would make it straightforward to know:

  • which model objects have been impacted
  • for each model object, which fields have been impacted
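For illustration, a minimal sketch of pivoting the {model: {cf_name: set_of_pks}} proposal from above into this pk-keyed shape (the helper name pivot_to_pk_keyed is made up):

def pivot_to_pk_keyed(data):
    # {model: {cf_name: set_of_pks}}  -->  {model: {pk: [cf_names]}}
    result = {}
    for model, per_field in data.items():
        per_pk = result.setdefault(model, {})
        for cf_name, pks in per_field.items():
            for pk in pks:
                per_pk.setdefault(pk, []).append(cf_name)
    return result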

jerch (Collaborator, Author) commented Jul 21, 2020

@mobiware I am afraid that a data structure shaped this way will bloat memory by holding CF names in individual lists for every changed PK. In #58 I have implemented a first version, which simply aggregates as follows:

{
    model1: {
        fieldset1: set_of_pks_1,
        fieldset2: set_of_pks_2,
    },
    model2: {...}
}

This is quite efficient, as it stores the PKs directly from the bulk_updater calls. Here the CFs are unlikely to have field intersections, as the TR on graph level tries to eliminate multiple field accesses within a subtree. Of course a single PK can be contained in several sets, as a CF further down in the tree will be updated later.

While this data structure is quite efficient for aggregation and for being passed along, it is still somewhat hard to extract certain bits of information (see the sketch below):

  • all altered pks on a model are the union of the pk sets
  • all altered pks on a model for a certain CF are the union of the pk sets whose fieldsets contain that CF
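A minimal sketch of those two extractions on the {model: {fieldset: set_of_pks}} aggregation, assuming the fieldsets are hashable collections of CF names (the helper names are made up):

def altered_pks(data, model):
    # all altered pks on a model: union of all its pk sets
    result = set()
    for pks in data.get(model, {}).values():
        result |= pks
    return result

def altered_pks_for_cf(data, model, cf_name):
    # all altered pks on a model for a certain CF:
    # union of the pk sets whose fieldset contains that CF
    result = set()
    for fieldset, pks in data.get(model, {}).items():
        if cf_name in fieldset:
            result |= pks
    return result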

I also thought about a signal to register for a CF update directly, something like this:

def handler(sender, pks, **kwargs):
    pass

post_field_update.connect(handler, sender=(model, 'compX'))

so it would get all pks readily aggregated from updates of that field (currently not implemented).

But at this point I have no clue about the typical signal usage. I see several levels of interesting data here:

  • models involved in the update
  • pks on models being touched
  • fields on models
  • pks on fields on models
