notifications from resolver CF updates #56

Open · jerch opened this issue Jul 19, 2020 · 2 comments · May be fixed by #58
Labels: enhancement (New feature or request)

jerch (Collaborator) commented Jul 19, 2020

Coming from the discussion in #49: we should establish a way to get notified when CFs are updated by the resolver. There are several ways to achieve this, mainly custom signals and/or method hooks/overloads.

To get reliable behavior that any follow-up code can work with, several things have to be considered before we can place a hook or trigger a custom signal:

  • general resolver layout
  • atomicity of resolver updates
  • needed data in signal follow-up code
  • where and how to declare the signal tracking

general resolver layout

The resolver currently works as a DFS, resolving and updating CFs in the dependency graph on descent. There is no explicit backtracking / ascending work done atm. The logic is as follows (pseudocode):

def update_dependent(input_queryset, update_fields):
    # resolve dependent CFs (queryset construction)
    targets = _queryset_for_update(input_queryset, update_fields)
    # walk nodes on the current dependency tree level
    for target_queryset, computed_fields in targets:
        bulk_updater(target_queryset, computed_fields)

def bulk_updater(target_queryset, computed_fields):
    # queryset-local work: build local MRO, apply select/prefetch lookups
    # walk and recalc all fields in MRO and records (building a [fields x records] matrix)
    ...
    # save in batches with `bulk_update` (Django's bulk_update(objs, fields))
    target_queryset.bulk_update(changed_records, computed_fields)
    if target_queryset:    # stop recursion on empty queryset (needed for cyclic deps)
        # descend to the next level in the dependency tree
        update_dependent(target_queryset, computed_fields)

atomicity of resolver updates

As shown in the resolver layout above, the atomic update data type is a queryset filtered by dependency constraints, updating certain computed fields on that model. This could be an entry point for a custom signal / method hook. But there are several issues with doing it at that level:

  • we are still in the middle of the update cascade, so CFs are only partially synced
  • follow-up code triggered from such a signal would have to be aware of the dependency tree itself and of the current position in the tree, which is really bad and hard to deal with

Imho hooking into DFS runs is a bad idea; it should not be the official way offered by the API. It can still be done by overloading update_dependent or bulk_updater yourself (if you know what you are doing).

From an outer perspective, resolver updates should be atomic for a full update tree, not for its single nodes. Since the updates are spread across several models, this is somewhat hard to achieve. (Directly linked to this is the question of concurrent access and whether the sync state of computed fields can be trusted; also see #55.)

To get atomicity for whole-tree updates working with custom signals/method hooks, we basically could call into them at the end of the top-level update_dependent invocation. The work data would have to be collected somehow along the way (now with backtracking). A sketch of this idea follows.
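A minimal sketch of that idea, assuming the pseudocode layout above; `_local`, `collected`, and `resolver_update_done` are hypothetical names, not existing API:

import threading

_local = threading.local()

def update_dependent(input_queryset, update_fields):
    # detect whether we are the top-level invocation of the DFS
    is_top_level = not getattr(_local, 'in_resolver_run', False)
    if is_top_level:
        _local.in_resolver_run = True
        _local.collected = {}    # container filled along the way by bulk_updater calls
    try:
        targets = _queryset_for_update(input_queryset, update_fields)
        for target_queryset, computed_fields in targets:
            bulk_updater(target_queryset, computed_fields)
    finally:
        if is_top_level:
            _local.in_resolver_run = False
            # fire the signal / hook once, with data for the whole update tree
            resolver_update_done(_local.collected)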

needed data in signal follow-up code

With a signal on the whole-tree update level, this question gets ugly, since the backtracking would have to carry the updated data along. My suggestion, to not waste too much memory: every node in the update tree simply places entries with pks into a container. The entries could look like this:

{
  model1: {
    'comp1': set(pks_affected),
    'comp2': set(other_pks),
  },
  model2: {
    'compX': set(some_pks)
  }
}

The signal could finally receive the container as an argument and work with it.
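A minimal sketch of how nodes could aggregate into such a container during backtracking (the helper name merge_into_container is made up for illustration):

def merge_into_container(container, model, computed_fields, pks):
    # container: {model: {cf_name: set_of_pks}}, built once per top-level run
    entry = container.setdefault(model, {})
    for cf_name in computed_fields:
        entry.setdefault(cf_name, set()).update(pks)

# every bulk_updater call would contribute its updated pks, e.g.:
# merge_into_container(container, SomeModel, ['comp1', 'comp2'], changed_pks)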

where and how to declare the signal tracking

Some ideas regarding this:

  • data tracking should be off by default (to avoid wasting memory on pointless data collection)
  • place a keyword argument on @computed like resolver_signal=True to indicate that its updates should be collected during resolver runs
  • the resolver has one resolver-wide default signal, which would trigger if data was collected (see the sketch below)

If we run into memory issues for quite big updates (really huge pk lists), we might have to find a plan B on how to aggregate the updated data.
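From user code, that could look roughly like this; post_resolver_update is only a placeholder name for the resolver-wide default signal, and resolver_signal=True is the proposed (not yet existing) kwarg on @computed:

from django.dispatch import Signal

# placeholder for the resolver-wide default signal,
# fired once at the end of a top-level resolver run
post_resolver_update = Signal()

def on_resolver_update(sender, data, **kwargs):
    # data: the collected container, e.g. {model: {'compX': set_of_pks}},
    # holding only CFs declared with the proposed resolver_signal=True
    for model, per_field in data.items():
        for cf_name, pks in per_field.items():
            ...    # e.g. invalidate caches, push notifications

post_resolver_update.connect(on_resolver_update)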

@mobiware, @olivierdalang Up for discussion.

mobiware (Contributor) commented

@jerch I feel the proposed structure would require a bit of work if we want to know which model instances have been changed.

I was thinking of a structure more like:

{
    model1: {
        pk1: ["changed_field", "other_changed_field"],
        pk2: ["recomputed_field"]
    },
    model2: {
        pkN: ["changed_field", "computed_field"]
    }
}

i.e. the 1st-level structure is a dict with models as keys, and the 2nd-level structure has model instance PKs as keys and lists of changed fields as values.
That would make it straightforward to know:

  • which model objects have been impacted
  • for each model object, which fields have been impacted
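For illustration, a minimal sketch of pivoting the {model: {cf_name: set_of_pks}} proposal from above into this pk-keyed shape (the helper name pivot_to_pk_keyed is made up):

def pivot_to_pk_keyed(data):
    # {model: {cf_name: set_of_pks}}  -->  {model: {pk: [cf_names]}}
    result = {}
    for model, per_field in data.items():
        per_pk = result.setdefault(model, {})
        for cf_name, pks in per_field.items():
            for pk in pks:
                per_pk.setdefault(pk, []).append(cf_name)
    return result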

jerch (Collaborator, Author) commented Jul 21, 2020

@mobiware I am afraid that a data structure shaped this way will bloat memory by holding CF names in individual lists for every changed PK. In #58 I have implemented a first version, which simply aggregates as follows:

{
    model1: {
        fieldset1: set_of_pks_1,
        fieldset2: set_of_pks_2,
    },
    model2: {...}
}

This is quite efficient, as it stores the PKs directly from the bulk_updater calls. Here the CFs are unlikely to have field intersections, as the TR on graph level tries to eliminate multiple field accesses within a subtree. Of course a single PK can be contained in several sets, as a CF further down in the tree will be updated later.

While this data structure is quite efficient for aggregation and for being passed along, it is still somewhat hard to extract certain bits of information (see the sketch below):

  • all altered pks on a model are the union of the pk sets
  • all altered pks on a model for a certain CF are the union of the pk sets whose fieldsets contain that CF
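A minimal sketch of those two extractions on the {model: {fieldset: set_of_pks}} aggregation, assuming the fieldsets are hashable collections of CF names (the helper names are made up):

def altered_pks(data, model):
    # all altered pks on a model: union of all its pk sets
    result = set()
    for pks in data.get(model, {}).values():
        result |= pks
    return result

def altered_pks_for_cf(data, model, cf_name):
    # all altered pks on a model for a certain CF:
    # union of the pk sets whose fieldset contains that CF
    result = set()
    for fieldset, pks in data.get(model, {}).items():
        if cf_name in fieldset:
            result |= pks
    return result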

I also thought about a signal to register for a CF update directly, something like this:

def handler(sender, pks, **kwargs):
    pass

post_field_update.connect(handler, sender=(model, 'compX'))

so it would get all pks readily aggregated from updates of that field (currently not implemented).

But at this point I have no clue about the typical signal usage. I see several levels of interesting data here:

  • models involved in the update
  • pks on models being touched
  • fields on models
  • pks on fields on models
