Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Make dataobjects index on relation changes #29

Open
unclecheese opened this issue Jun 29, 2020 · 9 comments
Open

RFC: Make dataobjects index on relation changes #29

unclecheese opened this issue Jun 29, 2020 · 9 comments

Comments

@unclecheese
Copy link

unclecheese commented Jun 29, 2020

Currently, the reindexing hook for a dataobject comes from onAfter(Un)Publish, which works fine for native fields, but when relationships have changed, this hook doesn't fire because the native record is not dirty. A good example of this is associating categories to a product via a CheckboxSetField. When the record saves, the manyMany list is updated, but the record itself is not.

There are multiple paths we could take to supporting this:

Option A: Enforce "through" relationships, and add module-level smarts

In this approach, we clearly document that many_many relationships listed as indexed fields must be related via "through". We then expand the dependency tracking API to become aware of these through objects and give them a custom hook, that says, "on save, update the owner index"

Pros

  • Simple opt-in that just works

Cons

  • Many projects still don't use "through" objects, and there is no clear migration path to using them.
  • We'd have to think about batching to ensure that multiple calls to add()/remove() don't instantiate multiple indexing requests. (AFAIK dataobject saves use setByIDList, which is already batched implicitly)
  • Wouldn't support has_many, but this is quite a rare case (usually has_many objects are added outside of page context)

Option B: Use the versioned-snapshots RelationDiffer

We encountered the same issue in versioned snapshots, needing a new snapshot of a page when its list of owned components changed. This was solved with an agnostic service called RelationDiffer (source). However agnostic, it still burdens the user with calculating the list of before and after relation states, which is handled in versioned-snapshots fairly imperatively. Still, extracting those computations into another agnostic service seems really plausible.

Pros:

  • New module that could potentially benefit other projects
  • Tested code that we know works already
  • Works for any type of relationship

Cons:

  • Do we really need another module and yet another dependency to manage?
  • Would require new code for not only search-service, but also versioned-snapshots, if we care about maintainability.

Option C: Injector sledgehammer

We overload ManyManyList, ManyManyThroughList and potentially HasManyList to do some indexing on add() remove() and setByIDList, and jam these implementations into Injector.

Pros:

  • Zero code for the user of the module. It just works
  • Would work with programatic writes as well as CMS user actions
  • Low implementation cost
  • Can do it right now, without waiting on a PR to the module

Cons:

  • Technical debt of maintaining subclasses
  • Inheritance over composition is bad

Option D: Patch the core and hope for a speedy merge

In this approach, we add extend() hooks to the RelationList classes to fire custom hooks on add, remove, and setByIDList. Historically, extend() hooks have been considered patch-level changes, and we're now doing incremental patch releases between recipe releases, which means this could become part of the core API in the very near future.

Pros:

  • Zero code, just works
  • Works with programmatic writes
  • Low implementation cost

Cons:

  • Would need another approach to bridge the gap, which would be a deliberate expansion of technical debt
  • If it doesn't get into a patch release, we're high and dry

cc @StephenMakrogianni @chillu

@michalkleiner
Copy link

Option D pretty please

@madmatt
Copy link
Member

madmatt commented Jun 30, 2020

Yeah I think Option D sounds good as well - as long as we couple it with changing documentation on methods like onAfterPublish etc to also note this for others (and perhaps on docs.silverstripe.org too).

@kinglozzer
Copy link
Member

Some related (recent) discussion to Option D: silverstripe/silverstripe-framework#1923 (comment)

@unclecheese
Copy link
Author

Cheers for that, @kinglozzer .. 10 year old proposal! Wow.

This actually does affect has_many in very meaningful way, and we ran into this with versioned-snapshots as well. ElementalEditor allows the user to do page saves that actually persist the changes to the blocks. This has the potential to really stuff things up with multiple writes to blocks, and firing the page level onAfterPublish.

Based on what I'm reading in that thread, I'm thinking a combination of B and D. The reason for that is because it seems rare that all you would need is a generic hook. More than likely, you'll want access to what changed, and what the changes were. For that, we have a pretty solid API in the versioned-snapshots module for extracting that information, and we should consider moving it into core.

So, that being said:

Phase 1: patch

  • Add onRelationAdded/Removed/Changed hook to all the relevant methods in RelationList
  • Get this into 4.6.1

Phase 2: enhance

  • Add a RelationDiffer instance as a parameter to the onRelationXXX extend hook that provides access to the change information. This would be lazy loaded so that there is no performance impact on every relation change.
$this->extend('onRelationChange', RelationDiffer::create($changes));
public function onRelationChange(RelationDiffer $differ)
{
  $differ->diff();
  $differ->hasChanges();
  $differ->getAdded(); 
  // etc...
  
}
  • Get this into 4.7.0

@sminnee
Copy link
Member

sminnee commented Jul 2, 2020

If not "move it to core", have you considered making versioned-snapshots a necessary part of this feature, and/or module? From my perspective, that would be fine.

@chillu
Copy link
Member

chillu commented Jul 7, 2020

Many projects still don't use "through" objects, and there is no clear migration path to using them.

That's an issue if we want versioned-snapshots to be successful, since it also relies on many_many_through, right? And unlike in this case here, we identified no workarounds for this. If you want versioned snapshots to track many_many, it needs to be with many_many_through.

Wouldn't support has_many, but this is quite a rare case (usually has_many objects are added outside of page context)

BlockArea has_many Block is the canonical use case, right? But there'll be many others, e.g. Gallery has_many GalleryItem. If you change the Block or GalleryItem, you'd expect their parents to be reindexed. I think we need to solve this for has_many and many_many_through at least.

Option B: [...] We'd have to think about batching to ensure that multiple calls to add()/remove() don't instantiate multiple indexing requests. (AFAIK dataobject saves use setByIDList, which is already batched implicitly)

We'd have to think about this with Option D as well, right?

If it doesn't get into a patch release, we're high and dry

I think for these early stage modules, relying on a pre-release branch is fine. If we point out the specific patch required, you can also stay on a stable version and just pull in via composer patches. I don't believe this should influence our decision.

This would be lazy loaded so that there is no performance impact on every relation change

Not sure about that. If you use the current RelationDiffer signature, it required $previousVersionMapping and $currentVersionMapping in the constructor (not lazy). If you're using setByIDList() you're already calculating the former (through column()), but any removals would require changes to the DataList API surface (return values). Calls to remove(), removeMany() and removeAll() would now need to call column() before the operation, regardless if anyone is listening to the event or not. Relationship modification events are worthwhile for core separately from the search and versioned-snapshots use cases, but I can't see how we avoid the performance impact if we provide a RelationDiffer with those events in core.

ManyMany list also doesn't have any API surface to distinguish re-adding existing relationships, or removing unknown records. Those edge cases would both be considered "changes" at the moment. Not a big deal in terms of doing a few extra indexing requests of course.

Option E: Use versioned-snapshots as the source of changes

I was wondering if we could "just" listen to versioned-snapshots creation events, since that already does the hard work of following ownership relationships to create a snapshot. It would also take care of the "batching" aspects, because we are triggering versioned-snapshot creation on higher level events like "save page" or "gridfield action" rather than lowlevel ORM events. We wouldn't know how the changed records were related to each other from inspecting a versioned-snapshot record, but I reckon we don't need need that.

Example (with content from all items being implicitly indexed in a unified Page search index):

Page has_many Block
Page owns Block
Block many_many Tag
Block owns Tag
Block has_one Contact
Block owns Contact
  • Action "Add a Block to Page": Snapshot contains Block, Page
  • Action "Update Contact": Snapshot contains Contact, Block, Page
  • Action "Remove Tag from Block": Snapshot contains Tag, Block, Page

In all of these actions, we would identify that a Page record is involved, and trigger a reindex. This would need to happen regardless which fields are actually indexed on Page though, because it's too expensive to calculate. Which seems like an acceptable limitation? We could provide hooks to make this decisioning more granular in specific implementations. For example, if you weren't indexing the Contact, you could check only for relations you're interested in on the snapshot record (or direct changes to the Page). There's no guarantee that e.g. a Page and Tag record in the same snapshot item are actually relation through a relationship you're indexing, but I'd say it's "close enough".

@madmatt
Copy link
Member

madmatt commented Jul 9, 2020

Yeah I like this idea, and I'm of the opinion that over-indexing (e.g. indexing even if nothing relevant has changed) is better than the alternative.

I think one additional question to ask: What exactly are we indexing? My assumption is that we're indexing Page only for most use-cases, but are we also indexing Block as it's own separate content type? Both options have advantages and disadvantages, and we've come across this before (with fulltextsearch and Solr) so I'm not sure if we want to lock ourselves into a decision here - better to keep options open (e.g. for Option E maybe that means we end up re-indexing both the Block and the Page object - but that's still better than the alternative).

@michalkleiner
Copy link

From our projects' perspectives, we definitely need to keep the options open to a degree that Pages don't need to be indexed, e.g. using the search service just for a register of dataobjects etc.

@chillu
Copy link
Member

chillu commented Jul 9, 2020

What exactly are we indexing? My assumption is that we're indexing Page only for most use-cases, but are we also indexing Block as it's own separate content type?

I think that's up to the implementation. I'd expect that most projects will simply "denormalise" blocks by rendering them out as HTML through the PageCrawler utility built into this module. So taking my example, you couldn't facet by Tags on Blocks, but if you render out the tag labels in your HTML, they'd be findable as part of the Page record. There's nothing stopping you from indexing Blocks or even Tags in this example as separate entities, but in most cases that'll be overkill. One limitation there is that you can only index scalars and arrays of scalars in App Search:

Nested objects are not supported. You must flatten them.
Arrays are supported.

https://swiftype.com/documentation/app-search/api/overview#schema-design

From our projects' perspectives, we definitely need to keep the options open to a degree that Pages don't need to be indexed, e.g. using the search service just for a register of dataobjects etc.

Yeah to my knowledge there's nothing page specific about versioned-snapshots or what we're discussing here, it's just an example. versioned-snapshots hooks into controller rather than model level events, and there's defaults for page editing through CMSMain as well as various GridField actions. So it matter more how you're editing your records, rather than what those records are.

@madmatt madmatt mentioned this issue Apr 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants