Make PageCrawler a "howto" instead of built-in feature? #26

chillu · 2020-06-10T01:32:12Z

Looking at PageCrawler, that's a non-trivial amount of code (50 NCLOC), so it's somewhat valuable for devs to avoid writing those particular XPath incantations. But I don't think we need to bake the actual processing logic that triggers this utility into the module. At the moment, DataObjectDocument has a page_content_field which gets populated conditionally on shouldCrawlPageContent().

My preference would be that we tell people how to add a getter to their Page class (or anything else that could be "crawled") which can instantiate a PageCrawler, and explicitly assign it to a field in the search schema of their choosing. The advantage of this approach is that the crawler can be adapted or replaced, e.g. to filter out certain tags, extract image titles, etc. It's also closer to what we already recommend for text extraction on File records.
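To illustrate why an adaptable crawler is attractive, here is a minimal sketch of a replaceable extractor with a configurable list of tags to filter out. `SimpleContentExtractor` and its `$skipTags` parameter are hypothetical names for this example, not the module's PageCrawler, and it only relies on PHP's bundled DOM extension.

```php
<?php

// Hypothetical helper, not the module's PageCrawler: extracts plain text
// from rendered HTML and drops tags that should not end up in the index.
final class SimpleContentExtractor
{
    /** @var string[] */
    private $skipTags;

    public function __construct(array $skipTags = ['script', 'style', 'nav'])
    {
        $this->skipTags = $skipTags;
    }

    public function extract(string $html): string
    {
        $dom = new DOMDocument();
        // Collect libxml warnings (e.g. for HTML5 tags) instead of emitting them.
        $previous = libxml_use_internal_errors(true);
        // The XML declaration prefix hints UTF-8 to the HTML parser.
        $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($dom);
        foreach ($this->skipTags as $tag) {
            // Copy matches to an array first so removal cannot disturb iteration.
            foreach (iterator_to_array($xpath->query('//' . $tag)) as $node) {
                $node->parentNode->removeChild($node);
            }
        }

        $body = $xpath->query('//body')->item(0);
        $text = $body ? $body->textContent : '';

        // Collapse runs of whitespace into single spaces.
        return trim(preg_replace('/\s+/', ' ', $text));
    }
}

// Usage with an inline HTML fragment:
$html = '<nav>Menu</nav><p>Hello   world</p><script>track()</script>';
$text = (new SimpleContentExtractor())->extract($html);
```

A getter on the Page class could return the result of `extract()` for the page's rendered HTML, and the search schema could map a field to that getter. How the HTML is rendered and how the field is configured depend on the module and are not shown here.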


chillu · 2020-06-10T01:42:47Z

If we wanted to avoid hardcoding this logic in DataObjectDocument, we could also generalise this as a "resolver" for a field.

```php
use SilverStripe\ORM\DataObject;

interface DataObjectFieldResolver
{
    // Returns the value to index for the given record.
    public function resolve(DataObject $obj): mixed;
}
```

Then we could have a PageCrawlerFieldResolver that can be applied through YAML configuration, rather than writing a new getter in PHP, and it can be used on multiple types (not only pages can have URLs). A resolver pattern could also be handy for more complex lookups, e.g. transforming a "many many tags" relationship into an array of tag titles.
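As a rough sketch of the "tags" case, a resolver could look like the following. `FieldResolver` is a stand-in for the proposed DataObjectFieldResolver, typed against plain objects so the example runs without SilverStripe. `TagTitlesResolver`, the `Tags` relation name and the `Title` property are assumptions for illustration.

```php
<?php

// Stand-in for the proposed DataObjectFieldResolver; a real version would
// type-hint DataObject instead of object.
interface FieldResolver
{
    /** @return mixed */
    public function resolve(object $obj);
}

// Hypothetical resolver: turns a "many many tags" relation into tag titles.
final class TagTitlesResolver implements FieldResolver
{
    /** @var string */
    private $relation;

    public function __construct(string $relation = 'Tags')
    {
        $this->relation = $relation;
    }

    public function resolve(object $obj)
    {
        // Assumes the relation is exposed as a method returning an iterable
        // of records that each have a Title property.
        $titles = [];
        foreach ($obj->{$this->relation}() as $tag) {
            $titles[] = $tag->Title;
        }
        return $titles;
    }
}

// Usage with stand-in objects instead of real records:
$tagA = new stdClass();
$tagA->Title = 'php';
$tagB = new stdClass();
$tagB->Title = 'search';

$post = new class([$tagA, $tagB]) {
    private $tags;

    public function __construct(array $tags)
    {
        $this->tags = $tags;
    }

    public function Tags(): array
    {
        return $this->tags;
    }
};

$titles = (new TagTitlesResolver())->resolve($post);
// $titles is ['php', 'search']
```

Because the resolver is a separate class rather than a getter, the same one could be attached to several record types. The YAML wiring is left out, since its format is not defined in this thread.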

chillu added the enhancement (New feature or request) label on Jun 10, 2020

chillu mentioned this issue on Jun 12, 2020 in "Allow full page indexing #9" (closed, 4 tasks)
