Make PageCrawler a "howto" instead of built-in feature? #26

Open
chillu opened this issue Jun 10, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@chillu
Member

chillu commented Jun 10, 2020

Looking at PageCrawler, that's a non-trivial amount of code (50 NCLOC), so it's somewhat valuable in saving devs from writing those particular XPath incantations. But I don't think we need to bake the processing logic that triggers this utility into the module. At the moment, DataObjectDocument has a page_content_field which gets populated conditionally on shouldCrawlPageContent().

My preference would be that we tell people how to add a getter to their Page class (or anything else that could be "crawled") which instantiates a PageCrawler, and to explicitly assign that getter to a field of their choosing in the search schema. The advantage of this approach is that the crawler can be adapted or replaced, e.g. to filter out certain tags, extract image titles, etc. It's also closer to what we already recommend for text extraction on File records.
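A rough sketch of what such a howto could look like, so the idea is easier to picture. Everything named here is an assumption for illustration: `SketchCrawler` stands in for PageCrawler (its real API may differ), `SketchPage` stands in for a project's Page class, and the HTML is passed in directly instead of being fetched by URL. Only PHP's built-in `DOMDocument` and `DOMXPath` are used.

```php
<?php

// Hypothetical stand-in for the module's PageCrawler. The XPath expression is
// configurable, which is the "adapt or replace" point made above.
final class SketchCrawler
{
    /** @var string XPath expression selecting the node to index */
    private $xpath;

    public function __construct(string $xpath = '//main')
    {
        $this->xpath = $xpath;
    }

    public function extractText(string $html): string
    {
        if ($html === '') {
            return '';
        }

        $dom = new DOMDocument();
        // Collect libxml warnings (e.g. for HTML5 tags) instead of emitting them.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $nodes = (new DOMXPath($dom))->query($this->xpath);
        if ($nodes === false || $nodes->length === 0) {
            return '';
        }

        // Collapse runs of whitespace so the indexed text stays compact.
        return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
    }
}

// Hypothetical page class; in a real project this would be the Page model.
class SketchPage
{
    /** @var string rendered HTML, supplied directly for this sketch */
    private $html;

    public function __construct(string $html)
    {
        $this->html = $html;
    }

    // Getter that a field in the search schema could be pointed at.
    public function getCrawledContent(): string
    {
        return (new SketchCrawler('//main'))->extractText($this->html);
    }
}
```

With this shape, swapping `'//main'` for another expression, or replacing the crawler class entirely, is a change in project code rather than in the module.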

@chillu chillu added the enhancement New feature or request label Jun 10, 2020
@chillu
Member Author

chillu commented Jun 10, 2020

If we wanted to avoid hardcoding this logic in DataObjectDocument, we could also generalise this as a "resolver" for a field.

interface DataObjectFieldResolver
{
  public function resolve(DataObject $obj): mixed;
}

Then we could have a PageCrawlerFieldResolver that can be applied through YAML configuration, rather than writing a new getter in PHP, and it can be used on multiple types (not only pages can have URLs). A resolver pattern could also be handy for more complex lookups, e.g. transforming a "many many tags" relationship into an array of tag titles.
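To illustrate the "tags" case, here is a minimal, self-contained sketch of how a resolver might be wired up. The assumptions: the `DataObject` class below is a bare stand-in for SilverStripe's so the snippet runs on its own, `TagTitlesFieldResolver` and `resolveFields()` are hypothetical names, and the `$fieldResolvers` array stands in for whatever the YAML configuration would end up looking like.

```php
<?php

// Bare stand-in for SilverStripe's DataObject, only for this sketch.
class DataObject
{
    public $Title = '';
    /** @var DataObject[] related tag records */
    public $Tags = [];
}

interface DataObjectFieldResolver
{
    // The return type is left undeclared here (rather than `mixed`)
    // so the sketch also runs on PHP 7.
    public function resolve(DataObject $obj);
}

// Hypothetical resolver: flattens a "many many tags" relation to tag titles.
final class TagTitlesFieldResolver implements DataObjectFieldResolver
{
    public function resolve(DataObject $obj)
    {
        return array_map(
            static function (DataObject $tag): string {
                return $tag->Title;
            },
            $obj->Tags
        );
    }
}

// Stand-in for the YAML configuration: search field name => resolver class.
$fieldResolvers = ['tags' => TagTitlesFieldResolver::class];

// Builds the indexed document by running each configured resolver.
function resolveFields(DataObject $obj, array $fieldResolvers): array
{
    $document = [];
    foreach ($fieldResolvers as $field => $resolverClass) {
        $document[$field] = (new $resolverClass())->resolve($obj);
    }
    return $document;
}
```

A PageCrawlerFieldResolver would slot into the same map as just another class name, which is what would make it usable on any type with a URL.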

@chillu chillu mentioned this issue Jun 12, 2020
4 tasks