Looking at PageCrawler, that's a non-trivial amount of code (about 50 NCLOC), so it's somewhat valuable for devs to avoid writing those particular XPath incantations. But I don't think we need to bake the processing logic that triggers this utility into the module. At the moment, DataObjectDocument has a page_content_field which gets populated conditionally on shouldCrawlPageContent().
My preference would be that we tell people how to add a getter to their Page class (or anything else that could be "crawled") which can instantiate a PageCrawler, and explicitly assign it to a field in the search schema of their choosing. The advantage of this approach is that the crawler can be adapted or replaced, e.g. filtering out certain tags, extracting image titles, etc. It's also closer to what we already recommend for text extraction on File records.
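To make the getter idea concrete, here is a rough sketch rather than the module's actual API. ExamplePageCrawler and ExamplePage are hypothetical stand-ins (so the snippet runs without Silverstripe), the `//main` selector is an assumed default, and getCrawledContent() is just an illustrative method name that a schema field could point at.

```php
<?php

// Hypothetical stand-in for the module's PageCrawler, so this sketch runs on its own.
class ExamplePageCrawler
{
    // $selector is an XPath expression for the element to index (assumed default).
    public function __construct(private string $selector = '//main') {}

    // Extract plain text from the first element matching the selector.
    public function extractText(string $html): string
    {
        $dom = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // tolerate HTML5 tags
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $nodes = (new DOMXPath($dom))->query($this->selector);
        if ($nodes === false || $nodes->length === 0) {
            return '';
        }

        // Collapse runs of whitespace into single spaces.
        return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
    }
}

// Hypothetical page class; in a real project this would be your own Page subclass.
class ExamplePage
{
    // The rendered HTML is passed in here to avoid a real HTTP request.
    public function __construct(private string $renderedHtml) {}

    // Getter that a search schema field could be mapped to explicitly.
    public function getCrawledContent(): string
    {
        return (new ExamplePageCrawler('//main'))->extractText($this->renderedHtml);
    }
}

$page = new ExamplePage(
    '<html><body><nav>Menu</nav><main><p>Hello World</p></main></body></html>'
);
echo $page->getCrawledContent(), PHP_EOL;
```

Because the crawler is created inside the getter, a project could swap in a different selector or a subclass that filters tags without touching the module.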
Then we could have a PageCrawlerFieldResolver that can be applied through YAML configuration, rather than writing a new getter in PHP, and it can be used on multiple types (not only pages can have URLs). A resolver pattern could also be handy for more complex lookups, e.g. transforming a "many many tags" relationship into an array of tag titles.
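A resolver pattern along these lines might look roughly like the sketch below. Everything in it is an assumption for illustration: the FieldResolver interface, the CrawledContentResolver and RelationTitlesResolver classes, and the plain PHP array standing in for what would be YAML configuration. The HTML fetcher is injected as a closure so the snippet runs without any HTTP request.

```php
<?php

// Hypothetical contract: turn a record into the value for one search field.
interface FieldResolver
{
    public function resolve(object $record): mixed;
}

// Hypothetical resolver for anything with a URL, not only pages.
class CrawledContentResolver implements FieldResolver
{
    // $fetchHtml maps a URL to rendered HTML; its signature is an assumption.
    public function __construct(private Closure $fetchHtml) {}

    public function resolve(object $record): string
    {
        $html = ($this->fetchHtml)($record->Link);
        return trim(strip_tags($html));
    }
}

// Hypothetical resolver flattening a "many many" relation into an array of titles.
class RelationTitlesResolver implements FieldResolver
{
    public function __construct(private string $relation) {}

    public function resolve(object $record): array
    {
        $titles = [];
        foreach ($record->{$this->relation} as $related) {
            $titles[] = $related->Title;
        }
        return $titles;
    }
}

// Stand-in for the YAML mapping of schema fields to resolvers.
$fieldResolvers = [
    'content' => new CrawledContentResolver(
        fn (string $url): string => '<main><p>Hello World</p></main>'
    ),
    'tags' => new RelationTitlesResolver('Tags'),
];

// Plain object standing in for a DataObject with a link and a tags relation.
$record = (object) [
    'Link' => '/about-us',
    'Tags' => [(object) ['Title' => 'News'], (object) ['Title' => 'Events']],
];

// Build the search document by running each configured resolver.
$document = [];
foreach ($fieldResolvers as $field => $resolver) {
    $document[$field] = $resolver->resolve($record);
}
print_r($document);
```

The point of the indirection is that the same resolver can be attached to several record types through configuration, while projects with unusual needs can still register their own implementation.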