Issue Summary
Currently, when indexing RichTextFields, or RichTextBlocks within StreamFields, the raw content including HTML tags is passed to the index. Since HTML tag names and attributes are not meaningful searchable content*, it would be better to provide a `get_searchable_content` method on these that applies Django's `striptags` filter.
More importantly, if you customise the Elasticsearch backend (or run raw Elasticsearch queries) to take advantage of Elasticsearch's highlighting support (see #5340), you currently end up with either stray HTML tags or spurious formatting in the results, depending on whether you passed `'encoder': 'html'` in the query.
(* This isn't totally true - for example, you could make a good case for indexing the alt text on images - so an ideal solution would probably involve being able to specify custom rules at the richtext feature level to generate a plain text representation. However, on balance, I think stripping tags out is better than keeping them in.)
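To make the proposal concrete, here is a minimal sketch of a `get_searchable_content` that strips tags before indexing. It is not Wagtail's implementation: `PlainTextRichTextBlock` and `strip_tags_standin` are hypothetical names, the list return shape is an assumption, and the stdlib `HTMLParser` stands in for Django's `striptags` so the snippet runs without Django installed.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; tag names and attributes are discarded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_standin(html):
    """Stand-in for Django's striptags (assumption: not the real filter)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces and collapse runs of whitespace, so that
    # text from adjacent block elements does not fuse into one word.
    return " ".join(" ".join(parser.chunks).split())


class PlainTextRichTextBlock:
    """Hypothetical block class used only for illustration."""

    def get_searchable_content(self, value):
        # Assumed shape: a list of plain-text strings handed to the index.
        return [strip_tags_standin(value)]


block = PlainTextRichTextBlock()
content = block.get_searchable_content(
    '<h2 class="title">Course fees</h2><p>Updated &amp; revised</p>'
)
print(content)
```

Joining text nodes with a space is a simplification: it keeps headings and paragraphs apart, but it would also split a word that has inline markup in the middle of it. Image alt text is dropped entirely here, which is the trade-off the footnote above describes.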
Steps to Reproduce
Search for 'h2' on https://www.rca.ac.uk/. Observe that the results include many pages that contain an `<h2>` element but not the text "h2"...
I have confirmed that this issue can be reproduced as described on a fresh Wagtail project: no
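The effect can be illustrated without a search backend. The sketch below is a toy model only: `naive_tokens` is a hypothetical, deliberately crude tokenizer, not the analyzer any real backend uses, and the regex tag removal is for the demo rather than a recommended way to strip HTML. It shows why raw markup can make a page match a query for "h2".

```python
import re


def naive_tokens(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


raw = "<h2>Postgraduate study</h2><p>Apply now</p>"

# Crude tag removal for this demo only; replaces each tag with a space.
stripped = re.sub(r"<[^>]+>", " ", raw)

raw_tokens = naive_tokens(raw)
clean_tokens = naive_tokens(stripped)

print("h2" in raw_tokens)    # tag name leaked into the indexed terms
print("h2" in clean_tokens)  # only the visible text remains
```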