RichTextField / RichTextBlock should strip out HTML tags from search-indexable content #6098

gasman · 2020-06-01T21:54:15Z

Issue Summary

Currently, when indexing RichTextFields, or RichTextBlocks within StreamFields, the raw content including HTML tags is passed to the index. Since HTML tag names / attributes are not meaningful searchable content*, it would be better to provide a get_searchable_content method on these that applies Django's striptags filter.

More importantly, if you customise the Elasticsearch backend (or run raw Elasticsearch queries) to take advantage of Elasticsearch's highlighting support (see #5340), you currently end up with either stray HTML tags or spurious formatting in the results, depending on whether you passed 'encoder': 'html' in the query.

(* This isn't totally true - for example, you could make a good case for indexing the alt text on images - so an ideal solution would probably involve being able to specify custom rules at the richtext feature level to generate a plain text representation. However, on balance, I think stripping tags out is better than keeping them in.)
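As a rough illustration of the proposal above, here is a minimal sketch of a get_searchable_content method that strips tags before indexing. It uses only the Python standard library as a stand-in for Django's striptags filter, so that it runs without a Django project. The names strip_html_tags and PlainTextRichTextBlock are hypothetical and are not Wagtail's actual classes. Returning a list of strings is likewise an assumption about the block-level convention.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html_tags(html):
    """Hypothetical helper: return the text of `html` with all tags removed.

    Stand-in for Django's striptags filter; text from adjacent elements is
    joined without any separator.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.chunks)


class PlainTextRichTextBlock:
    """Hypothetical stand-in for a rich text block (not Wagtail's own class)."""

    def get_searchable_content(self, value):
        # Pass only the plain text to the search index, never the raw HTML.
        return [strip_html_tags(value)]


block = PlainTextRichTextBlock()
indexed = block.get_searchable_content("<h2>Courses</h2>")
# indexed == ["Courses"], so a search for "h2" no longer matches this content
```

Because nothing is inserted between elements, words from neighbouring tags can run together. A real implementation might want to add whitespace at block boundaries, and could extend the extractor to keep image alt text as mentioned in the footnote.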

Steps to Reproduce

Search for 'h2' on https://www.rca.ac.uk/ . Observe that the results include many pages that contain an <h2> element but not the text "h2"...

I have confirmed that this issue can be reproduced as described on a fresh Wagtail project: no

acarasimon96 · 2020-06-01T23:13:22Z

Is it OK if I work on a PR that resolves this issue? I implemented the solution proposed above in my own project a long time ago.

gasman · 2020-06-01T23:16:02Z

@acarasimon96 Absolutely, yes please!

lb- · 2020-06-02T10:38:38Z

Resolved via #6099

gasman added type:Bug component:Search labels Jun 1, 2020

acarasimon96 mentioned this issue Jun 2, 2020

Strip away HTML tags from RichText searchable content #6099

Closed

lb- closed this as completed in 48511a7 Jun 2, 2020
