Suggestion: Normalize HTML/XML entities #4523

lionel-rowe · 2024-03-26T09:08:14Z

Is your feature request related to a problem? Please describe.

Often, when consuming HTML or XML files from external sources, it's desirable to normalize the entities. For example, I'm interacting with an API that produces XML where all non-ASCII characters are encoded as numbered entities, making non-Latin-script text completely unreadable. I want to debug and store these files in a format that's human-readable as well as machine-readable, while remaining valid UTF-8 XML.

Describe the solution you'd like

Currently, html/entities exports escape and unescape functions. I suggest exporting a third function (tentatively named normalize) that normalizes all entities in a string of HTML or XML to a form that's valid, interoperable, and (mostly) human-readable:

normalize('<p>&#x4e24;&#x53ea;&#x5c0f;&#x871c;&#x8702;</p>') // '<p>两只小蜜蜂</p>'
normalize('a&b') // 'a&amp;b'
normalize('&#62;&#x3e;&gt;') // '&gt;&gt;&gt;'
normalize('&apos;') // '&#39;'
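To make the intended behavior concrete, here is a minimal sketch of how such a function might work. It is not the html/entities implementation: the tiny `NAMED` table, the `decodeCodePoint` helper, and the choice to leave unknown or invalid entities untouched are all assumptions made for illustration.

```typescript
// Hypothetical sketch only; names and behavior are assumptions, not the std API.

// Deliberately tiny table of named entities (a real one would be far larger).
const NAMED = new Map<string, string>([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
]);

// Characters that must stay escaped, mapped to a single canonical form.
const CANONICAL = new Map<string, string>([
  ["&", "&amp;"],
  ["<", "&lt;"],
  [">", "&gt;"],
  ['"', "&quot;"],
  ["'", "&#39;"],
]);

// Matches a hex, decimal, or named entity, or else a bare ampersand.
const ENTITY_OR_AMP = /&(?:#x([0-9a-f]+)|#([0-9]+)|([a-z][a-z0-9]*));|&/gi;

// Returns the character for a code point, or undefined if it is not a valid scalar value.
function decodeCodePoint(cp: number): string | undefined {
  if (cp === 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) {
    return undefined;
  }
  return String.fromCodePoint(cp);
}

function normalize(input: string): string {
  return input.replace(
    ENTITY_OR_AMP,
    (match: string, hex?: string, dec?: string, name?: string): string => {
      // A bare ampersand is escaped.
      if (match === "&") return "&amp;";

      let char: string | undefined;
      if (hex !== undefined) char = decodeCodePoint(parseInt(hex, 16));
      else if (dec !== undefined) char = decodeCodePoint(parseInt(dec, 10));
      else if (name !== undefined) char = NAMED.get(name);

      // Assumption: unknown names and invalid code points are left as written.
      if (char === undefined) return match;

      // Special characters get their canonical escape; everything else is emitted literally.
      return CANONICAL.get(char) ?? char;
    },
  );
}
```

Markup characters that appear literally in the input (such as the `<p>` tags in the first example) are not touched; only entities and bare ampersands are rewritten.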

Describe alternatives you've considered

It might be worth having multiple normalized forms (which would likely also affect the API surface area of escape): for example, a "readability" form that converts &#x4e24;&#x53ea;&#x5c0f;&#x871c;&#x8702; to 两只小蜜蜂, versus a "compatibility" form that converts in the opposite direction. I don't currently have a use case for the "compatibility" form, as any XML-consuming APIs I need to interact with either default to UTF-8 or respect UTF-8 where specified, but it might be useful for users who need to interact with legacy or poorly designed APIs.
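As a rough illustration of the "compatibility" direction, the sketch below rewrites every non-ASCII character as a hexadecimal numeric entity. The function name `escapeNonAscii` is hypothetical, and the sketch assumes well-formed input (no lone surrogates); it does not touch existing entities or markup.

```typescript
// Hypothetical sketch of a "compatibility" form; not part of html/entities.
function escapeNonAscii(input: string): string {
  // The `u` flag makes the character class match whole code points,
  // so characters outside the BMP become a single entity.
  return input.replace(
    /[^\x00-\x7f]/gu,
    (ch: string): string => `&#x${ch.codePointAt(0)!.toString(16)};`,
  );
}
```

Because the output is pure ASCII, it survives transports that mishandle UTF-8, at the cost of the readability the main proposal is after.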


lionel-rowe added a commit to lionel-rowe/deno_std that referenced this issue Mar 26, 2024

feat(html): add function for HTML entities (denoland#4523)

38382f7

lionel-rowe added a commit to lionel-rowe/deno_std that referenced this issue Mar 26, 2024

feat(html): add normalize function for HTML entities (denoland#4523)

f726a8d

lionel-rowe mentioned this issue Mar 26, 2024

feat(html): add normalize function for HTML entities (#4523) #4524

Closed

iuioiua added the "feedback welcome" label Mar 27, 2024
