Suggestion: Normalize HTML/XML entities #4523

Open
lionel-rowe opened this issue Mar 26, 2024 · 0 comments
Labels
feedback welcome We want community's feedback on this issue or PR

Is your feature request related to a problem? Please describe.

Often, when consuming HTML or XML files from external sources, it's desirable to normalize the entities. For example, I'm interacting with an API that produces XML in which every non-ASCII character is encoded as a numeric character reference, making non-Latin-script text completely unreadable. I want to debug and store these files in a format that's human-readable as well as machine-readable, while remaining valid UTF-8 XML.

Describe the solution you'd like

Currently, html/entities exports escape and unescape functions. I suggest exporting a third function (tentatively named normalize) that normalizes all entities in a string of HTML or XML to a form that's valid, interoperable, and (mostly) human-readable:

normalize('<p>&#x4e24;&#x53ea;&#x5c0f;&#x871c;&#x8702;</p>') // '<p>两只小蜜蜂</p>'
normalize('a&b') // 'a&amp;b'
normalize('&#62;&#x3e;&gt;') // '&gt;&gt;&gt;'
normalize('&apos;') // '&#39;'
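
To make the intended behavior concrete, here is a minimal sketch of how such a function might work. It is not the actual html/entities implementation: the entity table covers only the five XML predefined entities (a real version would presumably need the full HTML named-entity list), and the choices to leave unknown entities, control characters, and invalid code points untouched are assumptions.

```ts
// Hypothetical sketch of the proposed `normalize`; names and policy are assumptions.

// Only the five XML predefined entities; a real implementation would need more.
const NAMED = new Map<string, string>([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["apos", "'"],
]);

// Characters that must stay escaped, mapped to their normalized spelling.
const ESCAPED = new Map<string, string>([
  ["&", "&amp;"], ["<", "&lt;"], [">", "&gt;"], ['"', "&quot;"], ["'", "&#39;"],
]);

// Matches a hex, decimal, or named entity, or else a bare ampersand.
const ENTITY_OR_AMP =
  /&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][a-zA-Z0-9]*));|&/g;

// Returns the character for a code point, or undefined if it should not be decoded.
function decodeCodePoint(cp: number): string | undefined {
  const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  // Assumption: control characters are left encoded to avoid emitting invalid XML.
  if (cp < 0x20 || cp > 0x10ffff || isSurrogate) return undefined;
  return String.fromCodePoint(cp);
}

function normalize(input: string): string {
  return input.replace(
    ENTITY_OR_AMP,
    (match: string, hex?: string, dec?: string, name?: string) => {
      let char: string | undefined;
      if (hex !== undefined) char = decodeCodePoint(parseInt(hex, 16));
      else if (dec !== undefined) char = decodeCodePoint(parseInt(dec, 10));
      else if (name !== undefined) char = NAMED.get(name);
      else return "&amp;"; // bare ampersand
      if (char === undefined) return match; // unknown or undecodable: leave as-is
      return ESCAPED.get(char) ?? char;
    },
  );
}
```

Literal `<` and `>` in the input are never matched by the pattern, so existing markup passes through unchanged; only entities and bare ampersands are rewritten.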

Describe alternatives you've considered

It might be worth having multiple normalized forms (which would likely also affect the API surface area of escape): for example, a "readability" form that converts &#x4e24;&#x53ea;&#x5c0f;&#x871c;&#x8702; to 两只小蜜蜂, and a "compatibility" form that converts in the opposite direction. I don't currently have a use case for the "compatibility" form, as every XML-consuming API I need to interact with either defaults to UTF-8 or respects UTF-8 where specified, but it might be useful for users who need to interact with legacy or poorly designed APIs.
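
As a rough illustration of the "compatibility" direction, the sketch below rewrites every non-ASCII character as a hexadecimal numeric entity. The function name `toAsciiEntities` is hypothetical and is not part of html/entities; the sketch also assumes well-formed input with no lone surrogates.

```ts
// Hypothetical helper for the "compatibility" form; not an existing API.
function toAsciiEntities(input: string): string {
  // The `u` flag makes the class match whole code points, including astral ones.
  return input.replace(
    /[^\x00-\x7f]/gu,
    (ch) => `&#x${ch.codePointAt(0)!.toString(16)};`,
  );
}

console.log(toAsciiEntities("<p>两只小蜜蜂</p>"));
```

ASCII text, including markup and existing entities, is left untouched, so the output stays valid regardless of the encoding the consumer assumes.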

lionel-rowe added a commit to lionel-rowe/deno_std that referenced this issue Mar 26, 2024
@iuioiua iuioiua added the feedback welcome We want community's feedback on this issue or PR label Mar 27, 2024