Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(html): add normalize function for HTML entities (#4523) #4524

Closed

Conversation

lionel-rowe
Copy link
Contributor

@lionel-rowe lionel-rowe commented Mar 26, 2024

Fixes #4523

Various bikeshedding things:

  • Given its dual use for XML, should the top-level dir have its name changed from html? If so, to what?
    • xml suffers the same issue but in reverse
    • html_and_xml is pretty gross
    • markup seems overly vague
    • sgml_like might be technically correct, but would the average web developer think to look there?
  • Is it worth exporting escapeAllCharsAsHex? Could be useful for users wanting finer-grained control than the 2 normalization forms, but it's a very simple function to replicate
  • Is the name NormalizationForm OK or is it too confusing with Unicode normalization forms (NFC, NFD, etc?)

@lionel-rowe lionel-rowe requested a review from kt3k as a code owner March 26, 2024 11:26
@github-actions github-actions bot added the html label Mar 26, 2024
function escapeXmlRestricted(str: string) {
return str.replaceAll(
// deno-lint-ignore no-control-regex
/[^\x09\x0a\x0d\x20-\x7e\x85\xa0-\ud7ff\ue000-\ufdcf\ufdf0-\ufffd\u{10000}-\u{1fffd}\u{20000}-\u{2fffd}\u{30000}-\u{3fffd}\u{40000}-\u{4fffd}\u{50000}-\u{5fffd}\u{60000}-\u{6fffd}\u{70000}-\u{7fffd}\u{80000}-\u{8fffd}\u{90000}-\u{9fffd}\u{a0000}-\u{afffd}\u{b0000}-\u{bfffd}\u{c0000}-\u{cfffd}\u{d0000}-\u{dfffd}\u{e0000}-\u{efffd}\u{f0000}-\u{ffffd}\u{100000}-\u{10fffd}]+/gu,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we start escaping these chars by default?

Also does HTML have the same restricted chars concept as XML?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon reflection, this PR needs some more thought. I'd initially thought a sort of "baseline-compatible with both HTML and XML" would be a sensible default, given that these characters are rare in practice yet could cause problems in XML. But it turns out that entities for certain C1 control-character codepoints have different semantics in HTML than XML — for example, €:

['application/xml', 'text/html'].map((contentType) => {
    const { literal, entity } = JSON.parse(
        new DOMParser().parseFromString(
            '<div>{"literal": "\x80", "entity": "&#x80;"}</div>',
            contentType,
        ).querySelector('div').textContent,
    )
    return { contentType, literal, entity }
})

// { "contentType": "application/xml", "literal": "\x80", "entity": "\x80" }
// { "contentType": "text/html", "literal": "\x80", "entity": "€" }

Also, I'm not sure converting those characters to entities is the right approach. Probably simply stripping them out would be more sensible in most cases.

@@ -34,15 +38,23 @@ const rawRe = new RegExp(`[${[...rawToEntity.keys()].join("")}]`, "g");
* // Characters that don't need to be escaped will be left alone,
* // even if named HTML entities exist for them.
* escape("þð"); // "þð"
* // You can force non-ASCII chars to be escaped by setting the `form` option to `compatibility`:
* escape("þð", { form: "compatibility" }); // "&#xfe;&#xf0;"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

form: "compatibility" sounds confusing to me as I don't see what it's compatible with.

*
* // specifying a `form` option (default is `readability`):
* normalize("两只小蜜蜂", { form: "readability" }); // "两只小蜜蜂"
* normalize("两只小蜜蜂", { form: "compatibility" }); // "&#x4e24;&#x53ea;&#x5c0f;&#x871c;&#x8702;"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

form: "compatibility" feels defeating the purpose of normalize to me. How about not having this option for now?

@lionel-rowe lionel-rowe marked this pull request as draft April 23, 2024 05:59
@iuioiua
Copy link
Collaborator

iuioiua commented May 6, 2024

@lionel-rowe, to not keep stale PRs open, are you happy for us to close this PR for now? You can re-open the existing PR or a new one once it is ready.

@lionel-rowe
Copy link
Contributor Author

@lionel-rowe, to not keep stale PRs open, are you happy for us to close this PR for now? You can re-open the existing PR or a new one once it is ready.

Sure, I'll close now.

@lionel-rowe lionel-rowe closed this May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suggestion: Normalize HTML/XML entities
3 participants