Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide UTS 46 mapping / disallow operations fused with normalization #2850

Closed
hsivonen opened this issue Nov 30, 2022 · 12 comments · Fixed by #4712
Closed

Provide UTS 46 mapping / disallow operations fused with normalization #2850

hsivonen opened this issue Nov 30, 2022 · 12 comments · Fixed by #4712
Labels
C-collator Component: Collation, normalization U-gecko User: Gecko

Comments

@hsivonen
Copy link
Member

Gecko implements most of its IDNA processing on its own, but it currently uses ICU4C's uidna_labelToUnicode for UTS 46 processing.

Code that wraps ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable and augments it to provide the functionality of ICU4C's uidna_labelToUnicode needs to go somewhere: Since it needs some data for ignored and disallowed that is in sync with ICU4X data, it would make sense to start a new ICU4X component for it. (We already reserve the name.)

@hsivonen hsivonen added the C-collator Component: Collation, normalization label Nov 30, 2022
@hsivonen
Copy link
Member Author

hsivonen commented Dec 1, 2022

Should we even provide transitional processing, since it looks like the transition period is finally ending?

@sffc
Copy link
Member

sffc commented Dec 1, 2022

What timing with the Mustafa email!

As discussed in #42, I prefer for IDNA to be its own crate that can be powered with ICU4X data. I think the unicode_bidi solution is an appropriate model to use here.

This is a concern largely driven by governance: ICU4X is already a large project, and I don't wish to take ownership of components for which we don't have sufficient knowledge to maintain. We accepted Collator and Normalizer because they are integral to ECMA-402; this required a fairly substantial effort between yourself and @echeran as co-owner, and future changes will require one of you as a reviewer.

@hsivonen
Copy link
Member Author

hsivonen commented Dec 1, 2022

After exploring the Rust crate space around this topic a tiny bit, I'm not particularly happy to find how much code is needed to match uidna_labelToUnicode given the existence of ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable. Still, ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable is a very odd primitive to provide without providing the data for 'disallowed' and 'ignored'.

How about this: We put a data struct carrying the disallowed & ignored data in a crate called icu_idna and then make the crate provide an iterator adapter that consumes an iterator over char, exposes an iterator over Result<char, IdnaError>, and in between the input and output performs the combination of the "Map" and "Normalize" steps of the "Processing" part in UTF 46?

This would put all the relevant data into ICU4X but would leave Punycode and the rest out of ICU4X for now. (ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable is designed for this and it would be really weird to have to get the disallowed & ignored data from outside ICU4X.)

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label Dec 1, 2022
@sffc
Copy link
Member

sffc commented Dec 1, 2022

@markusicu @echeran can you provide input here? What sensible primitive can ICU4X expose that enables IDNA to be implemented externally?

@hsivonen
Copy link
Member Author

hsivonen commented Dec 2, 2022

Note: My assumption is that UTS 46 section 4.1 Validity Criteria items 1 and 6 would be checked by applying the Map & Normalize primitive proposed above and checking the result for error or equality with input. This means that in Processing step 4 xn-- case, the Map & Normalize primitive would run again and int he non-xn-- case items 1 and 6 would already be OK by construction.

Please let me know if I've missed something and this wouldn't hold.

@hsivonen
Copy link
Member Author

hsivonen commented Dec 2, 2022

Oh, and the data needs to support flagging disallowed-if-UseSTD3ASCIIRules=true (and ComposingNormalizer::try_new_uts46_without_ignored_and_disallowed_unstable needs to reflect UseSTD3ASCIIRules=false).

@hsivonen
Copy link
Member Author

hsivonen commented Dec 2, 2022

We need a trie with two bits per scalar value. We could pack the bits for four scalar values in one u8 trie value by shifting the code point value right by two before lookup and using the two least-significant bits times two to decide how much to shift the trie value right before taking the two least significant bits as a four-item enum:

  1. pass through (default)
  2. ignore
  3. disallowed
  4. disallowed if STD3

@sffc
Copy link
Member

sffc commented Feb 16, 2023

Ping @markusicu @echeran

@hsivonen
Copy link
Member Author

Current notes:

  • The data contemplated earlier should probably go into icu_normalizer so that icu_normalizer can expose a coherent operation to an IDNA crate instead of exposing a weird normalization that requires an extra preprocessing step to make sense.
  • It turns out that in the WHATWG URL Standard, STD3 enforcement is only for checking conformance of content but irrelevant to non-conformance-checker URL consumers. (In particular, Gecko doesn't appear to ever enable STD3 enforcement.)
    • How much will it cost to carry about the STD3 data when it's not actually used?

@hsivonen
Copy link
Member Author

From merely looking at the data with knowledge of the relevant data structure, making everyone carry the STD3 disallowed info does not seem particularly nice, but I haven't actually measured, yet.

@hsivonen
Copy link
Member Author

Now that I've looked at this some more, it seems to me that It's a bad idea to check the input scalar values for being STD3-disallowed and instead of makes sense to check the output for STD3-prohibited ASCII if the STD3 check is in effect. Am I missing something?

@markusicu
Copy link
Member

About UseSTD3ASCIIRules:

  • Check for STD3-disallowed after mapping.
  • You don't need two versions of the data. Just use the non-STD3 data and then check for STD3-disallowed ASCII characters.
  • In an ASCII-only fastpath, you can fuse the mapping & checking, including for STD3-disallowed ASCII if UseSTD3ASCIIRules=true.

@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Mar 15, 2024
@sffc sffc added this to the Backlog ⟨P4⟩ milestone Mar 15, 2024
@hsivonen hsivonen changed the title Provide equivalent of uidna_labelToUnicode Provide UTS 46 mapping / disallow operations fused with normalization Mar 20, 2024
hsivonen added a commit to hsivonen/icu4x that referenced this issue Mar 20, 2024
 * Bake ignored/disallow data into the normalization data after all.
 * Make public operations available via a dedicated wrapper type
   instead of the main normalizer types.

Closes unicode-org#2850
hsivonen added a commit that referenced this issue May 20, 2024
* Bake ignored/disallow data into the normalization data after all.
* Make public operations available via a dedicated wrapper type instead
of the main normalizer types.

Closes #2850
hsivonen added a commit to hsivonen/icu4x that referenced this issue May 22, 2024
* Bake ignored/disallow data into the normalization data after all.
* Make public operations available via a dedicated wrapper type instead
of the main normalizer types.

Closes unicode-org#2850
sffc pushed a commit that referenced this issue May 23, 2024
* Bake ignored/disallow data into the normalization data after all.
* Make public operations available via a dedicated wrapper type instead
of the main normalizer types.

Closes #2850
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-collator Component: Collation, normalization U-gecko User: Gecko
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants