Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make UTS 46 normalization non-experimental #4712

Merged
merged 18 commits into from May 20, 2024
Merged

Conversation

hsivonen
Copy link
Member

  • Bake ignored/disallow data into the normalization data after all.
  • Make public operations available via a dedicated wrapper type instead of the main normalizer types.

Closes #2850

 * Bake ignored/disallow data into the normalization data after all.
 * Make public operations available via a dedicated wrapper type
   instead of the main normalizer types.

Closes unicode-org#2850
@hsivonen
Copy link
Member Author

hsivonen commented Apr 4, 2024

The ICU PR is unicode-org/icu#2945

@hsivonen hsivonen marked this pull request as ready for review April 4, 2024 12:37
@hsivonen
Copy link
Member Author

hsivonen commented Apr 4, 2024

@hsivonen hsivonen marked this pull request as draft April 4, 2024 12:44
@hsivonen hsivonen marked this pull request as ready for review April 4, 2024 19:48
@hsivonen
Copy link
Member Author

hsivonen commented Apr 4, 2024

It seems that the remainin CI failures are due to the ICU4C part not yet contributing to the exported data, so it seems OK to start reviewing this.

It doesn't make much sense to expose this API over FFI. If other languages want ICU4X-backed UTS 46, it makes sense to introduce FFI for https://github.com/hsivonen/rust-url/blob/icu4x/idna/src/uts46.rs .

@hsivonen hsivonen requested review from eggrobin and removed request for echeran April 15, 2024 11:52
@hsivonen
Copy link
Member Author

Changing review request to @eggrobin per discussion with @echeran .

@sffc
Copy link
Member

sffc commented May 3, 2024

@eggrobin says he is not the correct person to review this from the code side. Adding back @echeran for the code review.

@sffc sffc requested a review from echeran May 3, 2024 21:59
echeran
echeran previously approved these changes May 14, 2024
Copy link
Contributor

@echeran echeran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

tables,
None,
0xC0,
IgnorableBehavior::Unsupported,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why is the default value for behavior of handling an ignored character not to ignore it (IgnorableBehavior::Ignored)?

IgnorableBehavior::Ignored is what UTS 46 would do by default, right?

Or do you default to Unsupported because this is in the underlying Decomposition struct, which can be used independently of UTS 46, and thus would want to be kept separate by default unless otherwise specified?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying struct can be used for non-UTS46 purposes: the usual normalizations. For those, we expose contiguous-buffer APIs, and the contiguous-buffer versions currently can't deal with ignorables, and since there is currently no use case for them to deal with ignorables, I think it's better to have a state that they can debug_assert! against than to add support for ignorables is the contiguous-buffer entry points.

@echeran
Copy link
Contributor

echeran commented May 14, 2024

question + suggestion (optional): IIUC, at this point, UTS 46 is pretty much IDNA2008. Is that correct? If so, you could make a mention of that in the docs becuase UTS 46 covers not just IDNA2008 but also IDNA2003 and the transition rules between the two. If at this point, everyone has switched over to IDNA2008, including all major browsers, it would help the docs to add that extra specificity.

@hsivonen
Copy link
Member Author

question + suggestion (optional): IIUC, at this point, UTS 46 is pretty much IDNA2008. Is that correct? If so, you could make a mention of that in the docs becuase UTS 46 covers not just IDNA2008 but also IDNA2003 and the transition rules between the two. If at this point, everyone has switched over to IDNA2008, including all major browsers, it would help the docs to add that extra specificity.

No, UTS 46 non-transitional accepts a superset of what IDNA 2008 accepts and the major browsers use UTS 46 non-transitional specifically. (To use an example from the UTS 46 spec, switching to pure IDNA 2008 would break URLs like http://www.ÖBB.at/ .) Beyond the remark in the docs about the three major browser engines, I'd prefer to leave the characterization of the situation to the UTS 46 spec itself.

LGTM

Thanks! Since unicode-org/icu#2945 looks ready to land, I'll open another ICU4C PR with a backport to the maintenance branch. Then we can make datagen over on the ICU4X side pull the updated export before merging this.

@hsivonen
Copy link
Member Author

Opened #4905 about using data exported with the backport of the ICU4C patch.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving after merge and clippy fix. Replying on @echeran's approval of business logic.

@hsivonen hsivonen merged commit f5a650a into unicode-org:main May 20, 2024
30 checks passed
hsivonen added a commit to hsivonen/icu4x that referenced this pull request May 22, 2024
* Bake ignored/disallow data into the normalization data after all.
* Make public operations available via a dedicated wrapper type instead
of the main normalizer types.

Closes unicode-org#2850
sffc pushed a commit that referenced this pull request May 23, 2024
* Bake ignored/disallow data into the normalization data after all.
* Make public operations available via a dedicated wrapper type instead
of the main normalizer types.

Closes #2850
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide UTS 46 mapping / disallow operations fused with normalization
4 participants