Add `is_normalized_up_to` to Normalizer #4334

CanadaHonk · 2023-11-20T16:30:12Z

Closes #4256. Added UTF8 variant to FFI as is_normalized_up_to. No UTF16 tests or fuzzing yet.

CanadaHonk · 2023-11-20T16:43:13Z

Added more UTF8 tests.

CanadaHonk · 2023-11-20T17:14:45Z

Added UTF8 variant to FFI as is_normalized_up_to.

CanadaHonk · 2023-11-20T17:47:53Z

Added UTF16 variant to FFI coverage allowlist.

hsivonen

Looks great for the implementation of the &str case. Thanks! The docs need improvement.

The exact semantics of the &[u8] and &[u16] case still need thinking.

hsivonen · 2023-11-22T09:37:54Z

components/normalizer/src/lib.rs

@@ -1457,6 +1457,13 @@ macro_rules! normalizer_methods {
            ret
        }

+        /// Return the index a string slice is normalized up to


For docs, please, frame this explicitly as a building block of copy-on-write normalization. That the return value is the largest index at which the input can be split to head and tail such that the head can be copied to output and only tail needs to go through the normalization process.

It would be the clearest to provide the invariant-stating code from the issue (with stuff filled in so that it compiles as a doc test) as an explanation of what this is for.

components/normalizer/src/lib.rs

hsivonen · 2023-11-22T09:54:52Z

components/normalizer/tests/tests.rs

+    assert!(nfc.is_normalized_utf8_up_to(note_utf8) == note_utf8.len());
+    assert!(nfkc.is_normalized_utf8_up_to(note_utf8) == note_utf8.len());
+
+    let umlaut = "aäa";


Please include a test case where the input is "e\u{0302}\u{0323}" (i.e. ệ decomposed with the combining marks in the wrong order). Notably, "e\u{0302}X" and "e\u{0302}\u{0323}X" should give a different "up to" value in NFD despite the common prefix.

This test is still missing, right?

hsivonen · 2023-11-22T09:57:37Z

components/normalizer/tests/tests.rs

+    assert!(nfkd.is_normalized_utf8_up_to(fraction_utf8) == 1);
+    assert!(nfc.is_normalized_utf8_up_to(fraction_utf8) == 4);
+    assert!(nfkc.is_normalized_utf8_up_to(fraction_utf8) == 1);
+}


To test a lot of input, short of fuzzing, it would be useful to copypaste test_conformance() into a new test where instead of the test's innermost testable being normalize_to, you'd do normalize_up_to, spit_at, copy the head, and use normalize_to for the tail.

ffi/capi/src/normalizer.rs

sffc · 2023-11-30T16:31:27Z

@hsivonen - Are we relitigating the decision on well-formedness?
@sffc - I don't recall having an explicit decision about that.
@sffc - How disruptive would it be to roundtrip ill-formed text?
@hsivonen - Tricky in UTF-8, and I feel there isn't much demand for that. In UTF-16, probably more doable. We would want the internal type to be non-char; I'm not sure how disruptive that would be. One option would be to detect non-scalar-values before handing them to the algorithm. But what's the use case of sticking arbitrary binary data in a JavaScript string and then normalizing it as a Unicode string? Someone wrote the simplest possible spec language and then the web reflects ICU4C behavior's of not scrubbing unpaired surrogates. But for us it's simpler to scrub them. Also, when Gecko was switching the style system, we took the risk of not round-tripping unpaired surrages through the CSS object model, and it was fine. I'm skeptical about anyone wanting to both use normalization and do things that would be disruptive if unpaired surrogates were replaced by the replacement character.
@sffc - Ok, we don't have anyone clammoring right now for changing unpaired surrogate behavior, so let's not change anything right now, but keep the door open in case someone asks for it.
@hsivonen - Okay, so then to the topic in the issue. is_normalized says yes if the string contains unpaired surrogates, but that isn't copy-on-write if is_normalized_up_to stops at one.
@sffc - My initial reaction is we should just change is_normalized to have that same behavior (returns false if the string contains unpaired surrogates).
@Manishearth - I'm fine with handling ill-formed UTF-16 if someone asks us. On the web, WPT is the place I've encountered it. People write tests for it, but I don't know we see it in the real world. For normalize, we should do something, but for is_normalized, it makes sense to just return false and document it.
@hsivonen - Cool, so what should we name the error? I'm inclined to use () as the error.
@Manishearth - Ehh, I'm inclined to a named error, but if you're confident there's only going to be one error type... we've done it before. But generally we have named errors, one per crate.
@sffc - The standard library has a lot of these unit errors. For example, TryFromCharError. I think we should just do that and make it implement Display.
@robertbastian - Struct with no fields seems easier.

Conclusion:

Change is_normalized to return false on strings containing unpaired surrogates
Make is_normalized_up_to stop when it encounters an unpaired surrogate
Make a new unit struct error type in @hsivonen's crate that implements core::fmt::Display

LGTM: @hsivonen @sffc @Manishearth @robertbastian

sffc

@CanadaHonk Are you planning to follow up on this PR?

robertbastian · 2024-03-07T19:39:14Z

This should also land behind an experimental feature.

hsivonen · 2024-03-22T12:06:41Z

This should also land behind an experimental feature.

Why? I thought we resolved in the November 30 discussion minuted above that this can land unconditionally.

robertbastian · 2024-03-22T12:28:51Z

That's not clear to me from the minutes, and I don't remember details of this discussion. That said, "stable" in 1.5 doesn't mean much anyway, as we can make breaking changes in 2.0, so it's not that important.

robertbastian · 2024-04-23T09:54:18Z

tools/ffi_coverage/src/allowlist.rs

@@ -264,9 +264,11 @@ lazy_static::lazy_static! {
        "icu::normalizer::ComposingNormalizer::normalize_utf16",
        "icu::normalizer::ComposingNormalizer::normalize_utf16_to",
        "icu::normalizer::ComposingNormalizer::is_normalized_utf16",
+        "icu::normalizer::ComposingNormalizer::is_normalized_utf16_up_to",


please expose these, using &DiplomatStr16

Closes unicode-org#4256. Added UTF8 variant to FFI as `is_normalized_up_to`. No UTF16 tests or fuzzing yet.

CanadaHonk force-pushed the normalized-up-to branch from cdd4ffc to 553413e Compare November 20, 2023 16:42

CanadaHonk force-pushed the normalized-up-to branch 3 times, most recently from 16c2a4b to 8240b4c Compare November 20, 2023 17:14

CanadaHonk changed the title ~~Add is_normalized_up_tos~~ Add is_normalized_up_to to Normalizer Nov 20, 2023

CanadaHonk force-pushed the normalized-up-to branch 2 times, most recently from 1681ae9 to 289a27a Compare November 20, 2023 17:47

hsivonen requested changes Nov 22, 2023

View reviewed changes

hsivonen added the discuss-priority Discuss at the next ICU4X meeting label Nov 29, 2023

sffc removed the discuss-priority Discuss at the next ICU4X meeting label Nov 30, 2023

robertbastian added the waiting-on-author PRs waiting for action from the author for >7 days label Feb 8, 2024

robertbastian marked this pull request as ready for review February 28, 2024 15:02

robertbastian requested review from Manishearth, echeran and a team as code owners February 28, 2024 15:02

sffc reviewed Mar 5, 2024

View reviewed changes

robertbastian reviewed Apr 23, 2024

View reviewed changes

Add is_normalized_up_to to Normalizer

1665408

Closes unicode-org#4256. Added UTF8 variant to FFI as `is_normalized_up_to`. No UTF16 tests or fuzzing yet.

CanadaHonk force-pushed the normalized-up-to branch from f06461e to 1665408 Compare April 23, 2024 13:49

Manishearth removed their request for review April 23, 2024 16:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `is_normalized_up_to` to Normalizer #4334

Add `is_normalized_up_to` to Normalizer #4334

CanadaHonk commented Nov 20, 2023 •

edited

CanadaHonk commented Nov 20, 2023

CanadaHonk commented Nov 20, 2023

CanadaHonk commented Nov 20, 2023

hsivonen left a comment

hsivonen Nov 22, 2023

hsivonen Nov 22, 2023

hsivonen May 16, 2024

hsivonen Nov 22, 2023

sffc commented Nov 30, 2023

sffc left a comment

robertbastian commented Mar 7, 2024

hsivonen commented Mar 22, 2024

robertbastian commented Mar 22, 2024

robertbastian Apr 23, 2024

Add is_normalized_up_to to Normalizer #4334

Are you sure you want to change the base?

Add is_normalized_up_to to Normalizer #4334

Conversation

CanadaHonk commented Nov 20, 2023 • edited

CanadaHonk commented Nov 20, 2023

CanadaHonk commented Nov 20, 2023

CanadaHonk commented Nov 20, 2023

hsivonen left a comment

Choose a reason for hiding this comment

hsivonen Nov 22, 2023

Choose a reason for hiding this comment

hsivonen Nov 22, 2023

Choose a reason for hiding this comment

hsivonen May 16, 2024

Choose a reason for hiding this comment

hsivonen Nov 22, 2023

Choose a reason for hiding this comment

sffc commented Nov 30, 2023

sffc left a comment

Choose a reason for hiding this comment

robertbastian commented Mar 7, 2024

hsivonen commented Mar 22, 2024

robertbastian commented Mar 22, 2024

robertbastian Apr 23, 2024

Choose a reason for hiding this comment

Add `is_normalized_up_to` to Normalizer #4334

Add `is_normalized_up_to` to Normalizer #4334

CanadaHonk commented Nov 20, 2023 •

edited