Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add is_normalized_up_to to Normalizer #4334

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

CanadaHonk
Copy link
Contributor

@CanadaHonk CanadaHonk commented Nov 20, 2023

Closes #4256. Added UTF8 variant to FFI as is_normalized_up_to. No UTF16 tests or fuzzing yet.

@CanadaHonk
Copy link
Contributor Author

Added more UTF8 tests.

@CanadaHonk CanadaHonk force-pushed the normalized-up-to branch 3 times, most recently from 16c2a4b to 8240b4c Compare November 20, 2023 17:14
@CanadaHonk
Copy link
Contributor Author

Added UTF8 variant to FFI as is_normalized_up_to.

@CanadaHonk CanadaHonk changed the title Add is_normalized_up_tos Add is_normalized_up_to to Normalizer Nov 20, 2023
@CanadaHonk CanadaHonk force-pushed the normalized-up-to branch 2 times, most recently from 1681ae9 to 289a27a Compare November 20, 2023 17:47
@CanadaHonk
Copy link
Contributor Author

Added UTF16 variant to FFI coverage allowlist.

Copy link
Member

@hsivonen hsivonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great for the implementation of the &str case. Thanks! The docs need improvement.

The exact semantics of the &[u8] and &[u16] case still need thinking.

@@ -1457,6 +1457,13 @@ macro_rules! normalizer_methods {
ret
}

/// Return the index a string slice is normalized up to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For docs, please, frame this explicitly as a building block of copy-on-write normalization. That the return value is the largest index at which the input can be split to head and tail such that the head can be copied to output and only tail needs to go through the normalization process.

It would be the clearest to provide the invariant-stating code from the issue (with stuff filled in so that it compiles as a doc test) as an explanation of what this is for.

components/normalizer/src/lib.rs Outdated Show resolved Hide resolved
components/normalizer/src/lib.rs Show resolved Hide resolved
components/normalizer/src/lib.rs Show resolved Hide resolved
assert!(nfc.is_normalized_utf8_up_to(note_utf8) == note_utf8.len());
assert!(nfkc.is_normalized_utf8_up_to(note_utf8) == note_utf8.len());

let umlaut = "aäa";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a test case where the input is "e\u{0302}\u{0323}" (i.e. ệ decomposed with the combining marks in the wrong order). Notably, "e\u{0302}X" and "e\u{0302}\u{0323}X" should give a different "up to" value in NFD despite the common prefix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is still missing, right?

assert!(nfkd.is_normalized_utf8_up_to(fraction_utf8) == 1);
assert!(nfc.is_normalized_utf8_up_to(fraction_utf8) == 4);
assert!(nfkc.is_normalized_utf8_up_to(fraction_utf8) == 1);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To test a lot of input, short of fuzzing, it would be useful to copypaste test_conformance() into a new test where instead of the test's innermost testable being normalize_to, you'd do normalize_up_to, spit_at, copy the head, and use normalize_to for the tail.

ffi/capi/src/normalizer.rs Show resolved Hide resolved
@hsivonen hsivonen added the discuss-priority Discuss at the next ICU4X meeting label Nov 29, 2023
@sffc
Copy link
Member

sffc commented Nov 30, 2023

  • @hsivonen - Are we relitigating the decision on well-formedness?
  • @sffc - I don't recall having an explicit decision about that.
  • @sffc - How disruptive would it be to roundtrip ill-formed text?
  • @hsivonen - Tricky in UTF-8, and I feel there isn't much demand for that. In UTF-16, probably more doable. We would want the internal type to be non-char; I'm not sure how disruptive that would be. One option would be to detect non-scalar-values before handing them to the algorithm. But what's the use case of sticking arbitrary binary data in a JavaScript string and then normalizing it as a Unicode string? Someone wrote the simplest possible spec language and then the web reflects ICU4C behavior's of not scrubbing unpaired surrogates. But for us it's simpler to scrub them. Also, when Gecko was switching the style system, we took the risk of not round-tripping unpaired surrages through the CSS object model, and it was fine. I'm skeptical about anyone wanting to both use normalization and do things that would be disruptive if unpaired surrogates were replaced by the replacement character.
  • @sffc - Ok, we don't have anyone clammoring right now for changing unpaired surrogate behavior, so let's not change anything right now, but keep the door open in case someone asks for it.
  • @hsivonen - Okay, so then to the topic in the issue. is_normalized says yes if the string contains unpaired surrogates, but that isn't copy-on-write if is_normalized_up_to stops at one.
  • @sffc - My initial reaction is we should just change is_normalized to have that same behavior (returns false if the string contains unpaired surrogates).
  • @Manishearth - I'm fine with handling ill-formed UTF-16 if someone asks us. On the web, WPT is the place I've encountered it. People write tests for it, but I don't know we see it in the real world. For normalize, we should do something, but for is_normalized, it makes sense to just return false and document it.
  • @hsivonen - Cool, so what should we name the error? I'm inclined to use () as the error.
  • @Manishearth - Ehh, I'm inclined to a named error, but if you're confident there's only going to be one error type... we've done it before. But generally we have named errors, one per crate.
  • @sffc - The standard library has a lot of these unit errors. For example, TryFromCharError. I think we should just do that and make it implement Display.
  • @robertbastian - Struct with no fields seems easier.

Conclusion:

  • Change is_normalized to return false on strings containing unpaired surrogates
  • Make is_normalized_up_to stop when it encounters an unpaired surrogate
  • Make a new unit struct error type in @hsivonen's crate that implements core::fmt::Display

LGTM: @hsivonen @sffc @Manishearth @robertbastian

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Nov 30, 2023
@robertbastian robertbastian added the waiting-on-author PRs waiting for action from the author for >7 days label Feb 8, 2024
@robertbastian robertbastian marked this pull request as ready for review February 28, 2024 15:02
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CanadaHonk Are you planning to follow up on this PR?

@robertbastian
Copy link
Member

This should also land behind an experimental feature.

@hsivonen
Copy link
Member

This should also land behind an experimental feature.

Why? I thought we resolved in the November 30 discussion minuted above that this can land unconditionally.

@robertbastian
Copy link
Member

That's not clear to me from the minutes, and I don't remember details of this discussion. That said, "stable" in 1.5 doesn't mean much anyway, as we can make breaking changes in 2.0, so it's not that important.

@@ -264,9 +264,11 @@ lazy_static::lazy_static! {
"icu::normalizer::ComposingNormalizer::normalize_utf16",
"icu::normalizer::ComposingNormalizer::normalize_utf16_to",
"icu::normalizer::ComposingNormalizer::is_normalized_utf16",
"icu::normalizer::ComposingNormalizer::is_normalized_utf16_up_to",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please expose these, using &DiplomatStr16

Closes unicode-org#4256. Added UTF8 variant to FFI as `is_normalized_up_to`. No UTF16 tests or fuzzing yet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-on-author PRs waiting for action from the author for >7 days
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provide methods for finding the normalized prefix of input
4 participants