New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add is_normalized_up_to
to Normalizer
#4334
base: main
Are you sure you want to change the base?
Conversation
cdd4ffc
to
553413e
Compare
Added more UTF8 tests. |
16c2a4b
to
8240b4c
Compare
Added UTF8 variant to FFI as |
is_normalized_up_to
sis_normalized_up_to
to Normalizer
1681ae9
to
289a27a
Compare
Added UTF16 variant to FFI coverage allowlist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great for the implementation of the &str
case. Thanks! The docs need improvement.
The exact semantics of the &[u8]
and &[u16]
case still need thinking.
components/normalizer/src/lib.rs
Outdated
@@ -1457,6 +1457,13 @@ macro_rules! normalizer_methods { | |||
ret | |||
} | |||
|
|||
/// Return the index a string slice is normalized up to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For docs, please, frame this explicitly as a building block of copy-on-write normalization. That the return value is the largest index at which the input can be split to head and tail such that the head can be copied to output and only tail needs to go through the normalization process.
It would be the clearest to provide the invariant-stating code from the issue (with stuff filled in so that it compiles as a doc test) as an explanation of what this is for.
assert!(nfc.is_normalized_utf8_up_to(note_utf8) == note_utf8.len()); | ||
assert!(nfkc.is_normalized_utf8_up_to(note_utf8) == note_utf8.len()); | ||
|
||
let umlaut = "aäa"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include a test case where the input is "e\u{0302}\u{0323}" (i.e. ệ decomposed with the combining marks in the wrong order). Notably, "e\u{0302}X" and "e\u{0302}\u{0323}X" should give a different "up to" value in NFD despite the common prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is still missing, right?
assert!(nfkd.is_normalized_utf8_up_to(fraction_utf8) == 1); | ||
assert!(nfc.is_normalized_utf8_up_to(fraction_utf8) == 4); | ||
assert!(nfkc.is_normalized_utf8_up_to(fraction_utf8) == 1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To test a lot of input, short of fuzzing, it would be useful to copypaste test_conformance()
into a new test where instead of the test's innermost testable being normalize_to
, you'd do normalize_up_to
, spit_at
, copy the head, and use normalize_to
for the tail.
Conclusion:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CanadaHonk Are you planning to follow up on this PR?
This should also land behind an experimental feature. |
Why? I thought we resolved in the November 30 discussion minuted above that this can land unconditionally. |
That's not clear to me from the minutes, and I don't remember details of this discussion. That said, "stable" in 1.5 doesn't mean much anyway, as we can make breaking changes in 2.0, so it's not that important. |
@@ -264,9 +264,11 @@ lazy_static::lazy_static! { | |||
"icu::normalizer::ComposingNormalizer::normalize_utf16", | |||
"icu::normalizer::ComposingNormalizer::normalize_utf16_to", | |||
"icu::normalizer::ComposingNormalizer::is_normalized_utf16", | |||
"icu::normalizer::ComposingNormalizer::is_normalized_utf16_up_to", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please expose these, using &DiplomatStr16
Closes unicode-org#4256. Added UTF8 variant to FFI as `is_normalized_up_to`. No UTF16 tests or fuzzing yet.
f06461e
to
1665408
Compare
Closes #4256. Added UTF8 variant to FFI as
is_normalized_up_to
. No UTF16 tests or fuzzing yet.