Provide methods for finding the normalized prefix of input #4256

Open
hsivonen opened this issue Nov 7, 2023 · 1 comment · May be fixed by #4334
Labels
C-collator Component: Collation, normalization U-gecko User: Gecko

Comments


hsivonen commented Nov 7, 2023

ComposingNormalizer and DecomposingNormalizer currently provide the methods is_normalized(), is_normalized_utf8(), and is_normalized_utf16(). If the return value is false and the application then decides to normalize, the normalization-related data structure lookups for the already-normalized prefix (if there is one) are performed twice.

ComposingNormalizer and DecomposingNormalizer should provide methods is_normalized_up_to(), is_normalized_utf8_up_to(), and is_normalized_utf16_up_to() that return a usize: the largest value (no larger than the length of the input) for which the following assert passes:

fn test_is_normalized_up_to(input: &str) {
  // set up normalizer `norm`
  let up_to = norm.is_normalized_up_to(input);
  let (head, tail) = input.split_at(up_to);
  let mut normalized = String::from(head);
  let _ = norm.normalize_to(tail, &mut normalized);
  assert!(norm.is_normalized(&normalized));
}

Then this should become a valid alternative implementation of is_normalized():

pub fn is_normalized(&self, text: &str) -> bool {
    self.is_normalized_up_to(text) == text.len()
}

Gecko use case: https://searchfox.org/mozilla-central/rev/e94bcd536a2a4caad0597d1b2d624342e6a389c4/intl/components/src/String.h#132

(Note that ICU4X deliberately doesn't implement quick check, which Gecko currently uses for the prefix computation.)
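To illustrate how a caller could use the proposed method, here is a minimal sketch. It does not use ICU4X: `ToyDecomposer` is a hypothetical stand-in whose "normal form" only decomposes U+00E9 (é) into `e` followed by U+0301, and its `normalize()` helper is likewise an assumption made for the example. The point is the calling pattern: find the already-normalized prefix once, copy it verbatim, and run normalization only on the tail.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a real normalizer (NOT the ICU4X API).
/// Its toy "normal form" decomposes only U+00E9 into "e" + U+0301.
struct ToyDecomposer;

impl ToyDecomposer {
    /// Byte length of the longest prefix already in the toy normal form.
    /// `str::find` returns a char-boundary index, so it is safe for `split_at`.
    fn is_normalized_up_to(&self, text: &str) -> usize {
        text.find('\u{00E9}').unwrap_or(text.len())
    }

    /// `is_normalized()` expressed via the prefix method, as in the issue.
    fn is_normalized(&self, text: &str) -> bool {
        self.is_normalized_up_to(text) == text.len()
    }

    /// Appends the toy-normalized form of `text` to `sink`.
    fn normalize_to(&self, text: &str, sink: &mut String) {
        for c in text.chars() {
            if c == '\u{00E9}' {
                sink.push_str("e\u{0301}");
            } else {
                sink.push(c);
            }
        }
    }

    /// Caller-side pattern: borrow when the whole input is already
    /// normalized; otherwise copy the head and normalize only the tail.
    fn normalize<'a>(&self, text: &'a str) -> Cow<'a, str> {
        let up_to = self.is_normalized_up_to(text);
        if up_to == text.len() {
            return Cow::Borrowed(text);
        }
        let (head, tail) = text.split_at(up_to);
        let mut out = String::with_capacity(text.len());
        out.push_str(head);
        self.normalize_to(tail, &mut out);
        Cow::Owned(out)
    }
}
```

With this shape, already-normalized input costs one pass and no allocation, and input with a normalized prefix is not looked up twice for that prefix. A real normalizer would additionally have to choose the split point so that characters at the end of the head cannot interact with the start of the tail, which the toy form sidesteps.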

@hsivonen hsivonen added C-collator Component: Collation, normalization U-gecko User: Gecko labels Nov 7, 2023

hsivonen commented Nov 7, 2023

CC @CanadaHonk

CanadaHonk added a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023
Closes unicode-org#4256. No UTF16 tests or fuzzing yet.
@CanadaHonk CanadaHonk linked a pull request Nov 20, 2023 that will close this issue
CanadaHonk added a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023
Closes unicode-org#4256. No UTF16 tests or fuzzing yet. Also added UTF8 variant to FFI as `is_normalized_up_to`.
CanadaHonk added a commit to CanadaHonk/icu4x that referenced this issue Nov 20, 2023
Closes unicode-org#4256. Added UTF8 variant to FFI as `is_normalized_up_to`. No UTF16 tests or fuzzing yet.
CanadaHonk added a commit to CanadaHonk/icu4x that referenced this issue Apr 23, 2024
Closes unicode-org#4256. Added UTF8 variant to FFI as `is_normalized_up_to`. No UTF16 tests or fuzzing yet.