Add example for DecomposingNormalizer source cursor #4900

Open · wants to merge 2 commits into main

Conversation

@sffc (Member) commented May 14, 2024

normalize_iter gives us the ability to keep track of the source string indices while generating the output string. I wrote the example using a RefCell, since the DecomposingNormalizer takes ownership of the source iterator. There may be a way to avoid the RefCell by adding a function to the library.
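(Not part of the PR: a minimal sketch of the RefCell-based cursor tracking described above, assuming icu_normalizer 1.x with compiled data; the names and output format are mine, not those of the doc example under review.)

```rust
use std::cell::RefCell;

use icu_normalizer::DecomposingNormalizer;

fn main() {
    let source = "Šalom";
    // Byte offset of the most recently consumed source character. A RefCell is
    // needed because normalize_iter takes ownership of the char iterator, yet
    // we still want to observe the cursor from outside of it.
    let cursor = RefCell::new(0usize);
    let chars = source.char_indices().map(|(i, c)| {
        *cursor.borrow_mut() = i;
        c
    });

    let nfd = DecomposingNormalizer::new_nfd();
    let mut decomposed = nfd.normalize_iter(chars);
    while let Some(c) = decomposed.next() {
        // Caveat (see the review below): the normalizer may consume more than
        // one character ahead, so this offset is only an upper bound on the
        // position of the source character that produced `c`.
        println!("{c:?} emitted; source cursor at byte {}", *cursor.borrow());
    }
}
```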

A Contributor commented on this part of the new doc example:

```rust
/// };
///
/// assert_eq!(get_next(), ('S', 'Š', 0));
/// assert_eq!(get_next(), ('\u{30C}', 'Š', 0));
```

question: Is the reason the offset only ever jumps by at most 2 in this example that all of the characters in the original input string are precomposed forms in the range U+0080 <= ch < U+0800? If so, then (optional) it might be interesting to append to the input string something in the upper half of the BMP, and maybe something beyond the BMP.

@sffc (Member, Author) replied:

Interesting, I can add non-BMP code points to this example.

I was also hoping you could shed some light on this behavior. Is it always guaranteed that the iterator peeks one code point ahead, as stated in this PR?

@hsivonen (Member) left a comment:

This assumes lookahead of at most one, but the normalizer can do unbounded lookahead, since the number of reorderable combining characters is potentially unbounded (since nothing guarantees that the input has the "stream-safe" property from UAX 15).
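(Not from the PR: a small sketch of the reordering that forces this buffering, assuming icu_normalizer 1.x with compiled data.)

```rust
use icu_normalizer::DecomposingNormalizer;

fn main() {
    let nfd = DecomposingNormalizer::new_nfd();

    // U+0323 COMBINING DOT BELOW (ccc = 220) sorts before U+0301 COMBINING ACUTE
    // ACCENT (ccc = 230), so canonical ordering swaps the two marks.
    assert_eq!(nfd.normalize("a\u{0301}\u{0323}"), "a\u{0323}\u{0301}");

    // Nothing bounds the length of such a run in arbitrary (non-stream-safe)
    // input, so the normalizer may need to buffer arbitrarily many characters
    // before it can emit anything after the base character.
    let long_run: String = core::iter::once('a')
        .chain(core::iter::repeat('\u{0301}').take(100))
        .chain(core::iter::once('\u{0323}'))
        .collect();
    assert!(nfd.normalize(&long_run).ends_with('\u{0301}'));
}
```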

A Member left a review comment on this part of the diff:

```
@@ -1864,6 +1864,92 @@ impl DecomposingNormalizer {

/// Wraps a delegate iterator into a decomposing iterator
/// adapter by using the data already held by this normalizer.
///
/// The [`Decomposition`] iterator will peek exactly one character
/// ahead of the character being decomposed, allowing the caller
```

If the character being decomposed is followed by characters whose canonical combining class is not zero, the normalizer will buffer up all of those in order to be able to reorder them in case they aren't already in the right order.

@hsivonen (Member) commented:

It would be good to have a description of the use case to see if a) the use case can be addressed at all and b) how to best address it.

To the extent the purpose is to correlate pieces of the input &str with pieces of the output &str, it's probably useful to rely on the same implementation detail that IsNormalizedSinkStr makes use of: when the normalizer passes a &str to Write, it is guaranteed to be a passthrough that can be correlated back to the input slice by looking at the pointer in the slice. When the normalizer passes a char, it may be either a passthrough or a non-passthrough, but every time there is a &str, that &str can be used to resynchronize char-passthrough tracking after a non-passthrough char has caused a divergence.
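(Not from the PR: a sketch of that &str-correlation idea, assuming icu_normalizer 1.x with compiled data; TrackingSink and its fields are made-up names, and the pointer check leans on the implementation detail described above.)

```rust
use core::fmt::{self, Write};

use icu_normalizer::DecomposingNormalizer;

/// Hypothetical sink that records where each &str passthrough chunk of the
/// normalized output came from in the input.
struct TrackingSink<'a> {
    input: &'a str,
    output: String,
    /// (output_byte_start, input_byte_start, byte_len) per passthrough chunk.
    runs: Vec<(usize, usize, usize)>,
}

impl Write for TrackingSink<'_> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        // Assumption from the comment above: a &str chunk is a slice borrowed
        // from the input, so pointer arithmetic recovers its input offset.
        // Guard the arithmetic in case that assumption does not hold.
        let offset = (s.as_ptr() as usize).wrapping_sub(self.input.as_ptr() as usize);
        if offset <= self.input.len() && offset + s.len() <= self.input.len() {
            self.runs.push((self.output.len(), offset, s.len()));
        }
        self.output.push_str(s);
        Ok(())
    }

    fn write_char(&mut self, c: char) -> fmt::Result {
        // A char may or may not be a passthrough; tracking diverges here until
        // the next &str chunk resynchronizes it.
        self.output.push(c);
        Ok(())
    }
}

fn main() -> fmt::Result {
    let input = "Švejk";
    let nfd = DecomposingNormalizer::new_nfd();
    let mut sink = TrackingSink {
        input,
        output: String::new(),
        runs: Vec::new(),
    };
    nfd.normalize_to(input, &mut sink)?;
    println!("output = {:?}, passthrough runs = {:?}", sink.output, sink.runs);
    Ok(())
}
```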

@sffc (Member, Author) commented May 20, 2024

Thanks; I thought the invariant upon which my code was based maybe wasn't right, but I couldn't identify or articulate how. Reordering characters makes total sense.

The use case is mapping characters between the input and output strings in a machine learning setting. My understanding is that it is desirable to identify the ranges of source text that the model used to make its inferences. CC @j-luo93, who can maybe share more.
