Space synthesis breaks Mongolian shaping in cascade through subsetted Noto Sans Mongolian despite unicode-range #4503

Open
drott opened this issue Nov 22, 2023 · 9 comments
Labels
Chrome Chrome/Chromium project related issues and requests

Comments

@drott
Collaborator

drott commented Nov 22, 2023

See details in https://crbug.com/1499787

When shaping --unicodes 180e,1821,202f,1836,1822 with Noto Sans Mongolian subsetted to Latin, the U+202F space is synthesized from the U+0020 space in the ASCII range. That then breaks shaping further down the line, when shaping with the Mongolian subset.

Space synthesis is generally useful and we don't want to simply switch it off, but it would help if HarfBuzz could be told to stay within a specified unicode-range, or could call back to the client to ask "can_synthesize_space_for?" with the codepoint from the input buffer for which a space is about to be synthesized.

Your thoughts are welcome on this issue.

@drott drott added the Chrome label (Chrome/Chromium project related issues and requests) on Nov 22, 2023
@behdad
Member

behdad commented Nov 22, 2023

We already do that, don't we?

(c->font->get_nominal_glyph (0x0020, &space_glyph) || (space_glyph = buffer->invisible)))
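
For readers following along, the quoted condition can be modelled outside HarfBuzz roughly as below. This is a simplified sketch, not HarfBuzz code: `FontSubset` and `pick_space_glyph` are hypothetical names, the cmap is a plain map, and `buffer->invisible` is represented by an ordinary parameter where 0 is assumed to mean "unset".

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for one subsetted face: codepoint -> glyph id.
struct FontSubset {
  std::map<uint32_t, uint32_t> cmap;

  // Same shape as the lookup in the quoted line: returns true and writes
  // the glyph id when the codepoint is mapped.
  bool get_nominal_glyph(uint32_t cp, uint32_t* glyph) const {
    auto it = cmap.find(cp);
    if (it == cmap.end()) return false;
    *glyph = it->second;
    return true;
  }
};

// Model of the quoted condition: use the glyph for U+0020 if the face has
// one, otherwise fall back to the invisible glyph (0 = unset in this sketch).
// Note that the codepoint being replaced never enters the decision.
bool pick_space_glyph(const FontSubset& font, uint32_t invisible,
                      uint32_t* space_glyph) {
  return font.get_nominal_glyph(0x0020, space_glyph) ||
         (*space_glyph = invisible) != 0;
}
```

In this model any face that maps U+0020 reports success regardless of which space character is being replaced, which is why a Latin subset can end up supplying the glyph for U+202F.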

@drott
Collaborator Author

drott commented Nov 22, 2023

If I read this code right, it asks for the U+0020 space glyph through the glyph lookup callback. That succeeds when we try the first of the Noto Sans Mongolian subsets (counting from the bottom of the CSS). But what we would need is a callback the other way round: ask whether it is okay to synthesize a space for U+202F from the subset with unicode-range U+0000-00FF. The synthesized space breaks the run: that glyph is shaped from the Latin subset, so the parts that are shaped later no longer form a connection with the U+202F.

@behdad
Member

behdad commented Nov 22, 2023

How would your callback know what to answer?

Ways I see around this are:

  • Special-case NARROW NO-BREAK SPACE, since it has semantic meaning in Mongolian encoding. Always, or if script is Mongolian.
  • Somehow merge NNBSP into the previous cluster.
  • In Chrome, disable space synthesis completely at first; only if a space character could not be shaped, shape it again with space synthesis enabled.
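
The third option could look roughly like the following on the client side. This is only a sketch under stated assumptions: `Face`, `Choice`, and `choose_face_for_space` are hypothetical names, and the font cascade is reduced to a list of codepoint sets rather than real shaping calls.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry in the font cascade.
struct Face {
  std::string name;
  std::set<uint32_t> codepoints;  // characters that have a real glyph
  bool has(uint32_t cp) const { return codepoints.count(cp) != 0; }
};

struct Choice {
  std::string face;  // empty if no face could be used
  bool synthesized;  // true if the glyph was borrowed from U+0020
};

// Pass 1: synthesis disabled, so only a face with a real glyph matches.
// Pass 2: reached only if pass 1 found nothing; accept a face with U+0020.
Choice choose_face_for_space(const std::vector<Face>& cascade,
                             uint32_t space_cp) {
  for (const auto& f : cascade)
    if (f.has(space_cp)) return {f.name, false};
  for (const auto& f : cascade)
    if (f.has(0x0020)) return {f.name, true};
  return {"", false};
}
```

With a Latin subset listed before a Mongolian subset that contains U+202F, the first pass picks the Mongolian subset, so the space stays in the same face as its neighbours; synthesis remains available for space characters that no face in the cascade supports.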

@drott
Collaborator Author

drott commented Nov 22, 2023

> How would your callback know what to answer?

The callback would look at the unicode-range of the current subset and allow synthesis only if U+202F is within it. So it would in practice disable synthesis for U+202F while the Latin subset is in use. Synthesis in a way violates unicode-range, as the font is used for a codepoint that may be outside its unicode-range. Similar to how we restrict the results of get_nominal_glyph, we would allow synthesis only for what is within the range.
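
As an illustration of the check described above, a minimal sketch follows. It is not an existing HarfBuzz or Chromium API: `UnicodeRange`, `SubsetInfo`, and the free function `can_synthesize_space_for` are hypothetical, and the ranges a client would pass in are whatever its CSS declares for each subset.

```cpp
#include <cstdint>
#include <vector>

// One CSS unicode-range interval, inclusive on both ends.
struct UnicodeRange {
  uint32_t first;
  uint32_t last;
};

// Hypothetical client-side description of one @font-face subset.
struct SubsetInfo {
  std::vector<UnicodeRange> ranges;

  bool covers(uint32_t cp) const {
    for (const auto& r : ranges)
      if (cp >= r.first && cp <= r.last) return true;
    return false;
  }
};

// Hypothetical callback body: permit synthesizing a space for `cp` only
// when the subset's unicode-range claims that codepoint.
bool can_synthesize_space_for(const SubsetInfo& subset, uint32_t cp) {
  return subset.covers(cp);
}
```

For a subset declared as U+0000-00FF the answer for U+202F would be false, while U+00A0 would still be allowed, so synthesis stays available for spaces inside the declared range.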

> Special-case NARROW NO-BREAK SPACE, since it has semantic meaning in Mongolian encoding. Always, or if script is Mongolian.

Do you mean on the HarfBuzz side in shaping or clustering?

@behdad
Member

behdad commented Nov 22, 2023

> > How would your callback know what to answer?

> The callback would look at the unicode-range of the current subset and allow synthesis only if U+202F is within it. So it would in practice disable synthesis for U+202F while the Latin subset is in use. Synthesis in a way violates unicode-range, as the font is used for a codepoint that may be outside its unicode-range. Similar to how we restrict the results of get_nominal_glyph, we would allow synthesis only for what is within the range.

This is already a "problem" because of the composition/decomposition we do. So you might get a letter shaped that is outside of the unicode-range.

I want to avoid adding a new callback if we can find another way.

> > Special-case NARROW NO-BREAK SPACE, since it has semantic meaning in Mongolian encoding. Always, or if script is Mongolian.

> Do you mean on the HarfBuzz side in shaping or clustering?

I meant on the HarfBuzz side.

@drott
Collaborator Author

drott commented Nov 22, 2023

Works for me if this can be addressed inside of HarfBuzz. Agree that composition/decomposition blurs those lines, too.

@drott drott changed the title from "Consider providing option for restricting space synthesis to stay within unicode-range" to "Space synthesis breaks Mongolian shaping in cascade through subsetted Noto Sans Mongolian despite unicode-range" on Nov 22, 2023
@behdad
Member

behdad commented Nov 22, 2023

@jfkthame WDYT about never replacing NNBSP (U+202F)?

@jfkthame
Collaborator

> @jfkthame WDYT about never replacing NNBSP (U+202F)?

I'd be hesitant to do that -- as long as we have a general behavior of synthesizing fallbacks for known Unicode "space" characters that aren't supported by the chosen font, we should do our best to support all of them.

Making this a special-case exception for U+202F in Mongolian script would be OK, I guess. But really the caller should be choosing the appropriate font before calling the shaper.

@behdad
Member

behdad commented Nov 22, 2023

> But really the caller should be choosing the appropriate font before calling the shaper.

That's a chicken & egg issue because of the normalization step we do in HB; hence the shaper-driven approach Chrome takes.
