Space synthesis breaks Mongolian shaping in cascade through subsetted Noto Sans Mongolian despite unicode-range #4503

drott · 2023-11-22T10:18:41Z

See details in https://crbug.com/1499787

When shaping `--unicodes 180e,1821,202f,1836,1822` with the Latin subset of Noto Sans Mongolian, the U+202F space is synthesized from the U+0020 space glyph in the ASCII range. That breaks shaping further down the cascade, when the rest of the text is shaped with the Mongolian subset.

Space synthesis is generally useful and we don't want to simply switch it off. It would help if HarfBuzz could be told to stay within a specified unicode-range, or could call back to the client to ask "can_synthesize_space_for?" with the codepoint from the input buffer for which a space is about to be synthesized.

Your thoughts are welcome on this issue.


behdad · 2023-11-22T11:21:20Z

We already do that, don't we?

harfbuzz/src/hb-ot-shape-normalize.cc, line 196 in 258f2a2:

    (c->font->get_nominal_glyph (0x0020, &space_glyph) || (space_glyph = buffer->invisible)))

drott · 2023-11-22T13:36:15Z

If I read this code right, it asks for the U+0020 space glyph through the glyph lookup callback. That succeeds when we try the first of the Noto Sans Mongolian subsets (the first from the bottom of the CSS). But what we would need is a callback the other way round: one that asks whether it is okay to synthesize U+202F from a font restricted to the unicode-range U+0000-00FF. The synthesized space breaks the run: the U+202F glyph is completed from the Latin subset, so when the remaining unshaped parts are shaped later, they no longer connect with the U+202F.
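To make the failure mode concrete, here is a minimal sketch of the cascade described above. It is a toy model, not HarfBuzz or Chrome code: `Range`, `Subset`, `shapes_with_synthesis` and `assign_subsets` are hypothetical names, and the unicode-range values for the two subsets are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model types for illustration; not HarfBuzz or Chrome API.
struct Range { char32_t first, last; };

struct Subset {
  std::string name;
  std::vector<Range> unicode_range;  // assumed CSS unicode-range of this subset
  bool has_space_glyph;              // true if the subset holds a glyph for U+0020

  bool in_unicode_range(char32_t cp) const {
    for (const Range& r : unicode_range)
      if (cp >= r.first && cp <= r.last) return true;
    return false;
  }
};

// Simplified stand-in for "the shaper produced a glyph": either the codepoint is in
// the subset's range, or it is U+202F and a U+0020 glyph can be substituted for it.
bool shapes_with_synthesis(const Subset& s, char32_t cp) {
  if (s.in_unicode_range(cp)) return true;
  return cp == 0x202F && s.has_space_glyph;
}

// Try the subsets in cascade order; the first one that reports success owns the codepoint.
std::vector<std::string> assign_subsets(const std::vector<Subset>& cascade,
                                        const std::u32string& text) {
  std::vector<std::string> owner(text.size(), "none");
  for (std::size_t i = 0; i < text.size(); ++i) {
    for (const Subset& s : cascade) {
      if (shapes_with_synthesis(s, text[i])) {
        owner[i] = s.name;
        break;
      }
    }
  }
  return owner;
}
```

With a Latin subset (U+0000-00FF, has a space glyph) tried before a Mongolian subset (assumed here to cover U+1800-18AF and U+202F), the input 180e,1821,202f,1836,1822 assigns U+202F to the Latin subset even though the Mongolian subset lists it, which splits the Mongolian run in two.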

behdad · 2023-11-22T13:40:41Z

How would your callback know what to answer?

Ways I see around this are:

- Special-case NARROW NO-BREAK SPACE, since it has semantic meaning in Mongolian encoding. Always, or if script is Mongolian.
- Somehow cluster NNBSP with the previous cluster.
- In Chrome, completely disable space synthesis, and only if a space character could not be shaped, shape it again with space synthesis enabled.
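The third option could be sketched roughly as a two-pass walk over the cascade. This is a toy model under stated assumptions, not Chrome's or HarfBuzz's implementation: `Range`, `Subset` and `assign_two_pass` are hypothetical names, only U+202F is treated as a synthesizable space, and "shaping" is reduced to a unicode-range check.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model types for illustration; not HarfBuzz or Chrome API.
struct Range { char32_t first, last; };

struct Subset {
  std::string name;
  std::vector<Range> unicode_range;  // assumed CSS unicode-range of this subset
  bool has_space_glyph;              // true if the subset holds a glyph for U+0020

  bool in_unicode_range(char32_t cp) const {
    for (const Range& r : unicode_range)
      if (cp >= r.first && cp <= r.last) return true;
    return false;
  }
};

std::vector<std::string> assign_two_pass(const std::vector<Subset>& cascade,
                                         const std::u32string& text) {
  std::vector<std::string> owner(text.size(), "none");

  // Pass 1: space synthesis disabled, so only real coverage counts.
  for (std::size_t i = 0; i < text.size(); ++i) {
    for (const Subset& s : cascade) {
      if (s.in_unicode_range(text[i])) {
        owner[i] = s.name;
        break;
      }
    }
  }

  // Pass 2: only for a space that no subset claimed, allow synthesis from U+0020.
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (owner[i] != "none" || text[i] != 0x202F) continue;
    for (const Subset& s : cascade) {
      if (s.has_space_glyph) {
        owner[i] = s.name + " (synthesized)";
        break;
      }
    }
  }
  return owner;
}
```

The design choice here is that synthesis becomes a last resort: a subset that genuinely covers U+202F always wins over one that could only fake it, at the cost of a second pass for spaces that nothing covers.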

drott · 2023-11-22T14:02:43Z

> How would your callback know what to answer?

The callback would look at the unicode-range of the current subset and allow synthesis only if U+202F is within it. So it would effectively disable synthesis of U+202F while the Latin subset is in use. In a way, synthesis violates unicode-range, as the font is used for a codepoint that may be outside its unicode-range. Similar to how we restrict the results of get_nominal_glyph, we would allow synthesis only for what is within the range.

> Special-case NARROW NO-BREAK SPACE, since it has semantic meaning in Mongolian encoding. Always, or if script is Mongolian.

Do you mean on the HarfBuzz side in shaping or clustering?
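As a rough illustration of the callback idea in this comment, the client-side check could be as small as the predicate below. The name `can_synthesize_space_for` comes from the proposal at the top of this issue; the signature and the `UnicodeRange` type are assumptions for this sketch, and no such callback exists in HarfBuzz.

```cpp
#include <vector>

// Hypothetical type: one inclusive interval from a CSS unicode-range descriptor.
struct UnicodeRange { char32_t first, last; };

// Hypothetical client callback: allow synthesizing a space for `cp` only if `cp`
// itself falls inside the unicode-range of the subset currently being shaped with.
bool can_synthesize_space_for(const std::vector<UnicodeRange>& unicode_range,
                              char32_t cp) {
  for (const UnicodeRange& r : unicode_range)
    if (cp >= r.first && cp <= r.last) return true;
  return false;
}
```

For a Latin subset declared as U+0000-00FF this answers no for U+202F, so the shaper would leave it unshaped and the cascade would move on to the next subset.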

behdad · 2023-11-22T14:07:25Z

> > How would your callback know what to answer?

> The callback would look at the unicode-range of the current subset and allow synthesis only if U+202F is within it. So it would effectively disable synthesis of U+202F while the Latin subset is in use. In a way, synthesis violates unicode-range, as the font is used for a codepoint that may be outside its unicode-range. Similar to how we restrict the results of get_nominal_glyph, we would allow synthesis only for what is within the range.

This is already a "problem" because of the composition/decomposition we do. So you might get a letter shaped that is outside of the unicode-range.

I want to avoid adding a new callback if we can find another way.

> > Special-case NARROW NO-BREAK SPACE, since it has semantic meaning in Mongolian encoding. Always, or if script is Mongolian.

> Do you mean on the HarfBuzz side in shaping or clustering?

I meant on the HarfBuzz side.

drott · 2023-11-22T14:28:06Z

Works for me if this can be addressed inside of HarfBuzz. Agree that composition/decomposition blurs those lines, too.

behdad · 2023-11-22T14:29:43Z

@jfkthame WDYT about never replacing NNBSP?

jfkthame · 2023-11-22T14:45:54Z

> @jfkthame WDYT about never replacing NNBSP?

I'd be hesitant to do that -- as long as we have a general behavior of synthesizing fallbacks for known Unicode "space" characters that aren't supported by the chosen font, we should do our best to support all of them.

Making this a special-case exception for U+202F in Mongolian script would be OK, I guess. But really the caller should be choosing the appropriate font before calling the shaper.

behdad · 2023-11-22T14:47:44Z

> But really the caller should be choosing the appropriate font before calling the shaper.

That's a chicken & egg issue because of the normalization step we do in HB; hence the shaper-driven approach Chrome takes.

drott added the Chrome (Chrome/Chromium project related issues and requests) label Nov 22, 2023

drott changed the title ~~Consider providing option for restricting space synthesis to stay within unicode-range~~ Space synthesis breaks Mongolian shaping in cascade through subsetted Noto Sans Mongolian despite unicode-range Nov 22, 2023
