luatex + harfbuzz and the zero width joiner U+200D #418

Open
ralessi opened this issue Feb 29, 2020 · 7 comments

Comments

@ralessi

ralessi commented Feb 29, 2020

In some cases, namely when commands are inserted between characters, luatex + harfbuzz do not seem to handle the zero width joiner character (U+200D) properly. Consider the following example, to be compiled with lualatex-dev:

\documentclass[12pt]{article}
\usepackage{fontspec}

\newfontfamily\arabicfont{Amiri}[Script=Arabic]
\newfontfamily\arabicfonthb{Amiri}[Script=Arabic,Renderer=Harfbuzz]

\usepackage{ulem}

\begin{document}

\textdir TRT\arabicfont
دَخَلَ مُب‍\uline{‍تَ‍}‍سِمًا

\medskip

\textdir TRT\arabicfonthb
دَخَلَ مُب‍\uline{‍تَ‍}‍سِمًا

\end{document}

[Image: test-zwj]

@u-fischer
Member

I don't get your output with the development version of luaotfload. With it, the result looks like this:

[Image]

This is still not correct, but:

  • I don't think it is a fontspec issue; it would be better reported on the luaotfload GitHub.
  • It is probably not even a luaotfload issue: you are inserting a rule between the characters, and HarfBuzz doesn't like this.
  • You should probably do the underlining with Lua code to avoid this side effect. See, e.g., this code from @zauguin: https://tex.stackexchange.com/a/446488/2388

@ralessi
Author

ralessi commented Feb 29, 2020

Thank you for the references, which I will explore. I suspected that this might be unrelated to fontspec. Do you think it would be worth reporting this issue, which may again turn out to be unrelated, to the luaotfload bug tracker?

@khaledhosny
Contributor

khaledhosny commented Mar 1, 2020

FWIW, this seems to be a regression in luaotfload. Trying the following with harflatex and the old harf code:

\documentclass[12pt]{minimal}
\usepackage{harfload}
\usepackage{ulem}
\begin{document}

\font\arabicfont="[Amiri-Regular.ttf]:mode=harf"
\textdir TRT\arabicfont
مُب^^^^200d\uline{^^^^200dتَ^^^^200d}^^^^200dسِم

\end{document}

Gives:

@zauguin
Member

zauguin commented Mar 3, 2020

This was a luaotfload bug which is resolved in the latest dev branch.

@zauguin
Member

zauguin commented Mar 3, 2020

The behavior of HarfBuzz seems a bit odd here but I don't know enough about the script to say if it is a bug or expected behaviour:

The luaotfload bug was that in \hboxes the direction wasn't recognized correctly. So the \uline argument was set as TLT instead of TRT.

Now to the odd part: for some reason, HarfBuzz seems to reverse the cluster with the Arabic characters and ignore the preceding ZWJ. This can be reproduced with hb-shape:

hb-shape --direction=rtl --font-file "$(kpsewhich Amiri-Regular.ttf)" --script=arab --unicodes=U+200D,U+062A,U+064E,U+200D

gives

[space=1+0|uni064E=1@-188,0+0|uni062A.medi=1+244|space=0+0]

as expected, but replacing --direction=rtl with --direction=ltr gives

[space=0+0|space=1+0|uni064E=1@-212,0+0|uni062A.init=1+190]

In particular, both space glyphs representing the ZWJs end up at the beginning, and the initial form is used instead of the medial one.
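To compare the two hb-shape results above glyph by glyph, a small parser can help. The following is a minimal sketch, assuming the default horizontal hb-shape output format `name=cluster@x_offset,y_offset+x_advance`, where the offset part is omitted when it is zero. The function name `parse_hb_shape` is hypothetical and not part of HarfBuzz.

```python
import re

# One glyph entry as printed above: name=cluster[@dx,dy]+advance
# (assumes horizontal output with a single advance value)
GLYPH_RE = re.compile(
    r"(?P<name>[^=|\[\]]+)=(?P<cluster>\d+)"
    r"(?:@(?P<dx>-?\d+),(?P<dy>-?\d+))?"
    r"\+(?P<advance>-?\d+)"
)


def parse_hb_shape(line):
    """Return a list of (name, cluster, dx, dy, advance) tuples."""
    glyphs = []
    for item in line.strip().strip("[]").split("|"):
        m = GLYPH_RE.fullmatch(item)
        if m is None:
            raise ValueError(f"unrecognized glyph entry: {item!r}")
        glyphs.append((
            m["name"],
            int(m["cluster"]),
            int(m["dx"] or 0),   # missing offsets mean zero
            int(m["dy"] or 0),
            int(m["advance"]),
        ))
    return glyphs


rtl = parse_hb_shape("[space=1+0|uni064E=1@-188,0+0|uni062A.medi=1+244|space=0+0]")
ltr = parse_hb_shape("[space=0+0|space=1+0|uni064E=1@-212,0+0|uni062A.init=1+190]")

# Print glyph name and cluster for each run, in visual (output) order.
print([(g[0], g[1]) for g in rtl])
print([(g[0], g[1]) for g in ltr])
```

Printing the name and cluster pairs makes the difference easy to see: in the rtl run the two space glyphs sit at either end, while in the ltr run both come first.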

@khaledhosny Is this supposed to happen?

@khaledhosny
Contributor

Yes, sort of.

HarfBuzz wants to shape scripts in their native direction. So when a direction other than the native direction is set for a script, HarfBuzz will reverse the buffer before shaping. It will also avoid breaking grapheme clusters, as one does not want, say, a mark to precede its base. ZWJ is a grapheme extender, so the first ZWJ is considered a grapheme cluster by itself (as it extends nothing), and base+mark+ZWJ is considered another grapheme cluster.

<U+200D>,<U+062A,U+064E,U+200D>

After reversal:

<U+062A,U+064E,U+200D>,<U+200D>

After shaping, the buffer will be reversed again, since the native direction is RTL (a simple reversal this time, with no grapheme cluster business).

U+062A,U+064E,U+200D,U+200D

After reversal:

U+200D,U+200D,U+064E,U+062A

If you set the script to latn when the direction is ltr, no reversal will happen:

$ hb-shape --direction=ltr --font-file "$(kpsewhich Amiri-Regular.ttf)" --script=latn --unicodes=U+200D,U+062A,U+064E,U+200D
[space=0+0|uni062A=1+926|uni064E=1+0|space=1+0]

latn with rtl will do the initial reversal but not the last one:

$ hb-shape --direction=rtl --font-file "$(kpsewhich Amiri-Regular.ttf)" --script=latn --unicodes=U+200D,U+062A,U+064E,U+200D
[uni062A=1+926|uni064E=1+0|space=1+0|space=0+0]

Shaping a script in a direction other than its native direction is risky and unlikely to always give meaningful results.
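The two reversal steps described above can be modelled in a few lines. This is only a sketch of the buffer ordering, not HarfBuzz's actual implementation: it does no shaping, and it assumes a simplified cluster rule in which combining marks and ZWJ attach to the preceding character. The names `is_continuation`, `grapheme_clusters`, and `model_order` are hypothetical.

```python
import unicodedata

ZWJ = 0x200D


def is_continuation(cp):
    # Simplified assumption: combining marks and ZWJ extend the previous cluster.
    return cp == ZWJ or unicodedata.category(chr(cp)).startswith("M")


def grapheme_clusters(cps):
    """Group code points into clusters; a leading ZWJ forms its own cluster."""
    clusters = []
    for cp in cps:
        if clusters and is_continuation(cp):
            clusters[-1].append(cp)
        else:
            clusters.append([cp])
    return clusters


def model_order(cps, direction, native):
    """Model the output order for a requested direction and a script's native one."""
    buf = list(cps)
    if direction != native:
        # Step 1: reverse the order of clusters, keeping each cluster intact.
        buf = [cp for cluster in reversed(grapheme_clusters(buf)) for cp in cluster]
    if native == "rtl":
        # Step 2: after shaping in the native RTL direction, a plain reversal.
        buf.reverse()
    return buf


text = [0x200D, 0x062A, 0x064E, 0x200D]
for script, native in (("arab", "rtl"), ("latn", "ltr")):
    for direction in ("rtl", "ltr"):
        order = model_order(text, direction, native)
        print(script, direction, ["U+%04X" % cp for cp in order])
```

Under these assumptions, the model reproduces the glyph order of all four hb-shape runs in this thread, including the arab with ltr case where both ZWJs end up at the start.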

@zauguin
Member

zauguin commented Mar 4, 2020

@khaledhosny Thank you.
