Hyphenation minimal word length and casing are not UTF8-compliant #2018

Open
Omikhleia opened this issue May 6, 2024 · 0 comments · May be fixed by #2019
Labels
bug Software bug issue
Omikhleia commented May 6, 2024

Issue

Related to #2017 regarding the hard-coded minWord = 5 value, but this is a different type of issue:

The logic is not UTF-8 compliant:

if string.len(text) < self.minWord then return { text } end
local points = self.exceptions[text:lower()]
local word = SU.splitUtf8(text)
if not points then
points = SU.map(function ()return 0 end, word)
local work = SU.map(string.lower, word)

  • string.len counts bytes, not characters, so it is not UTF-8-safe, and the minWord value is likely not honored as it should be
    • minWord seems to correspond to LuaTeX's hyphenationmin
    • It could be argued that minWord is expressed in bytes here, but then it is not in line with leftmin/rightmin (which are counted in characters)...
  • text:lower() is not UTF-8-safe either
  • The same goes for the string.lower call a bit later in a SU.map (... and aren't we lowercasing a second time there? That part is arguably acceptable)
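
To illustrate the first point, here is a minimal, self-contained sketch of the byte count versus the character count. The utf8_length helper is hypothetical (it is not part of SILE) and only counts bytes that are not UTF-8 continuation bytes; the strings use byte escapes so the result does not depend on the file encoding.

```lua
-- Hypothetical helper (not a SILE function): count UTF-8 characters
-- by counting every byte that is not a continuation byte (0x80..0xBF).
local function utf8_length (text)
  local count = 0
  for i = 1, #text do
    local byte = string.byte(text, i)
    if byte < 0x80 or byte >= 0xC0 then
      count = count + 1
    end
  end
  return count
end

local word = "l\195\169ris" -- "léris", with "é" encoded as the two bytes C3 A9
local minWord = 6

-- Byte-based check, as in the current logic: 6 < 6 is false, so the word is hyphenated.
local skipped_by_bytes = string.len(word) < minWord
-- Character-based check: 5 < 6 is true, so the word would be left alone.
local skipped_by_chars = utf8_length(word) < minWord
```

With a character-based guard, "léris" would be treated as a 5-character word, consistent with how leftmin/rightmin are counted.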

Proofs / Minimal examples

In the second case below, with minWord set to 6, "léris" would be expected not to be hyphenated:

> SILE.showHyphenationPoints("léris", "fr")
lé-ris
> SILE._hyphenators["fr"].minWord
5
> SILE._hyphenators["fr"].minWord = 6
> SILE.showHyphenationPoints("léris", "fr")
lé-ris
> -- OOPS. "léris" is 5 characters long (but 6 bytes long)
> SILE._hyphenators["fr"].minWord = 7
> SILE.showHyphenationPoints("léris", "fr")
léris

Below, we override a pattern with an exception, but the exception is bypassed when the input is uppercase:

> SILE.call("hyphenator:add-exceptions", { lang="fr" }, { "légè-rement" }) -- Override as exception
> SILE.showHyphenationPoints("légèrement", "fr")
légè-rement
> SILE.showHyphenationPoints("LÉGÈREMENT", "fr")
LÉGÈ-RE-MENT
> -- OOPS, expected "LÉGÈ-REMENT"
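
The missed exception can be reproduced outside SILE with the sketch below. Everything in it is an assumption made for illustration: the exceptions table only mimics a lookup keyed by the lowercased word, and utf8_lower_sketch with its two-entry mapping is a hypothetical stand-in, not a proposed fix. A real fix would need full Unicode case mapping.

```lua
-- Hypothetical stand-in for an exception table keyed by the lowercased word.
local exceptions = { ["l\195\169g\195\168rement"] = true } -- "légèrement"

-- Deliberately tiny, hypothetical mapping: only the two letters used in this example.
local lower_map = {
  ["\195\137"] = "\195\169", -- É -> é
  ["\195\136"] = "\195\168", -- È -> è
}

-- Hypothetical helper: walk the string one UTF-8 character at a time
-- (a non-continuation byte followed by its continuation bytes).
local function utf8_lower_sketch (text)
  return (text:gsub("[^\128-\191][\128-\191]*", function (char)
    if #char == 1 then
      return string.lower(char) -- plain ASCII
    end
    return lower_map[char] or char -- multi-byte: use the table, else leave as-is
  end))
end

local upper = "L\195\137G\195\136REMENT" -- "LÉGÈREMENT"

-- string.lower works byte by byte, so the bytes of "É" and "È" are not mapped
-- to those of "é" and "è", and the lookup misses the exception.
local found_with_string_lower = exceptions[string.lower(upper)] ~= nil
-- Lowercasing per UTF-8 character finds it.
local found_with_utf8_lower = exceptions[utf8_lower_sketch(upper)] ~= nil
```

The same reasoning applies to the later string.lower call mapped over the split characters: each multi-byte character is passed through unchanged.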
@Omikhleia Omikhleia added the bug Software bug issue label May 6, 2024
@Omikhleia Omikhleia linked a pull request May 6, 2024 that will close this issue
@Omikhleia Omikhleia added this to the v0.15.0 milestone May 21, 2024