Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmap encoding selection: unicodeEncoding vs. microsoftUCS4Encoding #44

Open
xianpingge opened this issue Dec 19, 2016 · 2 comments
Open

Comments

@xianpingge
Copy link

I'm dumping the glyphs from HanaMinB.ttf
( available at

https://osdn.net/frs/redir.php?m=pumath&f=%2Fhanazono-font%2F64385%2Fhanazono-20160201.zip

), where most of the characters are > U+FFFF.

Enclosed please find the output of
ttfdump -t cmap HanaMinB.ttf

According to the ttfdump output, this ttf file contains 4 cmap
subtables, covering the 4 encodings defined in truetype.go:

unicodeEncoding = 0x00000003 // PID = 0 (Unicode), PSID = 3 (Unicode 2.0)
microsoftSymbolEncoding = 0x00030000 // PID = 3 (Microsoft), PSID = 0 (Symbol)
microsoftUCS2Encoding = 0x00030001 // PID = 3 (Microsoft), PSID = 1 (UCS-2)
microsoftUCS4Encoding = 0x0003000a // PID = 3 (Microsoft), PSID = 10 (UCS-4)

And the current code selects the first one (unicodeEncoding):

pidPsid := u32(table, offset)
// We prefer the Unicode cmap encoding. Failing to find that, we fall
// back onto the Microsoft cmap encoding.
if pidPsid == unicodeEncoding {
bestOffset, bestPID, ok = offset, pidPsid>>16, true
break
} else if pidPsid == microsoftSymbolEncoding ||
pidPsid == microsoftUCS2Encoding ||
pidPsid == microsoftUCS4Encoding {
bestOffset, bestPID, ok = offset, pidPsid>>16, true
// We don't break out of the for loop, so that Unicode can override Microsoft.
}

and none of the >U+FFFF characters are available.

Should we prefer microsoftUCS4Encoding to the
16-bit-only unicodeEncoding ?

HanaMinB.ttf-dump-cmap.txt

@nigeltao
Copy link
Contributor

Yeah, we should probably prefer microsoftUCS4Encoding.

@nigeltao
Copy link
Contributor

An alternative is to also accept PID = 0 (Unicode), PSID = 4 (Unicode 2.0, full repertoire, i.e. not restricted to the Basic Multilingual Plane). FWIW, ttx shows me 5 cmap subtables, not 4, for HanaMinB.ttf.

Also, the code as is prefers Unicode to Microsoft cmap encodings, but I can't remember the reason why, and maybe we don't need to. We should probably prefer cmap format 12 tables over cmap format 4, though, for the greater (non-BMP) range.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants