CJK Compatibility Ideographs #234

Open
paulmasson opened this issue Apr 22, 2024 · 5 comments

Comments

@paulmasson
Contributor

This issue is in reference to the recent commit a74696d. While I agree that most of these characters add no information to this part of the database, there are a few cases that should be revisited.

These two characters were recently added by me, the first in #124, the second in #133:

U+F9A8 令 kPhonetic 812
U+FA5B 者 kPhonetic 94

Both of these variants appear in Casey, which is why I added them. These should be restored.

This character was added recently by me in #49 as part of issue #48:

U+2F879 峀 kPhonetic 1512*

This variant draws the bottom half of the character in a way that shows the connection of U+5CC0 峀 to the group more clearly. Unless that visualization is not assured across platforms, it too should be restored.

This character has two variants in Casey, one without a dot and one with a dot:

U+F970 殺 kPhonetic 1111

Again, unless that visualization is not assured across platforms, it too should be restored.

Finally, the group from which these two characters were removed has four entries in Casey:

U+F98E 年 kPhonetic 977
U+F995 秊 kPhonetic 977

On my devices these two render precisely the same as the other two characters, so they don't appear to capture the information in Casey. I am ambivalent about restoring these two.

@kenlunde
Member

None of these should be restored. I will explain tomorrow when my mind is fresh.

@kenlunde
Member

All CJK Compatibility Ideographs normalize to corresponding CJK Unified Ideographs, which are referred to as their canonical equivalents. The following are the canonical equivalents of the six characters that you cited, all of which are associated with the same kPhonetic property values:

U+F9A8 令 = U+4EE4 令 (565 812)
U+FA5B 者 = U+8005 者 (94)

The above CJK Compatibility Ideographs have a K- or J-source, and if you look at the code chart glyphs for their canonical equivalents, you will see the same glyphs under two of the sources.

U+2F879 峀 = U+5CC0 峀 (1512*)

The above CJK Compatibility Ideograph will soon be orphaned, probably for Unicode Version 17.0 (2025), because U+5CC0 峀 will likely be disunified per document WG2 N5259 (aka IRG N2676 + ROK feedback):

https://www.unicode.org/wg2/docs/n5259-IRGN2676Disunify5CC0.pdf

The likely code point of the disunified form, which looks like U+2F879 峀, is U+2B73A.

U+F970 殺 = U+6BBA 殺 (46 1111 1281)
U+F98E 年 = U+5E74 年 (977)
U+F995 秊 = U+79CA 秊 (192 977)

The above three CJK Compatibility Ideographs are considered true duplicates of their canonical equivalents, at least when it comes to the K-source of their canonical equivalents.

Keep in mind that how ideographs appear on a particular platform depends on several factors, such as the platform itself (macOS versus Windows), the available fonts, and the language settings of the OS. It is always best to avoid CJK Compatibility Ideographs. WG2 and the UTC stopped accepting them over 10 years ago due to the issues that they cause.
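
The canonical equivalences listed above can be checked without relying on how any font draws the characters. The following is a minimal sketch using Python's standard `unicodedata` module; the `PAIRS` table and the helper name `canonical_equivalent` are illustrative and simply restate the six mappings from this comment.

```python
import unicodedata

# Compatibility ideograph -> canonical equivalent, as listed in this thread.
PAIRS = {
    0xF9A8: 0x4EE4,
    0xFA5B: 0x8005,
    0x2F879: 0x5CC0,
    0xF970: 0x6BBA,
    0xF98E: 0x5E74,
    0xF995: 0x79CA,
}


def canonical_equivalent(cp: int) -> int:
    """Return the code point that `cp` becomes under NFC normalization."""
    normalized = unicodedata.normalize("NFC", chr(cp))
    return ord(normalized)


for compat, unified in PAIRS.items():
    # Each compatibility ideograph should normalize to its unified form.
    assert canonical_equivalent(compat) == unified
    print(f"U+{compat:04X} normalizes to U+{unified:04X}")
```

Because normalization replaces the compatibility code point with the unified one, any text that passes through a normalizing step loses the distinction, whatever the original glyph looked like.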

@paulmasson
Contributor Author

From a technological point of view, I understand why you would want to discourage the use of compatibility ideographs in favor of their canonical equivalents. What bothers me is that Casey has cases, as noted above, where he explicitly includes variants of the root phonetic. Someone comparing Casey to the database will see discrepancies for these cases. How do you make it clear to that person that the data is accurate?

At the very least, the description of kPhonetic in the documentation should state that compatibility ideographs are explicitly excluded from this field.
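
If such an exclusion were documented, it could also be checked mechanically. The sketch below is only an illustration of that idea, not part of the project's tooling: the function names are hypothetical, and it assumes each data line begins with a `U+XXXX` token followed by whitespace-separated fields, with `#` marking comment lines.

```python
import unicodedata

# Code point ranges of the two CJK Compatibility Ideographs blocks.
COMPAT_BLOCKS = ((0xF900, 0xFAFF), (0x2F800, 0x2FA1F))


def is_compatibility_ideograph(ch: str) -> bool:
    """True if `ch` is in a compatibility block and has a canonical decomposition.

    The decomposition check matters because a few code points in the
    U+F900..U+FAFF block are unified ideographs with no decomposition.
    """
    cp = ord(ch)
    in_block = any(lo <= cp <= hi for lo, hi in COMPAT_BLOCKS)
    decomp = unicodedata.decomposition(ch)
    return in_block and bool(decomp) and not decomp.startswith("<")


def find_compat_entries(lines):
    """Return (compatibility, canonical) code point pairs found in `lines`.

    Assumed line format (hypothetical): 'U+XXXX <fields...>'.
    """
    found = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        token = line.split()[0]
        if not token.startswith("U+"):
            continue
        ch = chr(int(token[2:], 16))
        if is_compatibility_ideograph(ch):
            # The decomposition of these characters is a single hex code point.
            found.append((ord(ch), int(unicodedata.decomposition(ch), 16)))
    return found
```

Calling `find_compat_entries` on the lines of a data file would list every entry keyed by a compatibility ideograph alongside the unified ideograph it normalizes to, which is the information a reader comparing against Casey would need.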

@kenlunde
Member

kenlunde commented May 2, 2024

I may or may not have time to sufficiently explain this issue before I fly to Japan on Star Wars Day, but the main thing to consider is that relying on the glyphs that the OS displays is not a good way of determining whether a property value is appropriate. It is better to use the multicolumn code charts for the 10 CJK Unified Ideographs blocks for this purpose.

@kenlunde
Member

kenlunde commented May 2, 2024

For example, consider U+F970 殺 (1111) versus U+6BBA 殺 (46 1111 1281). Both forms—with and without the dot—appear in the multicolumn entry for U+6BBA.

[Image: U+6BBA]
