Bug: Handle Non Chinese simplified form in CJKRadicals-15.1.0.txt #10

Open
russcam opened this issue Apr 22, 2024 · 2 comments


russcam commented Apr 22, 2024

CJKRadicals-15.1.0.txt uses apostrophes after the radical number to indicate that the ideograph uses a standard simplification. From Unicode® Standard Annex #38 UNICODE HAN DATABASE (UNIHAN):

A single apostrophe indicates the Chinese simplified form of the radical (for example, U+9F7F 齿 for U+9F52 齒) and two apostrophes indicate the non-Chinese simplified form of the radical (for example, U+6B6F 歯 for U+9F52 齒).

The ProcessCjkRadicalsFile method handles the single-apostrophe case, but throws on the two-apostrophe case at:

int radicalIndex = int.Parse(isSimplified ? radicalIndexText.Substring(0, radicalIndexText.Length - 1) : radicalIndexText);
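The failure presumably comes from stripping at most one trailing character, which still leaves an apostrophe in the text passed to int.Parse. A minimal sketch of one way to accept zero, one, or two apostrophes follows. The names RadicalForm and RadicalIndexParser are hypothetical and are not part of the library; the sample input "211''" relies only on the fact that 齒 is radical 211.

```csharp
using System;
using System.Globalization;

// Hypothetical names for illustration; not the library's API.
public enum RadicalForm
{
    Traditional,          // no apostrophe, e.g. "211"
    ChineseSimplified,    // one apostrophe, e.g. "211'"
    NonChineseSimplified, // two apostrophes, e.g. "211''"
}

public static class RadicalIndexParser
{
    public static (int Index, RadicalForm Form) Parse(string radicalIndexText)
    {
        // Separate the digits from any trailing apostrophes.
        string digits = radicalIndexText.TrimEnd('\'');
        int apostropheCount = radicalIndexText.Length - digits.Length;

        RadicalForm form = apostropheCount switch
        {
            0 => RadicalForm.Traditional,
            1 => RadicalForm.ChineseSimplified,
            2 => RadicalForm.NonChineseSimplified,
            _ => throw new FormatException("Unexpected radical index: " + radicalIndexText),
        };

        // NumberStyles.None accepts decimal digits only, so stray characters still fail loudly.
        int index = int.Parse(digits, NumberStyles.None, CultureInfo.InvariantCulture);
        return (index, form);
    }
}
```

Counting the apostrophes instead of testing for a single suffix means a future third marker would raise a FormatException rather than being misread.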

Note also that a non-Chinese simplified form can have an empty CJK radical character field when the radical character is in neither the Kangxi Radicals block nor the CJK Radicals Supplement block, so the following line would also need to handle an empty field:

char radicalCodePoint = checked((char)int.Parse(reader.ReadTrimmedField(), NumberStyles.HexNumber));

I'd be happy to add support for the non-Chinese simplified form. How would you prefer to represent an empty character on CjkRadicalData - as char?
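As a rough sketch of the empty-field handling, assuming a nullable char is an acceptable representation, the hex parsing could return null instead of throwing. OptionalCodePointParser is a hypothetical helper name, and the field is passed in as a plain string rather than read through the library's reader.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not the library's API.
public static class OptionalCodePointParser
{
    // Returns null for an empty or whitespace-only field.
    public static char? Parse(string field)
    {
        string trimmed = field.Trim();
        if (trimmed.Length == 0)
        {
            return null;
        }

        // checked(...) keeps the original behavior of failing on values above U+FFFF.
        return checked((char)int.Parse(trimmed, NumberStyles.HexNumber, CultureInfo.InvariantCulture));
    }
}
```

For example, OptionalCodePointParser.Parse("2FD2") yields the Kangxi radical character U+2FD2, while OptionalCodePointParser.Parse(" ") yields null.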

Owner

hexawyz commented Apr 24, 2024

Oh, that's great, another breaking update to the database 😅

From what I understand, what they call "non-Chinese" are actually Japanese characters. (The one they give as an example is the Japanese kanji for tooth: 歯)
Before updating this, I'll do a quick sanity check that there is no weird stuff going on here, but the best solution would be to have "Chinese Simplified" and "Japanese Simplified" properties. (AFAIK, the PRC and Japan are the only two countries that have applied an official simplification process to Chinese characters, so hopefully there won't be an exception)
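The two-property idea could look roughly like the sketch below. RadicalForms is a hypothetical type, not the library's CjkRadicalData, and the "Japanese" naming assumes every two-apostrophe row is a Japanese form, which is not confirmed here. Code points are stored as int so that ideographs outside the BMP would also fit.

```csharp
using System;

// Hypothetical data shape for illustration; not the library's API.
public sealed class RadicalForms
{
    public RadicalForms(int radicalIndex, int traditional, int? chineseSimplified, int? japaneseSimplified)
    {
        RadicalIndex = radicalIndex;
        Traditional = traditional;
        ChineseSimplified = chineseSimplified;
        JapaneseSimplified = japaneseSimplified;
    }

    public int RadicalIndex { get; }

    // Code point of the traditional ideograph, e.g. U+9F52 for radical 211.
    public int Traditional { get; }

    // Null when the radical has no single-apostrophe row.
    public int? ChineseSimplified { get; }

    // Null when the radical has no double-apostrophe row.
    public int? JapaneseSimplified { get; }
}
```

With the example from UAX #38, radical 211 would be built as new RadicalForms(211, 0x9F52, 0x9F7F, 0x6B6F).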

Owner

hexawyz commented Apr 24, 2024

So, I checked, and…
For radical 182, I'm not sure where it comes from 🙁
For radical 208, it is indeed a Japanese kanji, but a lesser-used variant. (And also not a radical? The traditional one is still the official radical)
Others seem to be ok.

I don't really know what to make of it. It would seem that when the radical field is empty, it means that the character is an alternate (simplified) writing and not a proper radical, but that's a weird way to reference words here… 🤔
