Bug: Handle Non Chinese simplified form in CJKRadicals-15.1.0.txt #10

russcam · 2024-04-22T12:30:14Z

CJKRadicals-15.1.0.txt uses apostrophes after the radical number to indicate that the ideograph uses a standard simplification. From Unicode® Standard Annex #38 UNICODE HAN DATABASE (UNIHAN):

A single apostrophe indicates the Chinese simplified form of the radical (for example, U+9F7F 齿 for U+9F52 齒) and two apostrophes indicate the non-Chinese simplified form of the radical (for example, U+6B6F 歯 for U+9F52 齒).

The ProcessCjkRadicalsFile method handles the single-apostrophe case, but throws on the two-apostrophe case, since at most one trailing character is stripped before the call to int.Parse, at

NetUnicodeInfo/System.Unicode.Build.Core/UnicodeDataProcessor.cs

Line 246 in 16ae6bc

    int radicalIndex = int.Parse(isSimplified ? radicalIndexText.Substring(0, radicalIndexText.Length - 1) : radicalIndexText);
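
A minimal sketch of how both suffixes could be parsed, for illustration only. The names `RadicalForm` and `RadicalIndexParser` are hypothetical and do not exist in NetUnicodeInfo; the only assumption about the input is the UAX #38 convention quoted above (zero, one, or two trailing apostrophes).

```csharp
using System;
using System.Globalization;

// Hypothetical enum: which form of the radical a CJKRadicals.txt line describes.
public enum RadicalForm
{
    Traditional,          // no apostrophe
    ChineseSimplified,    // one apostrophe, e.g. 211'
    NonChineseSimplified  // two apostrophes, e.g. 211''
}

// Hypothetical helper, not part of the existing code base.
public static class RadicalIndexParser
{
    public static (int Index, RadicalForm Form) Parse(string radicalIndexText)
    {
        // Find where the digits end by skipping trailing apostrophes.
        int digitsLength = radicalIndexText.Length;
        while (digitsLength > 0 && radicalIndexText[digitsLength - 1] == '\'')
        {
            digitsLength--;
        }

        int apostropheCount = radicalIndexText.Length - digitsLength;
        RadicalForm form = apostropheCount switch
        {
            0 => RadicalForm.Traditional,
            1 => RadicalForm.ChineseSimplified,
            2 => RadicalForm.NonChineseSimplified,
            _ => throw new FormatException("Unexpected radical suffix: " + radicalIndexText)
        };

        // NumberStyles.None accepts decimal digits only, so a stray character still throws.
        int index = int.Parse(
            radicalIndexText.Substring(0, digitsLength),
            NumberStyles.None,
            CultureInfo.InvariantCulture);

        return (index, form);
    }
}
```

Returning the form alongside the index keeps the existing "is simplified" information and leaves room for a separate property for the non-Chinese form.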

Note also that the non-Chinese simplified form of the radical can have an empty CJK radical character field. This happens when the radical character is not included in the Kangxi Radicals block or the CJK Radicals Supplement block, so the following would also need to handle an empty field:

NetUnicodeInfo/System.Unicode.Build.Core/UnicodeDataProcessor.cs

Line 251 in 16ae6bc

    char radicalCodePoint = checked((char)int.Parse(reader.ReadTrimmedField(), NumberStyles.HexNumber));
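
One possible shape for the empty-field handling, sketched under the assumption that the field has already been read as a string. `RadicalFieldParser` and `ParseOptionalRadical` are hypothetical names, and returning `char?` is only one of the options being asked about below.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the existing code base.
public static class RadicalFieldParser
{
    // Returns null when the radical field is empty, otherwise the parsed UTF-16 code unit.
    public static char? ParseOptionalRadical(string field)
    {
        string trimmed = field.Trim();
        if (trimmed.Length == 0)
        {
            return null;
        }

        // checked() keeps the original behavior of throwing on values above U+FFFF.
        return checked((char)int.Parse(trimmed, NumberStyles.HexNumber, CultureInfo.InvariantCulture));
    }
}
```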

I'd be happy to add support for the non-Chinese simplified form. How would you prefer to represent an empty character on CjkRadicalData - as char?

hexawyz · 2024-04-24T18:31:01Z

Oh, that's great, another breaking update to the database 😅

From what I understand, what they call "non-Chinese" are actually Japanese characters. (The one they give as an example is the Japanese kanji for tooth: 歯)
Before updating this, I'll do a quick sanity check that there is no weird stuff going on here, but the best solution would be to have "Chinese Simplified" and "Japanese Simplified" properties. (AFAIK, the PRC and Japan are the only two countries to have applied an official simplification process to Chinese characters, so hopefully there won't be an exception)

hexawyz · 2024-04-24T19:31:18Z

So, I checked, and…
For radical 182, I'm not sure where it comes from 🙁
For radical 208, it is indeed a Japanese kanji, but a lesser-used variant. (And also not a radical? The traditional one is still the official radical)
Others seem to be ok.

I don't really know what to make of it. It would seem that when the radical field is empty, it means that the character is an alternate (simplified) writing and not a proper radical, but that's a weird way to reference words here… 🤔