Chinese Private Use Area code points

Many decoders for legacy Chinese encodings produce Private Use Area (PUA) code points for certain characters. Such assignments can cause problems because multiple, conflicting PUA agreements exist. Since almost all of these characters now have formal assignments, the PUA is no longer necessary for expressing them, and for consistency it is generally desirable to normalize such characters to their formal code points. ChineseUtils.normalize warns about PUA code points found in strings.
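
For reference, PUA code points in the Basic Multilingual Plane occupy U+E000–U+F8FF, so affected strings are easy to spot. A minimal illustration (the helper name find_bmp_pua is made up here; this is not how ChineseUtils.normalize is implemented):

def find_bmp_pua(s):
    '''Return (index, code point) pairs for BMP Private Use Area characters.'''
    return [(i, 'U+%04X' % ord(c))
            for i, c in enumerate(s) if 0xE000 <= ord(c) <= 0xF8FF]

# find_bmp_pua('\ue7c7') == [(0, 'U+E7C7')]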

This article contains a Python 3 script that replaces such code points with formal ones.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'''Normalizes PUA code points generated by decoders. Released under CC0.'''
import argparse
import os
import sys

conv = {}

GB 18030 & GBK

GB 18030 & GBK used a total of 95 PUA code points, and even the latest (2005) version of GB 18030 contains 24 PUA code points.

The PUA code points handled below fall in the ranges U+E78D–U+E795, U+E7C7–U+E7C8, U+E7E7–U+E7F3, and U+E815–U+E864.

# CC0, https://github.com/The-Orizon/nlputils/blob/master/gbk_pua.py
# Each table is a pair (PUA code points, corresponding formal characters) of equal length.
_gbk_table_bmp = ((
    '\ue78d\ue78e\ue78f\ue790\ue791\ue792\ue793\ue794\ue795\ue7c7\ue7c8'
    '\ue7e7\ue7e8\ue7e9\ue7ea\ue7eb\ue7ec\ue7ed\ue7ee\ue7ef\ue7f0'
    '\ue7f1\ue7f2\ue7f3\ue815\ue819\ue81a\ue81b\ue81c\ue81d\ue81e'
    '\ue81f\ue820\ue821\ue822\ue823\ue824\ue825\ue826\ue827\ue828'
    '\ue829\ue82a\ue82b\ue82c\ue82d\ue82e\ue82f\ue830\ue832'
    '\ue833\ue834\ue835\ue836\ue837\ue838\ue839\ue83a\ue83c'
    '\ue83d\ue83e\ue83f\ue840\ue841\ue842\ue843\ue844\ue845\ue846'
    '\ue847\ue848\ue849\ue84a\ue84b\ue84c\ue84d\ue84e\ue84f\ue850'
    '\ue851\ue852\ue853\ue854\ue856\ue857\ue858\ue859\ue85a'
    '\ue85b\ue85c\ue85d\ue85e\ue85f\ue860\ue861\ue862\ue863\ue864'), (
    '︐︒︑︓︔︕︖︗︘ḿǹ〾⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻'  # ,。、:;!?〖〗
    '⺁⺄㑳㑇⺈⺋龴㖞㘚㘎⺌⺗㥮㤘龵㧏㧟㩳㧐龶龷㭎㱮㳠⺧龸⺪䁖䅟⺮䌷⺳'
    '⺶⺷䎱䎬⺻䏝䓖䙡䙌龹䜣䜩䝼䞍⻊䥇䥺䥽䦂䦃䦅䦆䦟䦛䦷䦶龺䲣䲟䲠䲡䱷䲢䴓'
    '䴔䴕䴖䴗䴘䴙䶮龻'
))
_gbk_table = (('\ue816\ue817\ue818\ue831\ue83b\ue855'), ('𠂇𠂉𠃌𡗗𢦏𤇾'))
conv['gbk_bmp'] = str.maketrans(*_gbk_table_bmp)
conv['gbk'] = dict(conv['gbk_bmp'])
conv['gbk'].update(str.maketrans(*_gbk_table))
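
As a quick sanity check of the tables above (illustrative only, not part of the original script): per the table, U+E7C7 and U+E7C8 are the GBK PUA assignments for ḿ and ǹ, and U+E816 maps to the SIP character 𠂇.

# Illustrative sanity check of the GBK tables above.
assert '\ue7c7\ue7c8'.translate(conv['gbk']) == 'ḿǹ'
assert '\ue816'.translate(conv['gbk']) == '𠂇'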

Big5-HKSCS

The HKSCS extension of Big5 occupies EUDC (end-user-defined character) areas of Big5, which naive Big5 decoders map to PUA code points. However, since the characters in HKSCS are well defined, the publisher of HKSCS provides separate mappings for the extended parts; PUA-free mappings are available for the 2004 and 2008 versions of the standard.

The PUA code points used for Big5 EUDC run from U+E000 to U+F848. Big5-HKSCS maps nothing in the EUDC rows 81 40–86 FE, so the first 6 × 157 = 942 of those code points (U+E000–U+E3AD) are unused.
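
For reference, the arithmetic behind the 942 figure: each Big5 row has 157 trail-byte cells (40–7E and A1–FE), and rows 81–86 make six rows.

# Illustration of the arithmetic above (not used by the script).
cells_per_row = (0x7E - 0x40 + 1) + (0xFE - 0xA1 + 1)  # 63 + 94 = 157
unused_pua = (0x86 - 0x81 + 1) * cells_per_row         # 6 * 157 = 942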

The following mapping generation depends on New2003cmp_2008.txt.

# The table seems too long to be included here.
# You can dump the tables out to replace this chunk of code.
conv['hkscs_bmp'] = {}
conv['hkscs'] = {}
try:
    # http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/New2003cmp_2008.txt
    with open('New2003cmp_2008.txt') as hkscs_pua_map:
        started = False
        for entry in hkscs_pua_map:
            fields = entry.split()
            # Skip the file header; data rows begin after the separator line
            # (the first line whose first field starts with '=').
            if not started:
                if fields and fields[0][0] == '=':
                    started = True
                continue
            if len(fields) != 3:
                continue
            c_pua, c_uni, c_big5 = fields

            # Ignore rows whose Big5 column is not a hexadecimal code.
            try:
                _ = int(c_big5, 16)
            except ValueError:
                continue

            # str.translate needs integer keys; the PUA column is assumed to be
            # a bare hexadecimal code point, like the Unicode column.
            if c_uni[0] == '<':
                # Compound mapping such as <XXXX,YYYY>: join the parts into one string.
                conv['hkscs_bmp'][int(c_pua, 16)] = ''.join(
                    chr(int(cp, 16)) for cp in c_uni[1:-1].split(','))
            else:
                cp_uni = int(c_uni, 16)
                if cp_uni <= 0xFFFF:
                    conv['hkscs_bmp'][int(c_pua, 16)] = cp_uni
                else:
                    conv['hkscs'][int(c_pua, 16)] = cp_uni
except Exception:
    import traceback
    print("Failed to load HKSCS:", file=sys.stderr)
    traceback.print_exception(*sys.exc_info())

conv['hkscs'].update(conv['hkscs_bmp'])

GCCS

GCCS is the precursor of HKSCS; it included a few characters that were later unified with others in HKSCS. A compatibility mapping to Big5-HKSCS can be used to convert the PUA code points generated by the Big5 EUDC mapping.

# Omitted, rarely needed.
"""
conv['gccs_hkscs'] should be a str.translate dict that maps obsolete GCCS
code points (as EUDC-PUA) to unified HKSCS code points (non-PUA) when available.

The BMP version of this mapping should provide the PUA code point of the
corresponding HKSCS character when the actual code point falls outside the BMP.
"""

HKSCS-2004 Annex IV (PDF), which contains glyphs for these GCCS code points, may be helpful for re-mapping non-verifiable characters to Unihan.

Wrap-up

# Aliases: 'zh' merges HKSCS and GBK (GBK takes precedence); 'bmp' is the BMP-only variant.
conv['zh'] = dict(conv['hkscs'])
conv['zh'].update(conv['gbk'])
conv['bmp'] = dict(conv['hkscs_bmp'])
conv['bmp'].update(conv['gbk_bmp'])

parser = argparse.ArgumentParser(description='Normalize PUA code points '
                                             'for Chinese encodings.')
parser.add_argument('--conv', nargs='?', type=str, default='zh',
                    help=('mappings to use, sorted by fallback priority, '
                          'separated by commas (","). Defaults to "zh". '
                          'Available mappings: ' + ', '.join(conv.keys())))
parser.add_argument('--inplace', action='store_true', default=False,
                    help='perform in-place conversion, suppress stdout')
parser.add_argument('--isuffix', type=str, default='', nargs='?',
                    help='suffix for backup file in in-place mode')
parser.add_argument('files', nargs='*', type=str)
args = parser.parse_args()

realconv = {}
# --conv lists mappings by fallback priority, so apply them in reverse order
# and let earlier (higher-priority) mappings overwrite on conflict.
for k in reversed(args.conv.split(',')):
    realconv.update(conv[k])


def cat_conv(f):
    for ln in f:
        sys.stdout.write(ln.translate(realconv))

if args.files:
    if args.inplace:
        for file in args.files:
            with open(file, 'r') as f:
                s = f.read()

            if args.isuffix:
                os.rename(file, file + args.isuffix)

            with open(file, 'w') as f:
                f.write(s.translate(realconv))
    else:
        for file in args.files:
            with open(file) as f:
                cat_conv(f)
else:
    cat_conv(sys.stdin)

This page was previously named "Chinese Private Use Area codepoints".