
Improve support for international character sets #74

ali1234 opened this issue Feb 12, 2023 · 0 comments
ali1234 commented Feb 12, 2023

I have a plan to do this by using the codecs module to register all the different teletext charset options as codecs, so that packet bytes can be converted directly to Unicode with e.g. bytes.decode('teletext-latin-1'), indicating the Latin G0/G1 set with national option 1.
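As a sketch of how this could look (the codec name, the table contents, and the helper names here are illustrative, not the final implementation), a search function registered with codecs.register can serve each charset as a named codec. The table below covers only two Latin G0 positions; a real table would cover all 96 printable positions per national option:

```python
import codecs

# Illustrative fragment of a teletext -> Unicode table (hypothetical names).
G0_LATIN_NO1 = {
    0x23: "\u00a3",  # UK national option: 0x23 displays as a pound sign
    0x7f: "\u25a0",  # 0x7f is the solid block character
}
_REVERSE = {v: k for k, v in G0_LATIN_NO1.items()}

def _decode(data, errors="strict"):
    out = []
    for b in bytes(data):
        b &= 0x7f                                # drop the parity bit
        out.append(G0_LATIN_NO1.get(b, chr(b)))  # C0 bytes pass through untouched
    return "".join(out), len(data)

def _encode(text, errors="strict"):
    return bytes(_REVERSE.get(ch, ord(ch)) for ch in text), len(text)

def _search(name):
    # codecs.lookup() normalises the requested name, so accept both
    # hyphenated and underscored spellings.
    if name.replace("-", "_") == "teletext_latin_1":
        return codecs.CodecInfo(_encode, _decode, name="teletext-latin-1")

codecs.register(_search)

print(b"\x23HELLO\x7f".decode("teletext-latin-1"))  # £HELLO■
```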

This can also go the other way with str.encode('teletext-latin-1'), and round-tripping bytes -> str -> bytes this way should reproduce the original bytes exactly - with the caveat that any parity errors in the original bytes will be "fixed". It may be possible to control this behaviour, allowing the user to raise an exception on parity errors instead.
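The parity handling might be sketched like this (the function name and the errors-style argument are assumptions, echoing the convention of the codecs error handlers):

```python
def check_parity(data: bytes, errors: str = "replace") -> bytes:
    """Strip the odd-parity bit from teletext data bytes.

    With errors="replace", bytes failing the parity check are "fixed" by
    substituting a space; with errors="strict" they raise instead.
    """
    out = bytearray()
    for i, b in enumerate(data):
        if bin(b).count("1") % 2 == 1:          # odd parity holds
            out.append(b & 0x7f)
        elif errors == "strict":
            raise UnicodeDecodeError(
                "teletext", bytes(data), i, i + 1, "parity error")
        else:
            out.append(0x20)                    # silently replace with a space
    return bytes(out)

print(check_parity(bytes([0xC1, 0xC2, 0x43])))  # b'ABC'
```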

Note that, when decoding, teletext spacing attributes 0x00 - 0x1f will be mapped to Unicode C0 0x00 - 0x1f; in other words, they will be left untouched. When encoding a string that uses a mixture of G0 and G1 (mosaics), the appropriate control codes could be inserted into the string automatically, but again this behaviour could be optional, with an exception raised instead.

Once this is done, the Printer class and its subclasses can be simplified to just use codecs; they will then only have to worry about converting the C0 characters to ANSI or HTML, or simply removing them (i.e. replacing them with spaces or the currently held mosaic).
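For instance, a minimal ANSI pass over a decoded string might look like this (the mapping of colour codes to SGR sequences and the function name are assumptions, not the actual Printer API):

```python
# Map the seven alpha colour spacing attributes 0x01-0x07 to ANSI SGR
# colour sequences (red .. white). A spacing attribute still occupies one
# character cell on screen, so a space is emitted in its place; every
# other C0 control is simply replaced by a space in this sketch.
ANSI_COLOURS = {code: f"\x1b[3{code}m" for code in range(1, 8)}

def c0_to_ansi(text: str) -> str:
    out = []
    for ch in text:
        if ord(ch) < 0x20:
            out.append(ANSI_COLOURS.get(ord(ch), "") + " ")
        else:
            out.append(ch)
    return "".join(out)

print(repr(c0_to_ansi("\x02HELLO")))  # '\x1b[32m HELLO'
```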

It turns out that multiple different Unicode mappings may be needed, because some environments can only render Unicode characters from the Basic Multilingual Plane, i.e. code points <= 0xffff, and the mosaic characters in Unicode are outside this plane. It should be possible to map all alphanumerics, though. Note that ZVBI uses a mapping that places Arabic alphanumerics in the Private Use Area.
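This is easy to demonstrate: the sextant mosaics live in Unicode's Symbols for Legacy Computing block, which starts at U+1FB00, above the BMP ceiling, while a typical G0 alphanumeric mapping fits comfortably inside it:

```python
# Mosaic (sextant) characters sit outside the Basic Multilingual Plane,
# while the G0 alphanumeric mappings fit inside it.
sextant = "\U0001fb00"   # BLOCK SEXTANT-1, from Symbols for Legacy Computing
pound = "\u00a3"         # pound sign, a typical G0 national-option mapping

print(hex(ord(sextant)), ord(sextant) > 0xFFFF)  # 0x1fb00 True
print(hex(ord(pound)), ord(pound) > 0xFFFF)      # 0xa3 False
```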

Steps:

1. Build the tables of teletext <-> unicode mappings. (In progress, see https://al.zerostem.io/~al/ttcharset/)
2. Register the codec functions for conversion.
3. Make low-level unit tests for the codecs.
4. Refactor the console/HTML converters to use codecs.
5. Make a generator producing a series of teletext test pages for each charset.
6. Add support for P/28 enhancement packets - these are per-page, so should be relatively easy. (Better codepage support #63)
7. Add support for M/29 enhancements - these are per-magazine, so a bit more tricky. (Better codepage support #63)
8. Add command-line support for a "local code of practice", i.e. the default charset when P/28 and M/29 are not present. (Better codepage support #63)
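For step 3, the low-level tests could assert the round-trip property directly. A self-contained sketch, using a stand-in 'teletext-stub' codec (an identity mapping via Latin-1, since the real codecs are not registered yet):

```python
import codecs
import unittest

def _stub_search(name):
    # Stand-in codec: Latin-1 is an identity byte <-> code-point mapping,
    # which is enough to exercise the round-trip property.
    if name.replace("-", "_") == "teletext_stub":
        return codecs.CodecInfo(
            encode=lambda s, errors="strict": (s.encode("latin-1"), len(s)),
            decode=lambda b, errors="strict": (bytes(b).decode("latin-1"), len(b)),
            name="teletext-stub",
        )

codecs.register(_stub_search)

class RoundTripTest(unittest.TestCase):
    def test_round_trip(self):
        # bytes -> str -> bytes should be the identity for parity-clean input
        raw = bytes(range(0x20, 0x7f))
        self.assertEqual(raw.decode("teletext-stub").encode("teletext-stub"), raw)
```

Run with `python -m unittest`; the real tests would parameterise this over every registered teletext codec and over deliberately corrupted input.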