
Improve support for international character sets #74

ali1234 opened this issue Feb 12, 2023 · 0 comments
ali1234 commented Feb 12, 2023

I have a plan to do this by using the codecs module to register all the different teletext charset options as codecs, so that packet bytes can be converted directly to Unicode with e.g. bytes.decode('teletext-latin-1'), indicating the Latin G0/G1 set with national option 1.
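As a sketch of how this could look (the codec name, the table contents, and the helper names here are illustrative, not the final implementation), a search function registered with codecs.register can serve each charset as a named codec. The table below covers only two Latin G0 positions; a real table would cover all 96 printable positions per national option:

```python
import codecs

# Illustrative fragment of a teletext -> Unicode table (hypothetical names).
G0_LATIN_NO1 = {
    0x23: "\u00a3",  # UK national option: 0x23 displays as a pound sign
    0x7f: "\u25a0",  # 0x7f is the solid block character
}
_REVERSE = {v: k for k, v in G0_LATIN_NO1.items()}

def _decode(data, errors="strict"):
    out = []
    for b in bytes(data):
        b &= 0x7f                                # drop the parity bit
        out.append(G0_LATIN_NO1.get(b, chr(b)))  # C0 bytes pass through untouched
    return "".join(out), len(data)

def _encode(text, errors="strict"):
    return bytes(_REVERSE.get(ch, ord(ch)) for ch in text), len(text)

def _search(name):
    # codecs.lookup() normalises the requested name, so accept both
    # hyphenated and underscored spellings.
    if name.replace("-", "_") == "teletext_latin_1":
        return codecs.CodecInfo(_encode, _decode, name="teletext-latin-1")

codecs.register(_search)

print(b"\x23HELLO\x7f".decode("teletext-latin-1"))  # £HELLO■
```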

This can also go the other way with str.encode('teletext-latin-1'), and round-tripping bytes -> str -> bytes this way should reproduce the original bytes exactly - with the caveat that any parity errors in the original bytes will be "fixed". It may be possible to control this behaviour, allowing the user to raise an exception on parity errors instead.
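The parity handling might be sketched like this (the function name and the errors-style argument are assumptions, echoing the convention of the codecs error handlers):

```python
def check_parity(data: bytes, errors: str = "replace") -> bytes:
    """Strip the odd-parity bit from teletext data bytes.

    With errors="replace", bytes failing the parity check are "fixed" by
    substituting a space; with errors="strict" they raise instead.
    """
    out = bytearray()
    for i, b in enumerate(data):
        if bin(b).count("1") % 2 == 1:          # odd parity holds
            out.append(b & 0x7f)
        elif errors == "strict":
            raise UnicodeDecodeError(
                "teletext", bytes(data), i, i + 1, "parity error")
        else:
            out.append(0x20)                    # silently replace with a space
    return bytes(out)

print(check_parity(bytes([0xC1, 0xC2, 0x43])))  # b'ABC'
```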

Note that, when decoding, teletext spacing attributes 0x00 - 0x1f will be mapped to Unicode C0 0x00 - 0x1f; in other words, they will be left untouched. When encoding a string that uses a mixture of G0 and G1 (mosaics), the appropriate control codes could be inserted into the string automatically, but again this behaviour could be optional, with an exception raised instead.

Once this is done, the Printer class and its subclasses can be simplified to just use codecs; they will then only have to worry about converting the C0 characters to ANSI or HTML, or simply removing them (i.e. replacing them with spaces or the currently held mosaic).
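For instance, a minimal ANSI pass over a decoded string might look like this (the mapping of colour codes to SGR sequences and the function name are assumptions, not the actual Printer API):

```python
# Map the seven alpha colour spacing attributes 0x01-0x07 to ANSI SGR
# colour sequences (red .. white). A spacing attribute still occupies one
# character cell on screen, so a space is emitted in its place; every
# other C0 control is simply replaced by a space in this sketch.
ANSI_COLOURS = {code: f"\x1b[3{code}m" for code in range(1, 8)}

def c0_to_ansi(text: str) -> str:
    out = []
    for ch in text:
        if ord(ch) < 0x20:
            out.append(ANSI_COLOURS.get(ord(ch), "") + " ")
        else:
            out.append(ch)
    return "".join(out)

print(repr(c0_to_ansi("\x02HELLO")))  # '\x1b[32m HELLO'
```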

It turns out that multiple different Unicode mappings may be needed, because some environments can only render Unicode characters from the Basic Multilingual Plane, i.e. code points <= 0xffff, and the mosaic characters in Unicode are outside this plane. It should be possible to map all alphanumerics, though. Note that ZVBI uses a mapping that places Arabic alphanumerics in the Private Use Area.
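This is easy to demonstrate: the sextant mosaics live in Unicode's Symbols for Legacy Computing block, which starts at U+1FB00, above the BMP ceiling, while a typical G0 alphanumeric mapping fits comfortably inside it:

```python
# Mosaic (sextant) characters sit outside the Basic Multilingual Plane,
# while the G0 alphanumeric mappings fit inside it.
sextant = "\U0001fb00"   # BLOCK SEXTANT-1, from Symbols for Legacy Computing
pound = "\u00a3"         # pound sign, a typical G0 national-option mapping

print(hex(ord(sextant)), ord(sextant) > 0xFFFF)  # 0x1fb00 True
print(hex(ord(pound)), ord(pound) > 0xFFFF)      # 0xa3 False
```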

Steps:

1. Build the tables of teletext <-> unicode mappings. (In progress, see https://al.zerostem.io/~al/ttcharset/)
2. Register the codec functions for conversion.
3. Make low-level unit tests for the codecs.
4. Refactor the console/HTML converters to use codecs.
5. Make a generator producing a series of teletext test pages for each charset.
6. Add support for P/28 enhancement packets - these are per-page, so should be relatively easy. (Better codepage support #63)
7. Add support for M/29 enhancements - these are per-magazine, so a bit more tricky. (Better codepage support #63)
8. Add command-line support for a "local code of practice", i.e. the default charset when P/28 and M/29 are not present. (Better codepage support #63)
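For step 3, the low-level tests could assert the round-trip property directly. A self-contained sketch, using a stand-in 'teletext-stub' codec (an identity mapping via Latin-1, since the real codecs are not registered yet):

```python
import codecs
import unittest

def _stub_search(name):
    # Stand-in codec: Latin-1 is an identity byte <-> code-point mapping,
    # which is enough to exercise the round-trip property.
    if name.replace("-", "_") == "teletext_stub":
        return codecs.CodecInfo(
            encode=lambda s, errors="strict": (s.encode("latin-1"), len(s)),
            decode=lambda b, errors="strict": (bytes(b).decode("latin-1"), len(b)),
            name="teletext-stub",
        )

codecs.register(_stub_search)

class RoundTripTest(unittest.TestCase):
    def test_round_trip(self):
        # bytes -> str -> bytes should be the identity for parity-clean input
        raw = bytes(range(0x20, 0x7f))
        self.assertEqual(raw.decode("teletext-stub").encode("teletext-stub"), raw)
```

Run with `python -m unittest`; the real tests would parameterise this over every registered teletext codec and over deliberately corrupted input.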