Prototype: unicode string support #1517

cvrunmin · 2023-07-09T13:25:33Z

This aims to address issue #860 about reading and writing Unicode characters in terminals.

~~This pull request mostly adopts the first route in the discussion ("separate versions of methods for unicode").~~ (Edit: no longer valid as of the 12 July commit.)

Additions

utflib api

This API provides a UTFString "class" that wraps a UTF-8-encoded byte string and acts as a normal string. All functions in the standard string library except string.dump, string.pack, string.packsize and string.unpack are also provided on UTFString, so users can adopt Unicode strings in existing code with little effort. To get a Latin-1 string from a UTFString, use UTFString:toLatin(); tostring returns the backing byte string.
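As a rough illustration of the wrapper idea, here is a minimal sketch in plain Lua. It is not the PR's utflib code: the field names (`bytes`, `chars`), the `new` constructor and the restriction to positive indices are all assumptions made for the example, and the input is assumed to be valid UTF-8.

```lua
-- Hypothetical sketch of a UTFString-like wrapper; not the utflib implementation.
local UTFString = {}
UTFString.__index = UTFString

-- Split a UTF-8 byte string into one Lua string per codepoint.
-- Assumes the input is already valid UTF-8.
local function splitCodepoints(bytes)
  local chars = {}
  for ch in bytes:gmatch("[^\128-\191][\128-\191]*") do
    chars[#chars + 1] = ch
  end
  return chars
end

local function new(bytes)
  return setmetatable({ bytes = bytes, chars = splitCodepoints(bytes) }, UTFString)
end

-- Length in codepoints rather than bytes.
function UTFString:len()
  return #self.chars
end

-- Substring by codepoint index (positive indices only in this sketch).
function UTFString:sub(i, j)
  i = math.max(i or 1, 1)
  j = math.min(j or #self.chars, #self.chars)
  return new(table.concat(self.chars, "", i, j))
end

-- tostring returns the backing UTF-8 bytes, as described above.
UTFString.__tostring = function(self)
  return self.bytes
end

local s = new("h\195\169llo")      -- "héllo" encoded as UTF-8
print(#tostring(s), s:len())       -- 6 bytes, 5 codepoints
print(tostring(s:sub(2, 3)))       -- the UTF-8 bytes for "él"
```

The point of the sketch is only that indexing happens per codepoint while the stored data stays a byte string.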

Besides UTFString, the module also exports the following functions:
1. fromLatin(str): treats the whole string as Latin-1 and converts it to UTF-8. This is needed because UTFString(str) assumes the string is already UTF-8-encoded and only treats invalid byte subsequences as Latin-1 when converting them.
2. isUTFString(v): returns true if v is a UTFString.
3. wrapStr(str): wraps a Lua string so that a normal string can be compared with a Unicode string.
4. isStringWrapper(v): returns true if v is a string wrapper created by wrapStr(str).
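The difference between the two conversions can be sketched as follows. This is an assumption-laden illustration, not the PR's code: `decodeLenient` is a hypothetical name standing in for what the UTFString constructor is described as doing, and the validity check is simplified (it does not reject overlong forms or surrogates).

```lua
-- Encode one Latin-1 byte (128..255) as a two-byte UTF-8 sequence.
local function latinByteToUtf8(ch)
  local b = ch:byte()
  return string.char(0xC0 + math.floor(b / 64), 0x80 + b % 64)
end

-- fromLatin: treat every byte of the input as Latin-1.
local function fromLatin(str)
  return (str:gsub("[\128-\255]", latinByteToUtf8))
end

-- decodeLenient (hypothetical name): keep well-formed multi-byte sequences,
-- and re-encode any stray high byte as if it were Latin-1.
local function decodeLenient(str)
  local out, i = {}, 1
  while i <= #str do
    local b = str:byte(i)
    -- Expected sequence length from the lead byte; 0 means "not a lead byte".
    local n = (b < 0x80 and 1)
      or (b >= 0xC2 and b <= 0xDF and 2)
      or (b >= 0xE0 and b <= 0xEF and 3)
      or (b >= 0xF0 and b <= 0xF4 and 4)
      or 0
    local seq = n > 0 and str:sub(i, i + n - 1) or ""
    -- Valid if complete and every byte after the first is a continuation byte.
    if n >= 1 and #seq == n and not seq:find("[^\128-\191]", 2) then
      out[#out + 1] = seq
      i = i + n
    else
      out[#out + 1] = latinByteToUtf8(str:sub(i, i))
      i = i + 1
    end
  end
  return table.concat(out)
end
```

With these definitions, a string that is already UTF-8 passes through `decodeLenient` unchanged but is double-encoded by `fromLatin`, which is why both entry points are useful.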

shell.unicode, edit.unicode, lua.unicode settings

New settings allow the shell, edit and lua programs to receive and print Unicode strings. These settings do not affect other programs, in particular user-defined ones.

Changes

The terminal font texture now supports Unicode characters by baking them dynamically. Currently it simply uses the GNU Unifont shipped with vanilla Minecraft (also known as Legacy Unicode, uniform, or the "Force Unicode Font" option).
- Monitors in TBO drawing mode do not support this yet, as it requires changing shader code.

TermMethods.write and TermMethods.blit functions

These now accept UTFString properly and no longer write "table: 0x??????" on screen.
- The current check identifies a UTFString via a duck test, which is not very clean and seems to have a strong performance impact on the edit program when Unicode text is present.
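To make the duck-test concern concrete, here is a small Lua sketch contrasting a field-by-field duck test with a metatable identity check. Everything here is hypothetical (the `UTFString` table, its fields and both function names); it only illustrates the two styles of check and does not reproduce how the PR performs it.

```lua
-- Hypothetical class table standing in for UTFString.
local UTFString = {}
UTFString.__index = UTFString
function UTFString:len() return #self.bytes end
function UTFString:toLatin() return self.bytes end  -- stub body for the sketch

-- Duck test: inspect several fields on every call.
local function isUTFStringDuck(v)
  return type(v) == "table"
    and type(v.bytes) == "string"
    and type(v.len) == "function"
    and type(v.toLatin) == "function"
end

-- Identity test: a single metatable comparison.
local function isUTFStringMeta(v)
  return type(v) == "table" and getmetatable(v) == UTFString
end

local real = setmetatable({ bytes = "abc" }, UTFString)
local lookalike = { bytes = "abc", len = function() end, toLatin = function() end }

print(isUTFStringDuck(real), isUTFStringDuck(lookalike))
print(isUTFStringMeta(real), isUTFStringMeta(lookalike))
```

The duck test accepts any table with the right shape and does more lookups per call; the identity check is stricter and cheaper, at the cost of requiring one shared class table.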

char and paste events

These now also send a UTF-8-encoded string as the second parameter.

read function

read now accepts _bReadUnicode as its 5th argument, indicating whether it should produce a UTFString (when true) or a normal string (otherwise).
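A Unicode-aware program could consume the extra event parameter roughly like this. The helper name `eventText` and its `wantUnicode` flag are assumptions for the sketch; only the idea that char and paste carry a UTF-8 value after the usual one comes from the description above.

```lua
-- Pick the text carried by a "char" or "paste" event.
-- `latin` is the existing parameter, `utf` the additional UTF-8 one.
local function eventText(wantUnicode, event, latin, utf)
  if event ~= "char" and event ~= "paste" then
    return nil
  end
  -- Fall back to the existing parameter when no Unicode value was sent.
  if wantUnicode and utf ~= nil then
    return utf
  end
  return latin
end

-- Inside a ComputerCraft program this might be called as:
--   local text = eventText(true, os.pullEvent())
-- and, per the description above, read would take the flag as its 5th argument:
--   local line = read(nil, nil, nil, nil, true)

-- Simulated events for running the sketch outside the game:
print(eventText(true, "char", "?", "\230\151\165"))
print(eventText(false, "char", "a", "a"))
```

Programs that ignore the extra parameter keep working unchanged, which is the appeal of extending the existing events rather than adding new ones.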

Roadmap

- Drafting APIs
- Confirming API designs
- Writing test scripts <-- send help - I'm not familiar with designing test cases

Edit

2023-07-12: merge separated unicode modules/functions back to their normal variant to reduce code duplication hell.

9551-Dev · 2023-07-09T14:33:08Z

tbh I do not believe this has any chance of being merged into CC:T. Best of luck, but I feel like a lot of features like this have been rejected before because they would conflict with the mod's "feel".

SammyForReal · 2023-07-09T14:43:58Z

I really hope that this can find some kind of compromise, because having Unicode support for a terminal kinda is something you'd expect. And it would make stuff so much easier.

Andrew-71 · 2023-07-09T15:36:48Z

Personally fully in support of such an addition. This is obviously an enormous change, but I believe the current status quo of only having Latin plus a few extra chars creates a pointless barrier of entry for people from different cultures and is not the way to go forward, even if this takes a while to polish. I think the potential to slightly "break the feel", as dev1955 pointed out, is worth it to allow people unfamiliar with English to use the mod.

9551-Dev · 2023-07-09T15:43:58Z

> Personally fully in support of such an addition. This is obviously an enormous change, but I believe the current status quo of only having Latin plus a few extra chars creates a pointless barrier of entry for people from different cultures and is not the way to go forward, even if this takes a while to polish. I think the potential to slightly "break the feel", as dev1955 pointed out, is worth it to allow people unfamiliar with English to use the mod.

I don't think the ROM even supports multiple languages, so making it more accessible for newbies sounds kinda pointless.
Also, you still have to use Latin for Lua.

I'm not against this change, as I use older versions anyway, but this still feels too drastic.

Andrew-71 · 2023-07-09T15:47:09Z

> I don't think the ROM even supports multiple languages, so making it more accessible for newbies sounds kinda pointless.
> Also, you still have to use Latin for Lua.

I guess I worded this a bit poorly. I didn't mean that newbies would be able to set the ROM to their language or code in it, but that non-technical users could interact, in their native tongue, with CC programs others have made, e.g. a shop or a dashboard.

MCJack123 · 2023-07-09T22:46:22Z

O_o That's a lot of new functions to add and import...

SquidDev · 2023-07-11T10:53:44Z

Thank you for looking into this. I realise this is a bit of a pain, but I think it probably makes sense to do this work in two stages/two PRs:

1. Internal changes (rendering, Terminal, anything which isn't exposed to user-code): This is quite a complex change in itself, but I think what needs doing is (relatively) well understood. A couple of quick comments on what's here already:
   - We need to decide on what Unicode features to support. My feeling is we should support double-wide/full-width forms, but not RTL or multi-codepoint graphemes (so anything involving ZWJ). That's consistent with what Minecraft does (and supporting RTL in a terminal seems a little terrifying).
   - It might be worth looking at alternative fonts to Minecraft's. As you mentioned in #860, the glyph sizes are different, so things may look a little weird. We might be able to do something using a mixture of X11's font (they have a 6x9 variant), Monocraft and one of Bitfont's, with Unifont as a fallback.
     It is going to be hard to maintain consistency with CC's existing font (after all, there are very few CJK characters which are recognisable at a 6x9 resolution!), but we can hopefully reduce the dissonance.
2. User-facing changes (so TermMethods/window, then expanding out from there): This is going to be much harder to get right, and I suspect will take several iterations.
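Supporting double-wide forms means the terminal has to know how many cells a codepoint occupies. The sketch below shows one way this could look in Lua; the function names are hypothetical, the range table is a hand-picked subset of East Asian wide and fullwidth blocks rather than a complete width table, and combining marks and ZWJ sequences are deliberately not handled.

```lua
-- Decode a (valid) UTF-8 string into a list of codepoint numbers.
local function codepoints(s)
  local cps = {}
  for ch in s:gmatch("[^\128-\191][\128-\191]*") do
    local b = ch:byte(1)
    local cp = b
    if b >= 0xF0 then cp = b - 0xF0
    elseif b >= 0xE0 then cp = b - 0xE0
    elseif b >= 0xC0 then cp = b - 0xC0 end
    for i = 2, #ch do
      cp = cp * 64 + (ch:byte(i) - 0x80)
    end
    cps[#cps + 1] = cp
  end
  return cps
end

-- Abbreviated list of ranges treated as two cells wide (an assumption, not exhaustive).
local WIDE = {
  { 0x1100, 0x115F }, -- Hangul Jamo
  { 0x2E80, 0xA4CF }, -- CJK radicals through Yi
  { 0xAC00, 0xD7A3 }, -- Hangul syllables
  { 0xF900, 0xFAFF }, -- CJK compatibility ideographs
  { 0xFF00, 0xFF60 }, -- fullwidth forms
  { 0xFFE0, 0xFFE6 }, -- fullwidth signs
}

local function charWidth(cp)
  for _, range in ipairs(WIDE) do
    if cp >= range[1] and cp <= range[2] then return 2 end
  end
  return 1
end

-- Number of terminal cells a string would occupy.
local function displayWidth(s)
  local width = 0
  for _, cp in ipairs(codepoints(s)) do
    width = width + charWidth(cp)
  end
  return width
end

print(displayWidth("a\230\151\165"))  -- "a" plus U+65E5
```

Cursor movement, line wrapping and blit colour strings would all need to agree on this cell count, which is part of why the internal change is non-trivial.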

I think the current implementation confirms my worst fears from #860 (comment), in that you end up duplicating a tonne of code.

It might be worthwhile putting together another version of the Lua-side changes based on a cut-down version of Jack's approach (#860 (comment)). If we convert UTFString to a Java-side userdata type, it should be possible to handle this type inside the Java methods too.

I'm not entirely sure this is the right option either. There are still some questions about how we want to handle receiving strings (both in return values, and from events - I'm not convinced duplicating events is very nice). Like I said, this is going to require some iterations to get right :).

cvrunmin · 2023-07-11T15:03:08Z

Quick response to point 1:

> We need to decide on what Unicode features to support. My feeling is we should support double-wide/full-width forms, but not RTL or multi-codepoint graphemes (so anything involving ZWJ). That's consistent with what Minecraft does (and supporting RTL in a terminal seems a little terrifying).

Agreed that we should support double-wide forms. Multi-codepoint graphemes complicate things a lot, especially when trimming strings, so we could leave them out for now.
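The trimming problem can be shown with a short Lua sketch. The function names are made up for the example: cutting by bytes can split a codepoint, cutting by codepoints fixes that, but a grapheme built from several codepoints can still be split.

```lua
-- Naive truncation: cuts at a byte offset and may split a multi-byte sequence.
local function truncateBytes(s, n)
  return s:sub(1, n)
end

-- Truncate to at most n codepoints (input assumed to be valid UTF-8).
local function truncateCodepoints(s, n)
  local out, count = {}, 0
  for ch in s:gmatch("[^\128-\191][\128-\191]*") do
    if count == n then break end
    count = count + 1
    out[count] = ch
  end
  return table.concat(out)
end

local precomposed = "\195\169x"   -- U+00E9 followed by "x"
local combining = "e\204\129x"    -- "e" + U+0301 combining acute, then "x"

-- Byte truncation leaves a dangling lead byte:
print(#truncateBytes(precomposed, 1))
-- Codepoint truncation keeps the whole character:
print(#truncateCodepoints(precomposed, 1))
-- But a two-codepoint grapheme still loses its accent:
print(truncateCodepoints(combining, 1))
```

Handling the last case correctly requires grapheme cluster segmentation, which is the part being left out for now.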

As for the cloning hell in point 2, it arises because we cannot distinguish when a Latin-1 string is expected from when a UTF-8 string is expected. I believe a UTFString userdata could help eliminate most, if not all, of the duplication. However, it would also need a revamp of CC:T so that it supports userdata. For example, CobaltLuaMachine:toObject completely ignores the userdata type.
Still, it should be better than something like this:

term.setUtf(true)
term.write(...)
term.setUtf(false)

that could ruin the subsequent calls if someone forgets to restore state.
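The state-restoration hazard is easy to reproduce with a stub. In the sketch below, `term` is a local stand-in table and `setUtf` is the hypothetical method from the snippet above, not a real CC:T API; `withUtf` is likewise an invented helper showing what callers would need to do to be safe.

```lua
-- Stub terminal for the sketch; records what would have been written.
local term = { utf = false, log = {} }

function term.setUtf(flag)
  term.utf = flag
end

function term.write(text)
  if type(text) ~= "string" then
    error("expected string", 2)
  end
  term.log[#term.log + 1] = (term.utf and "utf8:" or "latin1:") .. text
end

-- Run fn with UTF mode on, restoring the previous mode even if fn errors.
local function withUtf(fn, ...)
  local previous = term.utf
  term.setUtf(true)
  local ok, err = pcall(fn, ...)
  term.setUtf(previous)
  if not ok then
    error(err, 0)
  end
end

withUtf(term.write, "h\195\169")
local ok = pcall(withUtf, term.write, nil)  -- write fails, mode is still restored

print(term.log[1], ok, term.utf)
```

Every caller would have to remember a wrapper like this, which is the argument for carrying the encoding in the value's type instead of in terminal state.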

For the duplicated-event issue, would it be better if we sent two parameters for char and paste, one normal string and one UTFString? The ambiguity for codepoints 128-255, namely whether they should be sent with Latin-1 or UTF-8 encoding, concerns me, and sending both would put my mind at ease.

MCJack123 · 2023-07-11T17:44:07Z

> I'm not entirely sure this is the right option either - there's still some questions about how we want to handle receiving strings (both in return values, and from events - I'm not convinced duplicating events is very nice).

FWIW, my approach for events in my test was to use the same event name, but adding a second parameter with the UTFString variant. This allows programs that are Unicode-aware to grab the Unicode version without needing to duplicate events.

I didn't do a good job at describing every change I made in my test back then, but you could take a look at the ROM patches for inspiration. I still think this is the most elegant solution, even if it ends up adding a bunch of extra Unicode options to functions.

Also, I don't really like the idea of making people interact with the UTF-8 representations of strings directly. It's really easy to slip up and end up putting it into a normal string function, which would destroy the codepoints and/or not function correctly. It's good that there's a usermode library to help, but IMO we shouldn't be exposing the raw encoding data to users unless they specifically ask for it (e.g. an encode method/serialization).


cvrunmin added 6 commits July 5, 2023 21:35

- unicode support (WIP) (addc8e6)
- use quadtree-like struct for font texture node (a43d963)
- Merge branch 'cc-tweaked:mc-1.19.x' into unicode-proto (1739f2f)
- fix font texture resizing and node insertion of fullwidth char (cd75c1f)
- real unicode string pattern matching (71a4ac2)
- monitor rendering and more (a07a1b0)

cvrunmin added 6 commits July 12, 2023 22:46

- those diverged will eventually reunited (cf46687)
  merge separated unicode modules/functions back to their normal variant.
- use vanilla font system for terminal (abf5c4b)
- fix lua script (2ed74cd)
- Merge branch 'cc-tweaked:mc-1.19.x' into unicode-proto (0762afe)
- full-width support and unicode suppl char (bf9b982)
- utfstring in java (fd0918e)
