Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype: unicode string support #1517

Draft
wants to merge 12 commits into
base: mc-1.19.x
Choose a base branch
from

Conversation

cvrunmin
Copy link

@cvrunmin cvrunmin commented Jul 9, 2023

This aims to address the issue #860 about reading and writing unicode character into terminals.

This pull request mostly adapt the first route in the discussion ("separate versions of methods for unicode"). (Edit: no longer valid since the commit at 12 July.)

Additions

  1. utflib api

    This api provides a UTFString "class" that wraps a utf8-encoded byte string and act as a normal string. Functions that are provided in the standard string library, except string.dump, string.pack, string.packsize, string.unpack, are also provided in UTFString. Users can use this to adapt unicode strings into their old system painlessly. If users want to get Latin-1 string from UTFString, they can use UTFString:toLatin(). Otherwise tostring will return the backend byte string.

    Besides UTFString, the module also exports the following functions:

    1. fromLatin(str): consider the string as fully Latin-1 and convert it into utf8. Such function is provided as UTFString(str) will consider the string as already utf8-encoded, and only consider invalid byte subsequences as Latin1-encoded and convert them.
    2. isUTFString(v): return true if v is a UTFString.
    3. wrapStr(str): wrap a lua string so that normal string can be compared with unicode string.
    4. isStringWrapper(v): return true if v is a string wrapper from wrapStr(str)
  2. shell.unicode, edit.unicode, lua.unicode settings

    New settings allows shell, edit and lua programs to receive and print unicode strings. Such settings will not affect other programs, especially user-defined programs.

Changes

  1. Terminal Font Texture now supports unicode characters by dynamically baking them. Currently it simply uses GNU Unifont provided in vanilla Minecraft (also known as Legacy Unicode, uniform, "Force Unicode Font" options)
    • Monitor in TBO drawing mode currently does not support it as it involves changing shader code.
  2. TermMethods.write and TermMethods.blit functions

    Now they accepts UTFString properly, and do not write "table: 0x??????" on screen.
    • Current value check UTFString via duck test, which do not look really nice, and seems to have strong performance impact on edit program when unicode text presents.
  3. char and paste events

    Now it also send utf8-encoded string as the second parameter
  4. read function

    Now it accepts _bReadUnicode as the 5th argument, indicating whether it should take UTFString when true, or a normal string otherwise.

Roadmap

  • Drafting apis
  • Confirming api designs
  • Writing test script <-- send help - I'm not familiar with designing test cases

Edit

2023-07-12: merge separated unicode modules/functions back to their normal variant to reduce code duplication hell.

@9551-Dev
Copy link
Contributor

9551-Dev commented Jul 9, 2023

tbh i do not believe that this has any chance of being merged into CC:T, best of luck but i feel like a lot of features like this have been rejected before because it would conflict with the mods "feel"

@SammyForReal
Copy link
Contributor

I really hope that this can find some kind of compromise, because having Unicode support for a terminal kinda is something you'd expect. And it would make stuff so much easier.

@Andrew-71
Copy link
Contributor

Andrew-71 commented Jul 9, 2023

Personally fully in support of such addition. This is obviously an enromous change, but I believe the current status quo of only having latin + a few extra chars creates a pointless barrier of entry for people from different cultures and is not the way to go forward, even if this takes a while to polish out. I think the potential to slightly "break the feel", as dev1955 pointed out, is worth it to allow people unfamiliar with english to use the mod.

@9551-Dev
Copy link
Contributor

9551-Dev commented Jul 9, 2023

Personally fully in support of such addition. This is obviously an enromous change, but I believe the current status quo of only having latin + a few extra chars creates a pointless barrier of entry for people from different cultures and is not the way to go forward, even if this takes a while to polish out. I think the potential to slightly "break the feel", as dev1955 pointed out, is worth it to allow people unfamiliar with english to use the mod.

I dont think the rom even supports multiple languages so it sounds kinda pointless in making it more accessible for newbies
Also you still have to use Latin for Lua

Im not against this change as i use older versions anyway but this still feels too drastic

@Andrew-71
Copy link
Contributor

I dont think the rom even supports multiple languages so it sounds kinda pointless in making it more accessible for newbies
Also you still have to use Latin for Lua

I guess I worded this a bit poorly, I meant not that newbies would be able to set ROM to their language or code in it, but that non-technical users could potentially interact with CC programs others made in their native tongue e.g. a shop or a dashboard

@MCJack123
Copy link
Contributor

O_o That's a lot of new functions to add and import...

@SquidDev
Copy link
Member

Thank you for looking into this. I realise this is a bit of a pain, but I think it probably makes sense to do this work in two stages/two PRs:

  • Internal changes (rendering, Terminal, anything which isn't exposed to user-code): This is quite a complex change in itself, but I think what needs doing is (relatively) well understood. A couple of quick comments on what's here already:

    • We need to decide on what unicode features to support. I think my feeling is we should support double-wide/full-width forms, but not RTL or multi-codepoint graphemes (so anything involving ZWJ). That consistent with what Minecraft does (and supporting RTL in a terminal seems pretty a little terrifying).

    • It might be worth looking at alternative fonts to Minecraft's - as you mentioned in How to provide Unicode character set support for CCT? #860, the glyph sizes are different, so things may look a little weird. We might be able to do something using a mixture of X11's font (they have a 6x9 variant), Monocraft and one of Bitfont's, with Unifont as a fallback.

    It is going to be hard to maintain consistency with CC's existing font (after all, there's very few CJK characters which are recognisable at a 6x9 resolution!), but we can hopefully reduce the dissonance.

  • User-facing changes (so TermMethods/window, then expanding out from there): This is going to be much harder to get right, and I suspect take several iterations.

    I think the current implementation confirms my worst fears from How to provide Unicode character set support for CCT? #860 (comment), in that you end up duplicating a tonne of code.

    It might be worthwhile putting together another version of the Lua-side changes based on a cut-down version of Jack's approach (How to provide Unicode character set support for CCT? #860 (comment)). If we convert UTFString to a Java-side userdata type, it should be possible to handle this type inside the Java methods too.

    I'm not entirely sure this is the right option either - there's still some questions about how we want to handle receiving strings (both in return values, and from events - I'm not convinced duplicating events is very nice). Like I said, this is going to require some iterations to get right :).

@cvrunmin
Copy link
Author

Quick response of point 1:

We need to decide on what unicode features to support. I think my feeling is we should support double-wide/full-width forms, but not RTL or multi-codepoint graphemes (so anything involving ZWJ). That consistent with what Minecraft does (and supporting RTL in a terminal seems pretty a little terrifying).
Agree that we should support double-wide forms. Multi-codepoint graphemes screw things so much especially in trimming strings, so we could forget it for now.

As for the cloning hell in point 2, it is because we cannot distinguish when we expect a latin1 string and when we expect a utf8 string. I believe using UTFString userdata could help eliminating most, if not all of the duplication. However it also need a revamp on CC:T so that it support userdata. For example CobaltLuaMachine:toObject completely ignores userdata type.
Still it should be better than making something like this:

term.setUtf(true)
term.write(...)
term.setUtf(false)

that could ruin the subsequent calls if someone forgets to restore state.

For duplicating event issue, would it be better if we send two params for char and paste, one normal string and one UTFString? The ambiguity on codepoint 128-255, whether is should be sent with Latin-1 or UTF-8 encoding concerns me and sending both of them comforts me.

@MCJack123
Copy link
Contributor

MCJack123 commented Jul 11, 2023

I'm not entirely sure this is the right option either - there's still some questions about how we want to handle receiving strings (both in return values, and from events - I'm not convinced duplicating events is very nice).

FWIW, my approach for events in my test was to use the same event name, but adding a second parameter with the UTFString variant. This allows programs that are Unicode-aware to grab the Unicode version without needing to duplicate events.

I didn't do a good job at describing every change I made in my test back then, but you could take a look at the ROM patches for inspiration. I still think this is the most elegant solution, even if it ends up adding a bunch of extra Unicode options to functions.

Also, I don't really like the idea of making people interact with the UTF-8 representations of strings directly. It's really easy to slip up and end up putting it into a normal string function, which would destroy the codepoints and/or not function correctly. It's good that there's a usermode library to help, but IMO we shouldn't be exposing the raw encoding data to users unless they specifically ask for it (e.g. an encode method/serialization).

@SquidDev SquidDev added enhancement An extension of a feature or a new feature. area-Documentation Improving/fixing documentation area-CraftOS This affects CraftOS, or any other Lua portions of the mod. area-Core This affects CC's core (the Lua runtime, APIs, computer internals). area-Minecraft This affects CC's Minecraft-specific content. and removed area-Documentation Improving/fixing documentation labels Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-Core This affects CC's core (the Lua runtime, APIs, computer internals). area-CraftOS This affects CraftOS, or any other Lua portions of the mod. area-Minecraft This affects CC's Minecraft-specific content. enhancement An extension of a feature or a new feature.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants