
Improved Unicode character width support #949

Open · cgull wants to merge 4 commits into master
Conversation

@cgull (Member) commented Dec 6, 2017

This is a first draft of flexible Unicode character width handling for Mosh. It's not complete, and I'd love to get some comments on it.

There are two parts to this:

  • Mosh itself gets Unicode tables, a chwidth() function to replace wcwidth(), and code to load Unicode tables or partial overlays in mosh-client and transmit them to mosh-server.

  • Code to generate new Unicode tables is in src/unicode. This only needs to be run when a new version of Unicode is released, and it is not part of Mosh's normal build infrastructure. A developer will need to run it and commit the resulting changed table roughly once a year, following Unicode.org's release schedule. This code was also used to generate the tables in this pull request, with a caveat noted below. We depend on Google's libapps, which has JavaScript code (used in hterm) to generate character width tables. (Many thanks to @vapier for doing the hard work of determining what a Unicode widths table for terminals should be in the first place, and for taking a small change that allows us to stand on his shoulders.)

How This Works

This code adds two fixed Unicode tables to Mosh: a reference table, which will never change after its initial introduction to Mosh, and a default table, which will be updated with each new Unicode release. Additionally, the user can overlay the default table with changes for some characters (like making East Asian Width Ambiguous characters wide instead of narrow), or replace the default table entirely with a complete table.

At startup, mosh-client creates a working table that is a combination of the default table, and whatever overlay/table the user has loaded. It uses this to determine character widths locally. It also compares the reference table and the working table to create an overlay with the difference between them. It sends this overlay to mosh-server, which applies the overlay to its copy of the reference table to create a working table that is the same as the one on the client. This overlay table is compressed before being added to a Message, and then the entire Message is compressed before being sent to the server. This double compression results in a very small growth in that initial Message.
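For concreteness, here is a minimal sketch of that overlay round trip, assuming the flat one-byte-per-code-point format described below (with '=' meaning "take this entry from the base table"); the function names are illustrative, not the actual API in this branch:

```cpp
#include <string>

// Apply an overlay on top of a base table to produce a working table.
std::string apply_overlay( const std::string &base, const std::string &overlay )
{
  std::string working = base;
  for ( size_t cp = 0; cp < overlay.size() && cp < working.size(); cp++ ) {
    if ( overlay[cp] != '=' ) {
      working[cp] = overlay[cp]; /* '0', '1', '2', or '-' */
    }
  }
  return working;
}

// Diff a working table against the reference table, producing the overlay
// that mosh-client would send to mosh-server.
std::string make_overlay( const std::string &reference, const std::string &working )
{
  std::string overlay( working.size(), '=' );
  for ( size_t cp = 0; cp < working.size(); cp++ ) {
    if ( cp >= reference.size() || working[cp] != reference[cp] ) {
      overlay[cp] = working[cp];
    }
  }
  return overlay;
}
```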

My plan is that at initial release, the reference table will be generated from the Unicode 10.0.0 data files, and until Unicode 11 is released, the default table will actually be exactly the same. But for development and illustration, currently the reference table is Unicode 9.0 and the default table is 10.0.0. For this pairing, the initial client-to-server message only grows about 43 bytes with the addition of the overlay table. Since the Unicode organization keeps adding emoji, this differential will grow, but my hope is that it will still remain below the size of a Mosh-MTU packet for quite a while.

In this initial implementation, the in-core tables, the messages from client to server, and the user's custom files all use exactly the same format: a string of 1114112 bytes or fewer, one character for each Unicode code point. That character may be '0', '1', or '2' to represent a character width, '-' to represent an illegal code point, or (in an overlay table) '=' to mean "take this character from the base table". Nothing requires any of these objects to use this format, or the same format as the others. It is a trivial format to parse for file input, and its extreme simplicity is amenable to being compressed twice by zlib. But I do think we need to come up with something better for the fixed tables stored in the executable and the working table constructed at runtime-- perhaps a list of runs for the fixed tables, and a two-level table for the runtime lookup (as many wcwidth implementations do).
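A width lookup against such a flat working table could be as simple as this sketch (the real chwidth() in this branch may be organized differently):

```cpp
#include <cstdint>
#include <string>

// Look up the width of a code point in a flat working table.
// Returns -1 for illegal code points or anything outside the table.
int chwidth( const std::string &table, uint32_t codepoint )
{
  if ( codepoint >= table.size() ) {
    return -1;
  }
  switch ( table[codepoint] ) {
  case '0': return 0;  /* zero-width (combining marks, etc.) */
  case '1': return 1;
  case '2': return 2;  /* wide (East Asian Wide/Fullwidth) */
  default:  return -1; /* '-' marks an illegal code point */
  }
}
```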

Problems I think this helps solve:

  • The problem of mismatching Unicode width maps from differing wcwidth() implementations on client and server, and mismatching with the terminal emulator's width map.
  • Adding up to date width maps, and keeping them up to date.
  • Going forward, having up-to-date width maps even on old servers with old distros and binaries.
  • For older, already-existing versions of mosh-server with whatever width map the system gave them, it's possible to configure a client with a table that matches.
  • Adding a runtime switch for East Asian Ambiguous width characters.
  • User configurability for private use area characters (Powerline).
  • SOFT HYPHEN. Some terminal emulators print them, some don't.
  • Small/old systems without locale and/or Unicode support. They can just send a map that only covers the first 256 code points (ISO 8859-1).
  • Since there's no standard width map for character terminals, whatever we do will be wrong for somebody-- but users can load whatever works for them.
  • Integrated Mosh clients like Blink or Mosh-for-Chrome can load a table that exactly matches their terminal emulator's table.

Issues:

  • Documentation and comments in the source code are a bit thin.
  • No utilities yet to merge/delta chwidth table files (I have some Perl that I want to convert to Python).
  • I envision an --eaw-is-wide flag and/or automatic detection from locale variables in mosh, but I haven't coded that up yet.
  • With a little more work (basically adding a utf8_to_utf32() and utf32_to_utf8(); a sketch follows this list) we can eliminate all of Mosh's dependencies on libc locale code. This would improve portability, and allow mosh-server to merely warn about locale/charset issues on startup instead of terminating with an error. It would also allow ripping out some of the cruft that works around slow libc locale handling.
  • No effort at size or time optimization yet. The binaries bloat from 300KB to 7MB on my Mac.
  • src/unicode/Makefile.am is a barely-working, half-broken mess. That functionality needs to be a bit better integrated into autoconf/automake too, and I'm not sure how the Git submodule should be handled.
  • We need to define exactly what the reference and default chwidth tables should be.
  • Figuring out how this might integrate with Blink, Mosh-for-Chrome, etc. would be nice. @rpwoodbu, @carloscabanero, your comments will be greatly appreciated.
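To illustrate how small that utf8_to_utf32() piece is, here is a rough sketch of the kind of helper meant above; a real implementation would reject overlong and otherwise malformed sequences instead of glossing over them:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 string into a vector of code points.  Error handling for
// malformed input is deliberately minimal in this sketch.
std::vector<uint32_t> utf8_to_utf32( const std::string &in )
{
  std::vector<uint32_t> out;
  size_t i = 0;
  while ( i < in.size() ) {
    unsigned char lead = in[i];
    uint32_t cp;
    size_t len;
    if ( lead < 0x80 )      { cp = lead;        len = 1; } /* ASCII */
    else if ( lead < 0xc0 ) { i++; continue; }             /* stray continuation byte: skip */
    else if ( lead < 0xe0 ) { cp = lead & 0x1f; len = 2; }
    else if ( lead < 0xf0 ) { cp = lead & 0x0f; len = 3; }
    else                    { cp = lead & 0x07; len = 4; }
    for ( size_t k = 1; k < len && i + k < in.size(); k++ ) {
      cp = ( cp << 6 ) | ( in[i + k] & 0x3f );
    }
    out.push_back( cp );
    i += len;
  }
  return out;
}
```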

Requests:

  • Comments on what people need from Mosh to make Unicode work better for them.
  • Comments on the design/implementation of this pile of stuff.
  • Testing! I've barely used this in any kind of Unicode-heavy environment. Remember, you will need this on both client and server, and this is very much experimental-- this functionality is guaranteed to break on or before merge to Mosh master.
  • Comment from other developers in the Mosh ecosystem who use our code: @rpwoodbu, @carloscabanero and anyone else interested.

This code adds a system-independent Unicode widths table to Mosh, and
adds a scheme for the client to propagate local configuration to the
server.
This brings in Google libapps as a Git submodule.
@andersk (Member) commented Dec 6, 2017

Data structure suggestion for low space usage: a sorted array of (min codepoint, chwidth) pairs, where each entry represents the half-open interval from its codepoint to the next entry’s codepoint, would have just 1835 entries presently. It can be queried in logarithmic time with binary search, diffed by sorted set subtraction, and patched by sorted merging.
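For concreteness, a sketch of that structure (not code from this branch); a lookup stays a single binary search:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One run of identically-sized characters: the half-open range from
// `first` up to the next entry's `first`.
struct WidthRun {
  uint32_t first; /* first code point of the run */
  int8_t width;   /* 0, 1, 2, or -1 for illegal */
};

int lookup_width( const std::vector<WidthRun> &runs, uint32_t codepoint )
{
  // Find the first run starting after the code point, then step back one.
  auto it = std::upper_bound( runs.begin(), runs.end(), codepoint,
                              []( uint32_t cp, const WidthRun &r ) { return cp < r.first; } );
  if ( it == runs.begin() ) {
    return -1; /* below the first run */
  }
  return ( --it )->width;
}
```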

Are we planning to do anything to mitigate terminal desynchronization on wide characters that might now be supported by Mosh but not the terminal?

@cgull (Member, Author) commented Dec 6, 2017

@andersk: yes, that's a fine candidate for the fixed tables and the file storage. However, when I did my performance work, I plugged in the Markus Kuhn wcwidth() implementation to see how it did. That's a straightforward binary-search implementation on a list of [base, width] as I recall. It was relatively slow (though I don't remember exact details), so I doubt it's the right choice for chwidth(). I see some implementation/benchmarking of alternatives in my near future.

I haven't thought at all about desynchronization. As I see it, you'd either have to reposition the cursor on pretty much every cell, or you'd need to maintain a map of codepoints that you think might be desynchronized (say, every codepoint that's not in Unicode 3.0), and reposition only after those. My thinking more leans towards getting the Unicode width mapping up to date, assuming that newer characters that the terminal doesn't support are relatively rare, and giving the user a way to set the terminal's exact mapping.

Speaking of which, there's an opportunity to coordinate between Mosh and terminal emulators on this problem. If the terminal emulator could pass a widths table to mosh-client, it could DTRT. How, though? I think that blob is larger than you'd want to stuff into an environment variable.

@keithw has mused on the idea of coordination between various members of the character-cell-terminal community (ncurses, emacs, screen, tmux, mosh, terminal emulators, etc) before.

@cgull (Member, Author) commented Dec 6, 2017

I misremembered-- the Markus Kuhn wcwidth() is a complicated conditional that usually does binary searches on two tables and (for CJK and other high characters) executes a conditional with about 20 comparisons. A binary search on a single table of (base-key,value) pairs is definitely better than that.

And maybe we can do even better than binary search: an optimal binary search tree, or some heuristic approximation of one, might be useful. The heuristic might be something like:

  • ASCII printables: weight 1
  • CJK core glyphs: weight 2
  • ISO 8859-1 printables: weight 3

This might not reduce comparisons enough to be significantly better than simple binary search, though.
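One cheap approximation of that weighting, purely for illustration, is to special-case the hottest ranges with direct comparisons before falling back to whatever general lookup we settle on (lookup_width() below stands in for that general search, e.g. the run-table binary search suggested above):

```cpp
#include <cstdint>

// General lookup (e.g. the binary search over runs sketched earlier);
// declared here only so this illustrative fast path is self-contained.
int lookup_width( uint32_t cp );

// Hypothetical fast path checking the hottest ranges before the general search.
int fast_chwidth( uint32_t cp )
{
  if ( cp >= 0x20 && cp < 0x7f )     return 1; /* ASCII printables (weight 1) */
  if ( cp >= 0x4e00 && cp < 0xa000 ) return 2; /* CJK Unified Ideographs (weight 2) */
  if ( cp >= 0xa0 && cp < 0x100 )    return 1; /* ISO 8859-1 printables (weight 3) */
  return lookup_width( cp );                   /* everything else */
}
```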

@keithw (Member) commented Dec 7, 2017

This is definitely one of the top complaints about Mosh today, so, thank you x1000 for taking this on.

I wonder how you might feel about simplifying this slightly in a way that removes the protocol and negotiation parts, at some cost to correctness but with a benefit in simplicity and predictability (and a smaller protocol support burden). What would be your views on this kind of "dumb" design?

  1. We make our own wcwidth table (just like what you have now) and update it on a timely basis. We stop using the libc wcwidth.

  2. mosh-client and mosh-server ship with a version of the table compiled in.

  3. Optionally, the user can give a command-line argument to mosh-client and mosh-server to make them read a substitute table out of the filesystem.

  4. We supply users with a script that generates the width table from their own local terminal, probably by just printing every Unicode scalar value to the screen and measuring how many columns the cursor advances.

This seems to solve 95% of the problems that users have today, and even fixes the problem of not knowing what the local terminal is going to do (because the user can reverse-engineer a width table out of their local terminal and then use it in mosh-client and mosh-server if they want). It avoids having to come up with a communications format for the width table, having to negotiate width tables between clients and servers, or having to standardize on a reference width table that we would be required to honor forever as part of the protocol. I guess I'm asking if you think the incremental benefit supplied by that part is going to be worth the burden, or if we can get away with the worse-is-better approach here.
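For point 4, the measurement itself could be roughly like the sketch below (not an actual Mosh script): print the character at column 1, request a cursor position report with CSI 6n, and take the reported column minus one as the width. It assumes a VT100-compatible terminal whose input has already been put into raw (non-canonical, no-echo) mode, UTF-8 output, and it skips most error handling:

```cpp
#include <cstdint>
#include <cstdio>
#include <unistd.h>

// Probe how many columns the terminal advances for one code point.
// Assumes stdin is already in raw mode and the terminal answers CSI 6n.
int probe_width( uint32_t cp )
{
  // Encode the code point as UTF-8.
  char utf8[5] = { 0 };
  if ( cp < 0x80 ) {
    utf8[0] = (char) cp;
  } else if ( cp < 0x800 ) {
    utf8[0] = (char)( 0xc0 | ( cp >> 6 ) );
    utf8[1] = (char)( 0x80 | ( cp & 0x3f ) );
  } else if ( cp < 0x10000 ) {
    utf8[0] = (char)( 0xe0 | ( cp >> 12 ) );
    utf8[1] = (char)( 0x80 | ( ( cp >> 6 ) & 0x3f ) );
    utf8[2] = (char)( 0x80 | ( cp & 0x3f ) );
  } else {
    utf8[0] = (char)( 0xf0 | ( cp >> 18 ) );
    utf8[1] = (char)( 0x80 | ( ( cp >> 12 ) & 0x3f ) );
    utf8[2] = (char)( 0x80 | ( ( cp >> 6 ) & 0x3f ) );
    utf8[3] = (char)( 0x80 | ( cp & 0x3f ) );
  }

  // Go to column 1, print the character, and ask where the cursor is.
  printf( "\r%s\033[6n", utf8 );
  fflush( stdout );

  // The reply looks like ESC [ row ; col R.
  char reply[32] = { 0 };
  ssize_t n = read( STDIN_FILENO, reply, sizeof( reply ) - 1 );
  int row = 0, col = 0;
  if ( n <= 0 || sscanf( reply, "\033[%d;%dR", &row, &col ) != 2 ) {
    return -1;
  }
  return col - 1;
}
```

Iterating that over every scalar value (clearing the line between probes) would yield exactly the kind of per-code-point table discussed in this PR.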

@keithw (Member) commented Dec 7, 2017

We could even host the latest Unicode width table at a well-known location (mosh.org/something) and tell users to wget it from us if they want their mosh to support the latest fall emoji...

@cgull (Member, Author) commented Dec 7, 2017

I don't see all that much protocol burden, and there is no actual negotiation. You are aware that a property of protobufs is that the receiver ignores (but correctly skips) unknown fields, right? In this code, a new client always sends the table as new protobuf fields, and a new server knows about the fields and uses them. An old client + new server results in each side using its own table (no worse than existing mosh behavior), and a new client + old server results in the server ignoring the unknown protobuf fields, with each side again using its own table (also no worse than existing mosh behavior).

(Side note on protobufs: If you construct two .proto files that use disjoint ranges of IDs, you can put two different messages in the same bytestream, feed it to the two parsers for each message, and each parser will correctly read its own message. I've contemplated this as a way to add independent new messages into our existing protocol.)

Apart from src/unicode and the tables in src/utils/chwidth_tables, this adds up to around 300 lines of new text in Mosh. File and option handling is a significant part of that (in mosh.pl and mosh-client.cc). Having mosh-server load from a file with an option might actually be more complex code-wise than sending a message to the server, and it is certainly more complex for the user.

One thing I really want to address with this is version skew between client and server. It's fairly common that a user is running a fairly current version of the client, but an older server, because they're using an enterprise distro or in-house distro. If we do the simple approach, then we still have a problem of mismatched width tables for the naive user. This approach ensures that even the naive user will always have matching tables within Mosh, as long as their client and server both have this feature.

It also allows a client to automagically send a width table matching the terminal, if it knows enough about the terminal.

My biggest concern with this implementation is that in cases where the user constructs a map that diverges in a complex way from the reference table, it might be too large for a single UDP datagram. Also, the client would now send bigger datagrams than it previously has (I'd guess it's currently rare for a client message to contain more than 100 bytes of data, even in situations where many User states back up). The most divergent overlay I've constructed so far is an ISO-8859-only map, where the first 256 codepoints are unaltered, and the entire rest of the Unicode codespace is blanked out with '-'. The diff generated from the reference table was around +1450 bytes (but there is a way to work around that particular case-- send a complete map instead of a diff, if the resulting object after compression is smaller). We could also set the client's outgoing MTU limit to something lower to help avoid VPN and other truncated-MTU issues in that direction.

As for your point 4, yes, this implementation can do that. I actually have some scripts for that which didn't make it into this PR.

Given the simple and compatible extension of the protobuf message, I thought all this was worth it. But you have more knowledge about Mosh usage/implementations than I do. Do you see a general problem or specific Mosh implementation that might get unhappy with this feature's implementation? There are at least two Mosh reimplementations that we know of. But they're both client-only, and since the server does nothing different here, they should be unaffected until they decide to add this feature. Are there any Mosh server reimplementations?

@keithw (Member) commented Dec 8, 2017

I'm inclined to defer to you here.

The things that make me uncomfortable about making protocol changes are that we're sort of at the mercy of the Unicode committee, and that it sounds like we end up with an eternal dependency on (and canonization of) Unicode 10, and then have to send a diff between that and whatever crazy version of Unicode gets deployed every year, for eternity. If that diff gets big in 6 years, we could be unhappy. (Or have to invent a flow-control scheme...)

Furthermore, as you say a user could construct a pathological width table, which we might not be able to send at all (at least not without inventing some sort of flow-control scheme to pay it out slowly).

If you think we can solve those issues, and you want to go whole-hog, I'm okay with going that way. Maybe we should just buckle down and do some flow control, i.e. make sure that even if the wcwidth diff is very large, it will be added to the synced object piecemeal (so only a small segment of the diff is outstanding at any given time).
