Improved Unicode character width support #949
This code adds a system-independent Unicode widths table to Mosh, and adds a scheme for the client to propagate local configuration to the server.
This brings in Google libapps as a Git submodule.
Data structure suggestion for low space usage: a sorted array of (min codepoint, chwidth) pairs, where each entry represents the half-open interval from its codepoint to the next entry’s codepoint, would have just 1835 entries presently. It can be queried in logarithmic time with binary search, diffed by sorted set subtraction, and patched by sorted merging.
Are we planning to do anything to mitigate terminal desynchronization on wide characters that might now be supported by Mosh but not the terminal?
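A minimal Python sketch of the sorted-array suggestion, for illustration only. The entries in RUNS are a deliberately simplified toy table, not the 1835-entry table mentioned above, and the names RUNS, STARTS, and this stand-in chwidth() are assumptions made for the example rather than code from this PR.

```python
from bisect import bisect_right

# Toy (first code point, width) pairs, sorted by code point. Each entry
# covers the half-open interval up to the next entry's code point.
# The widths are illustrative and simplified, not real Unicode data.
RUNS = [
    (0x0000, 0),  # C0 controls, treated as width 0 in this sketch
    (0x0020, 1),  # printable ASCII
    (0x007F, 0),  # DEL and C1 controls
    (0x00A0, 1),  # Latin-1 and onward
    (0x0300, 0),  # combining diacritical marks
    (0x0370, 1),
    (0x1100, 2),  # a wide run
    (0x1160, 1),  # everything else lumped together as narrow (toy data)
]
STARTS = [start for start, _ in RUNS]


def chwidth(codepoint):
    """Return the width of the run containing codepoint, via binary search."""
    index = bisect_right(STARTS, codepoint) - 1
    return RUNS[index][1]
```

Because only the run boundaries are stored, a lookup costs one binary search over the boundary list, and the table grows with the number of width changes rather than with the size of the codespace.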
@andersk: yes, that's a fine candidate for the fixed tables and the file storage. However, when I did my performance work, I plugged in the Markus Kuhn
I haven't thought at all about desynchronization. As I see it, you'd either have to reposition the cursor on pretty much every cell, or you'd need to maintain a map of codepoints that you think might be desynchronized (say, every codepoint that's not in Unicode 3.0), and reposition only after those. My thinking more leans towards getting the Unicode width mapping up to date, assuming that newer characters that the terminal doesn't support are relatively rare, and giving the user a way to set the terminal's exact mapping.
Speaking of which, there's an opportunity to coordinate between Mosh and terminal emulators on this problem. If the terminal emulator could pass a widths table to mosh-client, it could DTRT. How, though? I think that blob is larger than you'd want to stuff into an environment variable. @keithw has mused on the idea of coordination between various members of the character-cell-terminal community (
I misremembered-- the Markus Kuhn
And maybe we can do even better than binary search: an optimal binary search tree, or some heuristic approximation of one, might be useful. The heuristic might be something like:
This is definitely one of the top complaints about Mosh today, so, thank you x1000 for taking this on. I wonder how you might feel about simplifying this slightly in a way to remove the protocol and negotiation parts, at some cost to correctness but a benefit in simplicity and predictability (and protocol support burden). What would be your views on this kind of "dumb" design?
This seems to solve 95% of the problems that users have today, and even fixes the problem of not knowing what the local terminal is going to do (because the user can reverse-engineer a width table out of their local terminal and then use it in mosh-client and mosh-server if they want). It avoids having to come up with a communications format for the width table, having to negotiate width tables between clients and servers, or having to standardize on a reference width table that we would be required to honor forever as part of the protocol. I guess I'm asking if you think the incremental benefit supplied by that part is going to be worth the burden, or if we can get away with the worse-is-better approach here.
We could even host the latest Unicode width table at a well-known location (mosh.org/something) and tell users to wget it from us if they want their mosh to support the latest fall emoji...
I don't see all that much protocol burden and there is no actual negotiation. You are aware that a property of protobufs is that the receiver ignores (but correctly skips) unknown fields, right? In this code, a new client always sends the table as new protobuf fields, and a new server knows about the fields and uses them. An old client + new server results in each using their own table (and this is no worse than existing mosh behavior), and a new client + old server results in the server ignoring the unknown protobuf fields, and each using their own table (and this is no worse than existing mosh behavior).
(Side note on protobufs: If you construct two Apart from
One thing I really want to address with this is version skew between client and server. It's fairly common that a user is running a fairly current version of the client, but an older server, because they're using an enterprise distro or in-house distro. If we do the simple approach, then we still have a problem of mismatched width tables for the naive user. This approach ensures that even the naive user will always have matching tables within Mosh, as long as their client and server both have this feature. It also allows a client to automagically send a width table matching the terminal, if it knows enough about the terminal.
My biggest concern with this implementation is that in cases where the user constructs a map that diverges in a complex way from the reference table, it might be too large for a single UDP datagram. Also, the client would now send bigger datagrams than it previously has (I'd guess it's currently rare for a client message to contain more than 100 bytes of data, even in situations where many User states back up). The most divergent overlay I've constructed so far is an ISO-8859-only map, where the first 256 codepoints are unaltered, and the entire rest of the Unicode codespace is blanked out with
As for your point 4, yes, this implementation can do that. I actually have some scripts for that which didn't make it into this PR.
Given the simple and compatible extension of the protobuf message, I thought all this was worth it. But you have more knowledge about Mosh usage/implementations than I do. Do you see a general problem or specific Mosh implementation that might get unhappy with this feature's implementation? There are at least two Mosh reimplementations that we know of. But they're both client-only, and since the server does nothing different here, they should be unaffected until they decide to add this feature. Are there any Mosh server reimplementations?
I'm inclined to defer to you here. What makes me uncomfortable about protocol changes is that we're sort of at the mercy of the Unicode committee, and it sounds like we end up with an eternal dependency on (and canonizing of) Unicode 10, and then we have to send a diff between that and whatever crazy version of Unicode gets deployed every year for eternity. If that gets big in 6 years, we could be unhappy. (Or have to invent a flow-control scheme...) Furthermore, as you say, a user could construct a pathological width table, which we might not be able to send at all (at least not without inventing some sort of flow-control scheme to pay it out slowly). If you think we can solve those issues, and you want to go whole-hog, I'm okay with going that way. Maybe we should just buckle down and do some flow control, i.e. make sure that even if the wcwidth diff is very large, it will be added to the synced object piecemeal (so only a small segment of the diff is outstanding at any given time).
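To make the "piecemeal" idea concrete, here is a rough Python sketch in which only one segment of the diff is outstanding at a time. It is not how Mosh's state synchronization works and not part of this PR. The names split_diff and PiecemealSender, the 500-byte piece size, and the acknowledge-by-offset scheme are all hypothetical choices made for the example.

```python
def split_diff(diff, max_piece):
    """Cut diff bytes into (offset, piece) segments of at most max_piece bytes."""
    return [(offset, diff[offset:offset + max_piece])
            for offset in range(0, len(diff), max_piece)]


class PiecemealSender:
    """Offer one segment at a time; advance only when it is acknowledged."""

    def __init__(self, diff, max_piece=500):  # 500 is an arbitrary example size
        self.segments = split_diff(diff, max_piece)
        self.position = 0

    def outstanding(self):
        """Return the single unacknowledged segment, or None when finished."""
        if self.position < len(self.segments):
            return self.segments[self.position]
        return None

    def acknowledge(self, offset):
        """Advance if the acknowledgement matches the outstanding segment."""
        current = self.outstanding()
        if current is not None and current[0] == offset:
            self.position += 1


# Usage: the receiver appends pieces in offset order as they arrive.
sender = PiecemealSender(bytes(range(256)) * 5, max_piece=500)
received = bytearray()
while (segment := sender.outstanding()) is not None:
    offset, piece = segment
    if offset == len(received):  # ignore duplicates or out-of-order pieces
        received.extend(piece)
    sender.acknowledge(offset)
```

The trade-off in a scheme like this is startup latency: a large diff takes several round trips before both sides agree on the full table, so something would still have to decide which widths apply in the meantime.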
This is a first draft of flexible Unicode character width handling for Mosh. It's not complete, and I'd love to get some comment on this.
There are two parts to this:
Mosh itself gets Unicode tables, a chwidth() function to replace wcwidth(), and code to load Unicode tables or partial overlays in mosh-client and transmit them to mosh-server.
Code to generate new Unicode tables is in src/unicode. This only needs to be run when a new version of Unicode is released, and is not part of Mosh's normal build infrastructure. A developer will need to run it and commit the resulting changed table, once a year or so, following Unicode.org's release schedule. This code was also used to generate the tables in this pull request with a caveat noted below. We depend on Google's libapps, which has some code to generate character width tables in JavaScript, which is used in hterm. (Many thanks to @vapier for doing this hard work of determining what a Unicode widths table for terminals should be in the first place, and for taking a small change that allows us to stand on his shoulders.)
How This Works
This code adds two fixed Unicode tables to Mosh: a reference table, which will never change after its initial introduction to Mosh, and a default table, which will be updated with each new Unicode release. Additionally, the user can overlay the default table with changes for some characters (like making East Asian Width Ambiguous characters wide instead of narrow), or replace the default table entirely with a complete table.
At startup, mosh-client creates a working table that is a combination of the default table and whatever overlay/table the user has loaded. It uses this to determine character widths locally. It also compares the reference table and the working table to create an overlay with the difference between them. It sends this overlay to mosh-server, which applies the overlay to its copy of the reference table to create a working table that is the same as the one on the client. This overlay table is compressed before being added to a Message, and then the entire Message is compressed before being sent to the server. This double compression results in a very small growth in that initial Message.
My plan is that at initial release, the reference table will be generated from the Unicode 10.0.0 data files, and until Unicode 11 is released, the default table will actually be exactly the same. But for development and illustration, currently the reference table is Unicode 9.0 and the default table is 10.0.0. For this pairing, the initial client-to-server message only grows about 43 bytes with the addition of the overlay table. Since the Unicode organization keeps adding emoji, this differential will grow, but my hope is that it will still remain below the size of a Mosh-MTU packet for quite a while.
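The overlay round trip can be sketched in a few lines of Python. This assumes the one-character-per-code-point string format used in this PR, where '=' in an overlay means "take this character from the base table". The function names make_overlay and apply_overlay and the toy table contents are assumptions for the example, not the actual Mosh code. The second zlib pass stands in for compressing the whole Message, which in reality carries other fields too.

```python
import zlib


def make_overlay(reference, working):
    """Replace every code point that matches the reference with '='."""
    if len(reference) != len(working):
        raise ValueError("this sketch assumes tables of equal length")
    return "".join("=" if r == w else w for r, w in zip(reference, working))


def apply_overlay(base, overlay):
    """Rebuild a working table: '=' keeps the base entry, anything else wins."""
    return "".join(b if o == "=" else o for b, o in zip(base, overlay))


# Toy tables covering the whole codespace (0x110000 code points).
# The widths are made up for illustration, not real Unicode data.
reference = "1" * 0x1100 + "2" * 0x60 + "1" * (0x110000 - 0x1160)
working = reference[:0x1F600] + "2" * 0x50 + reference[0x1F650:]

# Client side: diff against the reference, then compress twice.
overlay = make_overlay(reference, working)
packed = zlib.compress(overlay.encode("ascii"), 9)  # overlay field
twice = zlib.compress(packed)                       # stand-in for the Message

# Server side: undo both passes and apply to its own reference copy.
unpacked = zlib.decompress(zlib.decompress(twice)).decode("ascii")
restored = apply_overlay(reference, unpacked)
```

The second pass helps because deflate's compression ratio is capped at roughly 1000:1, so a megabyte of '=' still leaves on the order of a kilobyte after one pass. That output is itself highly repetitive, which is what the second pass removes.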
In this initial implementation, the in-core tables, the messages from client to server, and the user's custom files are all exactly the same format: they are a string of 1114112 bytes or less, one character for each Unicode code point. That character may be '0', '1', or '2' to represent a character width, '-' to represent an illegal code point, or (in an overlay table) '=' means "take this character from the base table". Nothing says that any of these objects need to be this format, or the same format as one of the others. It is a trivial format to parse for file input, and the extremely simple format is amenable to being compressed twice by zlib. But I do think we need to come up with something better for the fixed tables stored in the executable, and the working table constructed at runtime-- perhaps a list of runs for the fixed tables, and a two-level table for the runtime lookup (as many wcwidth implementations do).
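For the runtime lookup, a two-level table could look roughly like the Python sketch below: the flat string is cut into fixed-size blocks, identical blocks are stored once, and a small index maps each block number to its shared contents. The 256-entry block size, the names build_two_level and lookup, and the toy table are assumptions for illustration. Nothing here is taken from the PR's code.

```python
BLOCK_BITS = 8                 # 256 code points per block (arbitrary choice)
BLOCK_SIZE = 1 << BLOCK_BITS


def build_two_level(table):
    """Split a flat width string into deduplicated fixed-size blocks."""
    blocks = []      # unique block contents
    block_ids = {}   # block contents -> position in blocks
    index = []       # one entry per block of the codespace
    for start in range(0, len(table), BLOCK_SIZE):
        chunk = table[start:start + BLOCK_SIZE]
        if chunk not in block_ids:
            block_ids[chunk] = len(blocks)
            blocks.append(chunk)
        index.append(block_ids[chunk])
    return index, blocks


def lookup(index, blocks, codepoint):
    """Two array accesses: pick the block, then the entry inside it."""
    block = blocks[index[codepoint >> BLOCK_BITS]]
    return block[codepoint & (BLOCK_SIZE - 1)]


# Toy flat table in the one-character-per-code-point format (made-up widths).
flat = "1" * 0x1100 + "2" * 0x60 + "1" * (0x110000 - 0x1160)
index, blocks = build_two_level(flat)
```

With this toy table the index has 4352 entries but only two distinct blocks are stored. Real width data would need more blocks, but large uniform stretches of the codespace would still share storage, and each lookup stays constant-time.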
Problems I think this helps solve:
wcwidth() implementations on client and server, and mismatching with the terminal emulator's width map.
mosh-server with whatever width map the system gave them, it's possible to configure a client with a table that matches.
Issues:
--eaw-is-wide flag and/or automatic detection from locale variables in mosh, I haven't coded that up yet.
utf8_to_utf32() and utf32_to_utf8()) we can eliminate all of Mosh's dependencies on libc locale code. This would improve portability, and allow mosh-server to merely warn of locale/charset issues on startup instead of terminating with an error. This would also allow ripping out some of the cruft to work around slow libc locale handling.
src/unicode/Makefile.am is a barely-working, half-broken mess. That functionality needs to be a bit better integrated into autoconf/automake too, and I'm not sure how the Git submodule should be handled.
Requests: