Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved unicode wcwidth support #1001

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Conversation

cgull
Copy link
Member

@cgull cgull commented Aug 31, 2018

This is a revised version of #949.

Changes from #949 include:

  • Implement East Asian Width narrow/wide support. Automatically enable in C/J/K locales, and add --eaw-wide switch.
  • Transmit and mostly store width tables as a list of runs, greatly reducing storage needed. Also properly manage this width state in a class, rather than ad-hoc global state.
  • Add an edit_widths.py script for merging/diffing width tables.
  • Add a hold-session capability for a future day when the diff table is too large to fit in a single packet.
  • The reference table is still Unicode 9, but the default table is now Unicode 11. (The reference table will be set to Unicode 11 before being committed to Mosh.)
  • General cleanup.

Notes:

I did a bunch of testing with various formats to see what was most space efficient. Text proved more efficient than protobufs, lists of run lengths were better than lists of offsets. The 9->11 overlay is 946 bytes uncompressed and 350 bytes gzipped standalone (actual compression will be something slightly different in Mosh packets).

Changing the character width tables is a change in the state of the terminal emulator. I see two ways to handle that: either set the tables before the user session is allowed to start, or fully integrate the character width tables into the terminal state machine. I chose to set the tables at startup. I think to fully integrate the state, we would have to send the overlay table to the server, which would then have to send it (or references to the changes) back to the client to keep it in sync. Messy.

With a small overlay, nothing extra needs to be done to Mosh to hold back the user session-- a while ago, I added some functionality on the server to wait until the first message is received. But if the overlay ever gets so large that we decide we need to send it in pieces in multiple messages, then we need to add some functionality for that. That's what the hold-session commits are for. I lean towards including this as currently-unused functionality so it's available for future versions of Mosh, but I'm not sure about that yet.


int ChWidth::chwidth( wchar_t wc ) const
{
if (static_cast<size_t>( wc ) > widths.size()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>=.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that.

git submodule init
git submodule update

ucd/PropList.txt ucd/UnicodeData.txt ucd/EastAsianWidth.txt: UCD.zip
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This target will run multiple times concurrently and race with itself when building with -j.

unzip -o -d ucd UCD.zip
ln -fs ucd/PropList.txt ucd/UnicodeData.txt ucd/EastAsianWidth.txt .

eaw_narrow_map eaw_wide_map: ucd/PropList.txt ucd/UnicodeData.txt ucd/EastAsianWidth.txt edit_widths.py make_widths.py google-libapps/libdot/third_party/wcwidth/ranges.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both fixed in the rebase coming momentarily.

@andersk
Copy link
Member

andersk commented Aug 31, 2018

Transmit and mostly store width tables as a list of runs,

But they’re still expanded out to vector<unsigned char> at runtime. Did you decide that the extra 1.1 MB of memory per session is worth the performance improvement over binary search? That’s not necessarily an insignificant amount of memory to a phone or a virtualized server.

This code adds a system-independent Unicode widths table to Mosh, and
adds a scheme for the client to propagate local configuration to the
server.

We also generate character width tables from Unicode data.  This
brings in Google libapps as a Git submodule.

This uses Unicode 9.0 as the base map for demo purposes.  The default
map is 11.0.

It also adds an --eaw-wide option to select wide/narrow for East Asian
Ambiguous Width characters.
This hold/release is unnecessary because mosh-client always sends all
info needed for the session in its first message, and mosh-server
holds the session until after receiving the first message.  But this
does test the server-side functionality somewhat.
@cgull
Copy link
Member Author

cgull commented Sep 1, 2018

No, I've not decided anything about the final implementation of the actual lookup, or done any implementations/benchmarking/etc yet. Better seemed the enemy of good enough here, and I wanted to put this out for discussion/review to get the UI and protocol sorted out; a better implementation of chwidth() can easily be done later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants