Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wrong cursor position with non-ascii input #1

Closed
jsteemann opened this issue Oct 21, 2015 · 12 comments
Closed

wrong cursor position with non-ascii input #1

jsteemann opened this issue Oct 21, 2015 · 12 comments

Comments

@jsteemann
Copy link

Whenever the input contains non-ASCII characters that consist of multibyte characters, then the cursor position is wrong.
For example, when entering ö (an Umlaut character), the cursor will go to the right by 2 positions, but it should advance only one character. The reason for this is that the ö consists of two bytes in UTF-8, and entering this characters causes two calls to read() insidie linenoiseEdit() and the linenoise version used does not seem to be Unicode-aware at all.
So for UTF-8 multibyte characters it is broken, at least on all the Linuxes I tried.

@yhirose
Copy link
Owner

yhirose commented Oct 22, 2015

@jsteemann Thanks for the bug report. I have tested with the original linenoise library, and confirmed that it has also the same problem on my Mac OS X.
I'll try to find a solution for it. Of course, if you send me a pull request, it would be even better. :)

@jsteemann
Copy link
Author

@yhirose : yes, original linenoise is broken in this regard as well (maybe it was even intentional to reduce lines of code in linenoise).

I think it's quite complex to solve this issue properly. There is a fork of linenoise that supports UTF-8, but AFAIK it does not include all the multiline functionality from the original repository. I looked into all this a bit and think it would be quite a challenge to get these two repos together. Feel free to close this issue if you like.

@yhirose
Copy link
Owner

yhirose commented Oct 25, 2015

@jsteemann, I made a fork of the original linenoise and added UTF-8 support to the utf-support branch on my fork. I have tested with ö (U+00F6) and ö (U+006F U+0308) on my Mac OS Terminal. It also supports wide characters such as CJK characters like .

Here is the diff between the original code and mine:
antirez/linenoise@master...yhirose:utf8-support

I would really appreciate it if you could please test the branch on your Linux boxes. If you think it works on Linux, I'll port it to my cpp-linenoise as well (Since cpp-linenoise has to support Windows that doesn't support UTF-8 as default encoding in the command prompt, I have to port it carefully).

I have also posted my comment regarding this change on antirez/linenoise#25.

You gave me quite a challenge, but it was really enjoyable. :)

@jsteemann
Copy link
Author

I have tried the utf8-support branch.

Entering ö now first works fine, but there are issues afterwards.
Here's an example sequence of keystrokes that show the problem:

  • type m, ö, t, ö, r . So far, it's all good and the cursor is at the correct screen position.
  • now press backspace. The cursor jumps one position to the right instead of to the left. This is probably because the two ös are two-byte characters, so strlen("mötör") is 7 and removing the last byte means the remaining string has a length of 6 bytes. That's exactly where the cursor is then, it is six screen positions to the right instead of 6 bytes (in this case this would equal the 4 characters mötö).

@jsteemann
Copy link
Author

By the way, thanks for all the work you did!

I have also looked into what MongoDB did with linenoise. They have rewritten it in C++, and their version supports multiline-editing and UTF-8 and seems to work. They have made an extension to linenoise for handling Unicode that's licensed under AGPL. Here's the directory contained their version: https://github.com/mongodb/mongo/tree/master/src/mongo/shell

To try it out on Linux, one can simply use the files linenoise* and mk_* and glue them together. That produces a working linenoise with UTF-8 support and multiline editing. It should also work on Windows, however, I did not try this myself.

@yhirose
Copy link
Owner

yhirose commented Oct 26, 2015

@jsteemann , Thanks for taking time and testing the branch. I am also trying to reproduce the problem with m, ö, t, ö, r. But It works on my Ubuntu 15.10. Maybe some necessary steps are missing in my trial.

Here are the steps that I took:
(1) git clone https://github.com/yhirose/linenoise.git
(2) cd linenoise
(3) git checkout -b utf8-support origin/utf8-support
(4) make && ./linenoise_example.
(5) Changed to German keyboard, then type m ö t ö r. The cursor is located after the last r.
(6) Press backspace. It erased the last r and moved the cursor after the 2nd ö.
(7) Press backspace again. It erased the 2nd ö and moved the cursor after the t.

Here is a screen shot that I got after the step 7 took place.

image

Do I miss something to reproduce the problem? I would really appreciate it if you could give me whatever information which you think could be helpful. Thanks for your help again!

The MongoDB version looks very good. I haven't tested it yet, but the code handles UTF-8 encoding very well even including wide characters and composed characters (characters with diacritics). I am actually amazed with their code quality. Only my concern is that the version doesn't look like the original C-based linenoise code anymore, since they rewrote the original a lot with using C++. Of course I understand their circumstances, because MongoDB is a C++ project.

@jsteemann
Copy link
Author

I can confirm it works for me. I have probably been on the wrong branch before though I had intended to checkout the utf8-support branch. But as there have been no new commits on the utf8-support branch since Sunday, I really seem to have been on the wrong path. Sorry for that!

To summarize: the branch works fine with the above example I provided, and also with several other scripts I tried (Western European Hangul, Cyrillic, Chinese and some more Japanese texts I pasted from some tests we had).

あなたの素晴らしい仕事のためにありがとうございました

@yhirose
Copy link
Owner

yhirose commented Oct 29, 2015

Thanks for testing! When I port the UTF-8 support to cpp-linenoise, I'll post another message on this thread.

@jsteemann
Copy link
Author

Thanks!

@yhirose
Copy link
Owner

yhirose commented Nov 26, 2015

I have ported it. It works on Linux and Mac OS X at this point. When I have time, I'll try to support it on Windows as well.

@jsteemann
Copy link
Author

Thanks.
Just to let you know and share ideas: when we were looking for a working linenoise version with UTF-8 support a colleague of mine had also created another linenoise fork (https://github.com/arangodb/linenoise-ng).
That fork merged a recent linenoise version with multiline editing with some of the very nice editing feature extensions done by 10gen/MongoDB. In that fork, the UTF-8 support functionality from the 10gen/MongoDB fork was replaced with some other UTF-8 handling functions because of license issues. Instead it uses some other UTF-8 handling routines. It should compile on Linux, MacOS and Windows. It tries to be compatible to original linenoise API-wise, but it's not a header-only approach like your fork is.

@yhirose
Copy link
Owner

yhirose commented Nov 27, 2015

That is a really impressive work!!!

I am thinking of abandoning the current old codebase (antirez/linenoise, adoxa/ansicon, MSOpenTech/redis) in cpp-linenoise completely and remaking it based on this linenoise-ng in the future.

Thanks for sharing the good news again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants