Add multibyte support #25

ozancaglayan · 2012-01-23T17:14:28Z

The current code doesn't support multibyte strings, i.e. strings containing Unicode characters beyond the ASCII range. The column shifts in refreshLine are calculated using strlen(), which returns 2 instead of 1 for a 2-byte character like the Turkish 'Ş'.

The library should use mbstowcs() or other functions to get the number of characters instead of number of bytes for column processing (up, down arrows, erasing a character, etc.).

Also, since those functions depend on LC_CTYPE, either linenoise or the applications using it should call setlocale(LC_ALL, "") to set the application's locale to the system locale.
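
To make the difference concrete, here is a minimal sketch, not linenoise code. The helper names use_system_locale, byte_count and char_count are made up for illustration. It assumes the POSIX behaviour of mbstowcs(), where a NULL destination returns only the character count, and the result depends on the LC_CTYPE locale in effect.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Switch LC_CTYPE to the locale from the environment.
 * Returns 1 on success, 0 if the locale could not be set. */
int use_system_locale(void) {
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Number of bytes in s, which is what strlen() reports. */
size_t byte_count(const char *s) {
    return strlen(s);
}

/* Number of characters in s according to the current LC_CTYPE locale.
 * Returns (size_t)-1 if s is not valid in that locale's encoding.
 * Passing NULL as the destination to get only the count is POSIX
 * behaviour (an assumption here), not something ISO C guarantees. */
size_t char_count(const char *s) {
    return mbstowcs(NULL, s, 0);
}
```

With a UTF-8 locale active, byte_count("Ş") is 2 while char_count("Ş") is 1. Note that this counts characters, not screen columns, so it is still wrong for wide or combining characters.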

Thanks.

ozancaglayan · 2012-01-23T17:15:21Z

See: http://www.cl.cam.ac.uk/~mgk25/unicode.html

msteveb · 2012-01-23T22:02:51Z

Take a look at my fork, https://github.com/msteveb/linenoise, which has support for utf-8

ozancaglayan · 2012-01-23T22:45:20Z

Do you really need all those functions? I'm not very familiar with this stuff, but I easily fixed some of the weird problems by using mbstowcs() instead of strlen() wherever the length of the string is assumed to equal the number of characters in it. But I couldn't find a way to fix deleting wide characters with backspace.

msteveb · 2012-01-23T23:50:14Z

The approach here is to avoid any reliance on system support for utf-8. For example, I have systems running uClibc without locale support which can still happily run a utf-8 console over a serial port. Of course you are welcome to take a different approach.
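
As an illustration of what "no reliance on system support" can look like, here is a minimal sketch that counts UTF-8 code points without any locale or wchar functions. It is not taken from the fork; the name utf8_codepoint_count is hypothetical and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Count UTF-8 code points by counting every byte that is not a
 * continuation byte (10xxxxxx). Assumes s is valid UTF-8; no locale
 * or wide-character support from the C library is needed. */
size_t utf8_codepoint_count(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

This works the same on a libc built without locale support, which is the situation described above.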

jasom · 2013-12-03T22:01:22Z

I have a similar issue; I tried out linenoise for a shell implementation. If I want coloured prompts, the escape codes end up being included in the length calculation.

A simpler fix would be to either:

allow specifying the length of the prompt yourself, or
use terminal commands to read back the position of the cursor after outputting the prompt (not sure if this is possible).
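
On the second option: terminals that implement the ANSI "device status report" answer the query ESC [ 6 n with a reply of the form ESC [ row ; col R, so reading the cursor position back is possible on such terminals. The sketch below covers only parsing that reply; writing the query and reading the answer in raw mode are left out, and the name parse_cursor_report is hypothetical.

```c
#include <stdio.h>

/* Parse a cursor position report of the form ESC [ row ; col R.
 * Stores the 1-based row and column and returns 0 on success,
 * or returns -1 if reply does not have that form. */
int parse_cursor_report(const char *reply, int *row, int *col) {
    char final = 0;
    if (reply[0] != '\x1b' || reply[1] != '[') {
        return -1;
    }
    if (sscanf(reply + 2, "%d;%d%c", row, col, &final) != 3) {
        return -1;
    }
    return final == 'R' ? 0 : -1;
}
```

The column reported after printing the prompt, minus one, would be the prompt's real width on screen, regardless of escape codes or multibyte characters in it.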

lilydjwg · 2014-03-12T09:25:32Z

I found this from the mongo shell's code. I'm always annoyed that more and more of the CLI tools I use (mongo, redis-cli, node) move the cursor weirdly when there are multibyte characters. I don't know whether the others use linenoise or something else, but I'd like to see this fixed :-)

jasom · 2014-03-14T22:13:46Z

I've made a modified linenoise that lets you specify the width yourself, so it's extra work for the application, but at least possible; I've been using it for about 3 months with no problems. I'll turn it into a pull request, perhaps.

yhirose · 2015-10-26T00:15:18Z

'utf-8 support' branch on my fork fixed the following UTF-8 problems that appear in the latest linenoise version 1.0:

Multi-byte characters: ö (U+00F6)
Multi-code characters: ö (U+006F U+0308)
Wide characters: 日本語 ('Japanese')
Prompt text including the above characters and ANSI escaped colored text.

I first tried https://github.com/msteveb/linenoise. But it is not based on the latest linenoise which supports the fantastic multiline mode. Also it doesn't support CJK wide characters and multi-code characters...

antirez · 2015-10-26T07:57:53Z

Hello, I'm thinking about going the following route with this issue:

Use @yhirose as a reference in order to check where the C plain string functions should be substituted by multi-byte aware ones.
Export an API that allows linenoise users to set alternative functions for string length calculations. Set the functions to the plain C ones by default.
Include @yhirose code as a separated file that you can add to your application, calling the linenoise new functions to set the length functions, in order to have multi-byte support.

This way linenoise's simplicity remains almost untouched, while multi-byte support becomes optional: it can be provided by C++ functions, by other user-provided functions different from the standard ones, or by the ones included in linenoise itself if your project is in C and you don't want to rewrite what @yhirose already wrote.
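
The shape of such a hook could look roughly like the sketch below. Every name in it (line_len_fn, set_line_len_fn, line_len, utf8_len) is hypothetical and only illustrates the idea of a replaceable length function that defaults to plain strlen(); it is not an existing linenoise API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hook type: returns the length of a string as the
 * editor should see it. */
typedef size_t (*line_len_fn)(const char *s);

/* The default is plain strlen(), so behaviour is unchanged unless
 * the application installs something else. */
static line_len_fn current_len_fn = strlen;

/* Install a replacement; passing NULL restores the default. */
void set_line_len_fn(line_len_fn fn) {
    current_len_fn = (fn != NULL) ? fn : strlen;
}

/* What the editor would call wherever it used strlen() before. */
size_t line_len(const char *s) {
    return current_len_fn(s);
}

/* Example replacement: count UTF-8 code points instead of bytes
 * (assumes valid UTF-8 input). */
size_t utf8_len(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

An application that wants multi-byte support would call set_line_len_fn(utf8_len) once at startup; everyone else keeps the single-byte behaviour.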

Makes sense to you? Thanks.

yhirose · 2015-10-27T00:36:42Z

@antirez, Thanks for paying attention to the multi-byte code users! The idea that you presented totally makes sense to me. I am even happier because if the linenoise library itself can give the extensibility, we could easily add other multi-byte encoding support.

As you can see in my fork, the most important concept for enabling 'multi byte' support is to make a clear distinction between 'byte position/width' in text buffer and 'column position/width' on screen. Here are some examples in UTF-8:

あ (U+3042): E3 81 82 (3 bytes): Wide (2 column width)
ö (U+00F6): C3 B6 (2 bytes): Narrow (1 column width)
ö (U+006F U+0308): 6F CC 88 (3 bytes): Narrow (1 column width)
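
The three examples can be reproduced with a small sketch that decodes UTF-8 and sums a per-code-point column width. This is not the code from the fork: the width table here is deliberately tiny and approximate (combining diacritics count as 0 columns, a few CJK ranges as 2, everything else as 1), the input is assumed to be valid UTF-8, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s (assumed valid), store the code
 * point in *cp and return the number of bytes consumed. */
static size_t utf8_decode(const char *s, uint32_t *cp) {
    const unsigned char *p = (const unsigned char *)s;
    if (p[0] < 0x80) {
        *cp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(p[0] & 0x0F) << 12) |
              ((uint32_t)(p[1] & 0x3F) << 6) |
              (uint32_t)(p[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(p[0] & 0x07) << 18) |
          ((uint32_t)(p[1] & 0x3F) << 12) |
          ((uint32_t)(p[2] & 0x3F) << 6) |
          (uint32_t)(p[3] & 0x3F);
    return 4;
}

/* Approximate column width of a code point (illustrative table only). */
static int codepoint_columns(uint32_t cp) {
    if (cp >= 0x0300 && cp <= 0x036F) return 0;   /* combining marks */
    if ((cp >= 0x1100 && cp <= 0x115F) ||          /* Hangul Jamo */
        (cp >= 0x2E80 && cp <= 0xA4CF) ||          /* CJK, kana, ... */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||          /* Hangul syllables */
        (cp >= 0xF900 && cp <= 0xFAFF) ||          /* CJK compatibility */
        (cp >= 0xFF00 && cp <= 0xFF60)) {          /* fullwidth forms */
        return 2;
    }
    return 1;
}

/* Column width on screen of a NUL-terminated UTF-8 string. */
size_t utf8_columns(const char *s) {
    size_t cols = 0;
    while (*s != '\0') {
        uint32_t cp;
        s += utf8_decode(s, &cp);
        cols += (size_t)codepoint_columns(cp);
    }
    return cols;
}
```

For the strings above, strlen() gives 3, 2 and 3 bytes, while utf8_columns() gives 2, 1 and 1 columns. A real implementation needs a full width table derived from the Unicode data files.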

Once we come to know the difference, it's pretty easy to handle multi-byte code correctly. You can grasp the idea from changes in the 1st commit. I applied the same principle to prompt text in the 2nd commit as well.

The only place where we need to be careful is the multiline mode handling code. For instance, when the last character is wide and there is only 1 column left on the current row, that wide character doesn't fit the remaining space. So the wide character must be displayed at the beginning of the next line. This code handles it.
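
The wrapping rule can be sketched as a small position tracker. This is an illustration of the rule as described, not the code from the fork, and the names screen_pos and place_char are hypothetical.

```c
#include <stddef.h>

/* Zero-based cursor position on screen. */
struct screen_pos {
    size_t row;
    size_t col;
};

/* Advance pos past one character of the given column width on a
 * terminal that is term_cols columns wide. A character that does not
 * fit in the space left on the current row starts on the next row. */
struct screen_pos place_char(struct screen_pos pos, size_t width,
                             size_t term_cols) {
    if (pos.col + width > term_cols) {
        pos.row++;
        pos.col = 0;
    }
    pos.col += width;
    return pos;
}
```

On a 5-column terminal, after four narrow characters the cursor sits at row 0, column 4. A wide character then does not fit in the single remaining column, so it is placed at the start of row 1 and the cursor ends up at row 1, column 2.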

One more thing that I did is to skip all the ANSI escape sequence characters when calculating column position/width in the 3rd commit. This change enables us to use color in the prompt text.
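
A minimal sketch of the skipping idea, again not the code from the commit: it ignores CSI sequences (ESC '[', parameter bytes, then one final byte in the range 0x40 to 0x7E) and, to stay short, counts every other byte as one column, so it is only correct for ASCII text. The name visible_width is hypothetical.

```c
#include <stddef.h>

/* Width of s on screen, ignoring ANSI CSI escape sequences such as
 * "\x1b[32m". Each remaining byte counts as one column, which is only
 * correct for ASCII text (a simplification for this sketch). */
size_t visible_width(const char *s) {
    size_t width = 0;
    while (*s != '\0') {
        if (s[0] == '\x1b' && s[1] == '[') {
            s += 2;
            /* Skip parameter and intermediate bytes. */
            while (*s != '\0' && !(*s >= 0x40 && *s <= 0x7E)) {
                s++;
            }
            /* Consume the final byte, e.g. 'm'. */
            if (*s != '\0') {
                s++;
            }
        } else {
            width++;
            s++;
        }
    }
    return width;
}
```

For a green prompt such as "\x1b[32mhi\x1b[0m> " this returns 4, where strlen() would return 13.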

I am really excited to see the new API in the near future. Please let me know if you have any questions on this matter. I am sure that you will do a fantastic job!!

yhirose · 2015-10-29T16:46:26Z

After researching more about dependencies between the linenoise code and the UTF-8 encoding code according to your design goal, I realized that only three functions are needed when adding other encoding support.

Based on that research, I have updated my branch. Here is the diff between the linenoise head and the utf8-support branch. As you can see there, I removed all UTF-8 specific code from linenoise.c and moved it into encodings/utf8.h and encodings/utf8.c. I also added one experimental API, linenoiseSetEncodingFunctions, to linenoise.h so that users can set their own encoding functions. I confirmed that all the functionality still works.

Here is a snippet of my current experimental API:

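/* Return the byte length of the character before/after pos and store its column width in *col_len. */
typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
/* Read the bytes of one character from fd into buf and store its character code in *c. */
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int *c);

/* Install user-provided encoding functions in place of the one-byte defaults. */
void linenoiseSetEncodingFunctions(
    linenoisePrevCharLen *prevCharLenFunc,
    linenoiseNextCharLen *nextCharLenFunc,
    linenoiseReadCode *readCodeFunc);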

linenoisePrevCharLen and linenoiseNextCharLen return the byte length of the character and store its column width in the col_len parameter. linenoiseReadCode reads bytes into buf, converts them, and stores the resulting character code for the encoding in the c parameter.

If users don't call linenoiseSetEncodingFunctions, it'll end up calling default implementations. They simply handle one byte as a character.
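
For reference, defaults that treat one byte as one character could look like the sketch below. It follows the three signatures quoted above, but the function names (oneBytePrevCharLen and so on) and the exact return conventions on error are assumptions, not the code from the branch.

```c
#include <stddef.h>
#include <unistd.h>

/* The character before pos is one byte long and one column wide. */
size_t oneBytePrevCharLen(const char *buf, size_t buf_len, size_t pos,
                          size_t *col_len) {
    (void)buf; (void)buf_len; (void)pos;
    if (col_len != NULL) *col_len = 1;
    return 1;
}

/* The character at pos is one byte long and one column wide. */
size_t oneByteNextCharLen(const char *buf, size_t buf_len, size_t pos,
                          size_t *col_len) {
    (void)buf; (void)buf_len; (void)pos;
    if (col_len != NULL) *col_len = 1;
    return 1;
}

/* Read a single byte from fd, store it in buf[0] and in *c.
 * Returns the number of bytes read: 1 on success, 0 on end of
 * input or error (an assumed convention for this sketch). */
size_t oneByteReadCode(int fd, char *buf, size_t buf_len, int *c) {
    if (buf_len < 1) return 0;
    ssize_t n = read(fd, buf, 1);
    if (n != 1) return 0;
    *c = (unsigned char)buf[0];
    return 1;
}
```

A UTF-8 implementation would keep the same signatures but return 1 to 4 bytes per character (more when combining marks are attached) and a column width of 0, 1 or 2.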

Hope the post will be helpful when you design the new encoding API. I am really looking forward to it!!

antirez · 2015-11-08T09:26:25Z

@yhirose that's a fantastic work!!! :-) I'm going to check the code and merge it. Thank you for this.

henriqueleng · 2016-01-28T20:23:30Z

Not merged yet?

dumblob · 2016-06-25T10:27:44Z

@antirez any progress on merging it?

yhirose · 2016-06-28T01:56:43Z

I have modified my fork (https://github.com/yhirose/linenoise/tree/utf8-support) to catch up with the recent changes made in the original linenoise such as 'hints' feature.

Sonophoto · 2016-06-28T03:07:39Z

Thank you very much @yhirose. You have made good code better and my job easier!


yhirose · 2016-10-25T00:33:50Z

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 9.0.

aleclarson · 2018-02-21T00:08:14Z

@antirez Will you have free time in the near future to merge @yhirose's multi-byte support? Or should we switch https://github.com/hoelzro/lua-linenoise to use @yhirose's fork until then? ✌️

yhirose · 2018-10-15T07:07:48Z

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 11.0 and includes all the recent changes made in antirez/linenoise.

yhirose · 2019-07-10T02:53:54Z

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 12.1.

yhirose · 2020-04-24T17:53:09Z

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 13.0.

mcfriend99 · 2022-04-06T06:31:27Z

@yhirose can jgriffiths' solution for Win32 support in #8 be merged into the utf-8 support branch? Also, you may consider merging the UTF-8 support into your main branch or moving the project into a different repository. A lot of us use it!

yhirose · 2022-04-06T12:21:21Z

@mcfriend99, thanks for your suggestion, but I am not interested in merging the Win32 specific code into this branch. My intention with this patch is to make the current linenoise code UTF-8 compatible with the smallest possible effort while keeping the original linenoise code structure as much as possible.

As for moving to main branch, I'll take a look at it.

mfikes mentioned this issue Aug 5, 2015

Strange behavior for Japanese characters [segmentation fault] planck-repl/planck#30

Closed

yhirose mentioned this issue Oct 26, 2015

wrong cursor position with non-ascii input yhirose/cpp-linenoise#1

Closed

carloscabanero mentioned this issue Jun 28, 2016

crash on backspace after unicode character blinksh/blink#49

Closed

hoelzro mentioned this issue Apr 26, 2017

Able to merge in UTF-8 supporting fork of Linenoise into our version? raku-community-modules/Linenoise#20

Open

Sonophoto mentioned this issue Feb 20, 2018

Non-printing characters mess up tab completion hoelzro/lua-linenoise#15

Closed

aleclarson mentioned this issue Feb 21, 2018

ANSI colors in prompt string #150

Closed

yhirose mentioned this issue Apr 19, 2020

Another implementation for utf8 support. #187

Open
