Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multibyte support #25

Open
ozancaglayan opened this issue Jan 23, 2012 · 23 comments
Open

Add multibyte support #25

ozancaglayan opened this issue Jan 23, 2012 · 23 comments

Comments

@ozancaglayan
Copy link

Current code doesn't have support for multibyte strings, e.g. strings having unicode characters beyond ASCII range. The column shifts for refreshLine are calculated using strlen() which returns 2 instead of 1 for a 2-byte character like 'Ş' in Turkish.

The library should use mbstowcs() or other functions to get the number of characters instead of number of bytes for column processing (up, down arrows, erasing a character, etc.).

And also as those functions are LC_CTYPE dependent, either you or the applications using linenoise should call setlocale(LC_ALL, "") to set the application's locale to the system locale.

Thanks.

@ozancaglayan
Copy link
Author

@msteveb
Copy link

msteveb commented Jan 23, 2012

Take a look at my fork, https://github.com/msteveb/linenoise, which has support for utf-8

@ozancaglayan
Copy link
Author

Do you really need all those functions? I'm not quite familiar with the stuff but I easily fixed some of the weird problems by using mbstowcs() instead of strlen() where the length of the string is assumed equivalent to the number of characters in the string. But I couldn't find way to fix deleting of wide characters with backspace..

@msteveb
Copy link

msteveb commented Jan 23, 2012

The approach here is to avoid any reliance on system support for utf-8. For example, I have systems running uClibc without locale support which can still happily run a utf-8 console over a serial port. Of course you are welcome to take a different approach.

@jasom
Copy link

jasom commented Dec 3, 2013

I have a similar issue; I tried out line-noise for a shell implementation. If I want coloured prompts, the escape-codes end up being included in the length calculation.

A simpler, easier fix is to eirther:

  1. allow specifying the length of the prompt yourself.
  2. use terminal commands to extract the position of the cursor after outputing the prompt (not sure if this is possible)

@lilydjwg
Copy link

I find this from mongo shell's code. I'm always annoyed by more and more CLI tools (mongo, redis-cli, node)) I use whose cursor moves weiredly when there are multibyte characters. I don't know if the others are using linenoise or something else, but I'd like to see this get fixed :-)

@jasom
Copy link

jasom commented Mar 14, 2014

I've made a modified linenoise that lets you specify the width yourself, so it's extra work for the application, but at least possible; I've been using it for about 3 months with no problems. I'll turn it into a pull request, perhaps.

@yhirose
Copy link

yhirose commented Oct 26, 2015

'utf-8 support' branch on my fork fixed the following UTF-8 problems that appear in the latest linenoise version 1.0:

  • Multi-byte characters: ö (U+00F6)
  • Multi-code characters: ö (U+006F U+0308)
  • Wide characters: 日本語 ('Japanese')
  • Prompt text including the above characters and ANSI escaped colored text.

I first tried https://github.com/msteveb/linenoise. But it is not based on the latest linenoise which supports the fantastic multiline mode. Also it doesn't support CJK wide characters and multi-code characters...

@antirez
Copy link
Owner

antirez commented Oct 26, 2015

Hello, I'm thinking about going the following route with this issue:

  1. Use @yhirose as a reference in order to check where the C plain string functions should be substituted by multi-byte aware ones.
  2. Export an API that allows linenoise user to set alternative functions for string length calculations. Set the function to the plain C functions as default.
  3. Include @yhirose code as a separated file that you can add to your application, calling the linenoise new functions to set the length functions, in order to have multi-byte support.

This way we obtain that linenoise simplicity remains almost untouched, but optionally it is both possible to support multi byte chars both with C++ functions, other user provided functions different from standard ones, or the ones included in linenoise itself if your project is in C and you don't want to rewrite what @yhirose already wrote again and again.

Makes sense to you? Thanks.

@yhirose
Copy link

yhirose commented Oct 27, 2015

@antirez, Thanks for paying attention to the multi-byte code users! The idea that you presented totally makes sense to me. I am even happier because if the linenoise library itself can give the extensibility, we could easily add other multi-byte encoding support.

As you can see in my fork, the most important concept for enabling 'multi byte' support is to make a clear distinction between 'byte position/width' in text buffer and 'column position/width' on screen. Here are some examples in UTF-8:

  • (U+3042): E3 81 82 (3 bytes): Wide (2 column width)
  • ö (U+00F6): C3 B6 (2 bytes): Narrow (1 column width)
  • (U+006F U+0308): 6F CC 88 (3 bytes): Narrow (1 column width)

Once we come to know the difference, it's pretty easy to handle multi-byte code correctly. You can grasp the idea from changes in the 1st commit. I applied the same principle to prompt text in the 2nd commit as well.

The only place where we need to be careful is the multiline mode handling code. For instance, when the last character is wide and there is only 1 column left on the current row, that wide character doesn't fit the remaining space. So the wide character must be displayed at the beginning of the next line. This code handles it.

One more thing that I did is to skip all the ANSI escape sequence characters when calculating column position/width in the 3rd commit. This change enables us to use color in the prompt text.

I am really excited to see the new API in the near future. Please let me know if you have any questions on this matter. I am sure that you will do a fantastic job!!

@yhirose
Copy link

yhirose commented Oct 29, 2015

After researching more about dependencies between the linenoise code and the UTF-8 encoding code according to your design goal, I realized that only three functions are needed when adding other encoding support.

Based on the research, I have updated my branch. Here is the diff between the linenoise head and the utf8-support branch. As you could see there, I got rid of all UTF-8 specific code completely from linenoise.c and put them into encodings/utf8.h and encodings/utf8.c. Also I added one experiment API called linenoiseSetEncodingFunctions on linenoise.h, so that users could set their own set of encoding functions. I confirmed all the functionalities still work.

Here is a snippet of my current experimental API:

typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c);

void linenoiseSetEncodingFunctions(
    linenoisePrevCharLen *prevCharLenFunc,
    linenoiseNextCharLen *nextCharLenFunc,
    linenoiseReadCode *readCodeFunc);

linenoisePrevCharLen and linenoiseNextCharLen return byte length as the return value, and set column length to col_len parameter. linenoiseReadCode reads bytes into buf, and convert the bytes and set a meaningful character code for the encoding to c parameter.

If users don't call linenoiseSetEncodingFunctions, it'll end up calling default implementations. They simply handle one byte as a character.

Hope the post will be helpful when you design the new encoding API. I am really looking forward to it!!

@antirez
Copy link
Owner

antirez commented Nov 8, 2015

@yhirose that's a fantastic work!!! :-) I'm going to check the code and merge it. Thank you for this.

@henriqueleng
Copy link

Not merged yet?

@dumblob
Copy link

dumblob commented Jun 25, 2016

@antirez any progress on merging it?

@yhirose
Copy link

yhirose commented Jun 28, 2016

I have modified my fork (https://github.com/yhirose/linenoise/tree/utf8-support) to catch up with the recent changes made in the original linenoise such as 'hints' feature.

@Sonophoto
Copy link

Thank you very much @yhirose. You have made good code better! and my
job easier!
 
@Sonophoto

On Mon, 27 Jun 2016 18:56:45 -0700, yhirose wrote:

   I have modified my fork 

(https://github.com/yhirose/linenoise/tree/utf8-support) to catch up
with the recent changes made in the original linenoise such as 'hints'
feature.

You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.

@yhirose
Copy link

yhirose commented Oct 25, 2016

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 9.0.

@aleclarson
Copy link

@antirez Will you have free time in the near future to merge @yhirose's multi-byte support? Or should we switch https://github.com/hoelzro/lua-linenoise to use @yhirose's fork until then? ✌️

@yhirose
Copy link

yhirose commented Oct 15, 2018

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 11.0 and includes all the recent changes made in antirez/linenoise.

@yhirose
Copy link

yhirose commented Jul 10, 2019

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 12.1.

@yhirose
Copy link

yhirose commented Apr 24, 2020

My fork (https://github.com/yhirose/linenoise/tree/utf8-support) now supports Unicode 13.0.

@mcfriend99
Copy link

mcfriend99 commented Apr 6, 2022

@yhirose can jgriffiths solution for Win32 support in #8 be merged into the utf-8 support branch?? Also, you may consider merging the UTF-8 support into your main branch or moving the project into a different repository. A lot of us use it!

@yhirose
Copy link

yhirose commented Apr 6, 2022

@mcfriend99, thanks for your suggestion, but I am not interested in merging the Win32 specific code into this branch. My intention of this patch is to make the current linenoise code UTF8 compatible with the smallest possible effort and keeping the original linenoise code structure as much as possible.

As for moving to main branch, I'll take a look at it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests