Accept wide chars in filter input (#1038) #1105

adsr · 2022-10-13T06:20:09Z

(For hacktoberfest)

~~Notably this doesn't handle wcwidth!=1 or extended grapheme clusters, but you can filter for your emoji process now.~~

Handles wcwidth!=1 now if the input is validly encoded. See FunctionBar_displayWidth.

BenBE

Could you please merge the commits into one?

FunctionBar.c

Panel.c

adsr · 2022-10-14T00:24:07Z

Added your fixes and squashed. Thanks @BenBE.

IncSet.c

Explorer09 · 2022-10-14T03:53:44Z

FunctionBar.c

+static size_t FunctionBar_safeLen(const char *s) {
+   size_t slen = mbstowcs(NULL, s, 0);
+   if (slen == (size_t)-1) {
+      slen = strlen(s);


I would like some explanation on the logic. mbstowcs returns (size_t)-1 when encountered an illegal byte, but why fall back to strlen? How is this function meant to be used?

This function implements "get multi-byte string length, otherwise byte length". It is used below to position the cursor on the filter input field. In the happy path the string will have valid encoding, mbstowcs will succeed, and we'll position the cursor in the right place. If s is garbage, byte length is the only sensible fallback. safeLen is maybe not the best name.

Thanks for the review. I'll add your fixes shortly.

@adse So this function is meant to be getting a "display width"?
Unfortunately there is a complexity here. Not every Unicode character occupies one column in the terminal. CJK characters would occupy two. So assuming the "display width" of the string is the same as number of code points in the string (i.e. the return value of mbstowcs) is wrong. The best approach as far as I can find is to use wcwidth and wcswidth.

You're right. (I wrote this in the PR description.) This was only meant as an incremental improvement with proper handling and rendering as a follow-up.

I renamed FunctionBar_safeLen to FunctionBar_displayWidth which sums wcwidth of each wchar and falls back on strlen on invalid encoding.

You do need to filter that for display purposes (something that htop didn't implement yet, unfortunately).

I agree that filtering on display is needed, however I disagree with your code snippet (filtering input). As I stated, non-printables and invalid encodings are allowed in process names, so disallowing them as input would prevent users from matching such processes. Of course this is a very narrow edge case and naturally I will defer to the maintainers, but that is my view.

@BenBE or someone else want to weigh in?

I agree that filtering on display is needed, however I disagree with your code snippet (filtering input). As I stated, non-printables and invalid encodings are allowed in process names, so disallowing them as input would prevent users from matching such processes. Of course this is a very narrow edge case and naturally I will defer to the maintainers, but that is my view.

My idea is that the invalid characters are disallowed, in raw, from user input. They may only be input via escape sequences.

@Explorer09 Can you elaborate on the "input via escape sequences" part? Currently the filter/search functions don't do any escape sequence processing. If this was to be added, that'd be a different feature.

Also on the part of invalid encodings: As those are possible to occur as part of any process name/information we should not hinder filtering/searching for them. Thus if you limit inputs to valid character sequences only you'll also have to allow the user to specify if they wanted to look for an invalid encoding (intentionally).

@BenBE An example is embedded newlines in filenames, if you are adding that to the filter string, you should probably type it as \n and not the Enter key as the latter would mean something else. Likewise, you will type \033 for the ASCII ESC character in the filename and not the Escape key. That's the filtering I was talking about.

You will need to filter this from the keyboard input (within the GUI), but not from the -F option argument of htop (the escape sequence of the latter is processed by the shell).

@Explorer09 Exactly. And this handling is currently not implemented. Same for when you need to look for Hällö Wörld in a process that uses ISO-8859-1 instead of UTF-8 …

Panel.c

Explorer09 · 2022-10-14T04:54:38Z

IncSet.c

+   }
+
+   wchar_t w[INCMODE_MAX + 1];
+   size_t wlen = mbstowcs(w, s, slen);


The size limit of mbstowcs is number of wide characters (code points), not bytes.
The argument is meant to prevent buffer overflow of w, not the limit the buffer length of s .
Perhaps using mbrtowc with a loop instead would do what you want. The size limit of mbrtowc is indeed in bytes and it limits the length of s.

Good call. Would it not suffice to change the 3rd param to INCMODE_MAX + 1 and leave the rest? I believe the calling function ensures s is null-terminated.

@adsr

You need a NUL character, so the parameter should be INCMODE_MAX without +1, and you need this explicit line w[INCMODE_MAX] = 0; to ensure the wide string is null-terminated.

If you just set the 3rd parameter of mbstowcs to INCMODE_MAX, then it would make no sense to pass in slen parameter to the function.

@adsr
3. I've just read the caller code of IncSet_handleKey:

} else if (ch == KEY_BACKSPACE || ch == 127) { if (mode->index > 0) { mode->index--; mode->index -= IncSet_lastCharLen(mode->buffer, mode->index); mode->buffer[mode->index] = 0;

mode->index is the (byte) index to the last character of the filter string. With mbstowcs you cannot convert part of the multi-byte string with the given number of bytes. To convert part of the string you have a few options:

(a) Use mbtowc or mbrtowc with a loop as mentioned in my last comment.
(b) Duplicate mode->buffer, write a NUL byte in the right position, and pass this buffer to mbstowcs. I don't like this approach as you need to handle when the buffer allocation fails, and you need to free the buffer afterward.
(c) Change mode->index so it becomes code point index rather than byte index. This requires much more work and more risk breaking something.

With mbstowcs you cannot convert part of the multi-byte string with the given number of bytes.

As written, I don't think we can count on any of these mb* functions working unless we can guarantee the buffer is validly encoded, which we are not doing (yet). Therefore I don't see a big difference between using mbstowcs or looping with mbtowc. They both can fail and then you have employ some sort of fallback logic.

For example, consider the string "testλ" except let's remove the last byte, so 0x74 65 73 74 ce. In this case the mbstowcs in lastCharLen will return -1, so the function will return 1 leading to use deleting the 0xce byte. This is reasonable. Looping with mbtowc, we'd similarly fail on the last byte of the string. If the invalid encoding was in the middle of the string instead of the end, the behavior could be different, but it's not clear to me what the right thing to do there is. The end-user would have to be trying pretty hard to make that happen.

For the happy path, consider the full string "testλ" 0x74 65 73 74 ce bb. In this case the mbstowcs succeeds and w holds ['t', 'e', 's', 't', 'λ', 0]. We measure the byte length of "λ" using wcrtomb and return 2, leading us to delete the "λ". Also good.

The proper solution is a lot more work as you point out. IMO what's here will work for most cases in what is already an edge case (non-ASCII process names).

Let me know if I've missed something. Thanks for the reviews @Explorer09.

@adsr How about this: If there is an invalid byte encoding in the middle of the string, cut the invalid byte and everything after it. You shouldn't trust the string input by the user, so there would be a sanitization routine somewhere. Cutting just the last byte might not be sufficient. Consider a 4-byte UTF-8 character F0 9F 8D 8E (U+1F34E) and one last byte is Iost and the string becomes F0 9F 8D, a backspace should remove all three invalid bytes so that the rest of the sequence is valid.

Sure that also works. I renamed the function to IncSet_backspaceLen and changed it to loop with mbtowc as you describe.

adsr

Thanks for the reviews.

Panel.c

Notably this doesn't handle wcwidth!=1 or extended grapheme clusters, but you can filter for your emoji process now.

Explorer09 · 2022-10-17T04:55:46Z

I have a concern about Panel_getCh function. The API was designed to return a byte despite us implementing multi-byte support for it. And the function maintains a state (mbbuf, mbbuf_i and mbbuf_len) to make it work with multi-byte. I wonder if it is possible to avoid the state by changing the function to return the multi-byte sequence directly.
I'm thinking of a function prototype like this:

int Panel_readMultiByteChar(Panel* this, char *mbBuf)
// mbBuf must have at least (MB_LEN_MAX + 1) bytes. The function outputs the
// multi-byte sequence in mbBuf and the sequence is terminated by a 0x00 byte.

This API would be more future-proof as it no longer returns a byte at the middle of the sequence, which could lead to a (potential) synchronization problem (I know this synchronization problem won't affect UTF-8 though).

adsr · 2022-10-18T03:15:00Z

@BenBE Could you add the hacktoberfest-accepted label? It takes a few days for it to register and I'd like to get credit for it if you think it's worthy of course. Sorry to bother.

BenBE · 2022-10-20T11:21:05Z

@Explorer09 Are there any other open issues with this PR that you currently see?

AFAICS we have one issue re escaping of non-printables/invalid encodings for passing as a filter (the escape handling is out of scope, the proper filtering should be dealt with).

Also there's the last_wkey global I don't quite like. How much work would it take to avoid it?

@adsr We discussed this in the team and have not reached agreement on this just yet.

adsr · 2022-10-21T05:34:33Z

Also there's the last_wkey global I don't quite like. How much work would it take to avoid it?

I think the main alternative is to modify Panel_getCh and its callers to use wint_t instead of int or char as @Explorer09 suggested. We might be able limit the scope of the change by introducing some proxy functions that call the existing functions in a loop, e.g., InfoScreen_run invokes Panel_getWch and then InfoScreen_run_inner in a loop with the individual bytes of the wide char. IMO this would be uglier than the current implementation, so I'd vote either for changing it all or sticking with the current implementation.

BenBE requested changes Oct 13, 2022

View reviewed changes

FunctionBar.c Outdated Show resolved Hide resolved

Panel.c Outdated Show resolved Hide resolved

Panel.c Outdated Show resolved Hide resolved

Panel.c Outdated Show resolved Hide resolved

BenBE added the enhancement Extension or improvement to existing feature label Oct 13, 2022

adsr force-pushed the mbfilter branch from e1e3eb0 to 71c46d2 Compare October 14, 2022 00:20

Explorer09 reviewed Oct 14, 2022

View reviewed changes

IncSet.c Outdated Show resolved Hide resolved

Explorer09 reviewed Oct 14, 2022

View reviewed changes

IncSet.c Show resolved Hide resolved

Explorer09 reviewed Oct 14, 2022

View reviewed changes

Panel.c Outdated Show resolved Hide resolved

adsr force-pushed the mbfilter branch 2 times, most recently from 94b827b to 6dd3125 Compare October 14, 2022 04:08

Explorer09 reviewed Oct 14, 2022

View reviewed changes

adsr commented Oct 14, 2022

View reviewed changes

adsr force-pushed the mbfilter branch 2 times, most recently from 8060f93 to e0326d0 Compare October 16, 2022 05:24

BenBE reviewed Oct 16, 2022

View reviewed changes

Panel.c Outdated Show resolved Hide resolved

Panel.c Outdated Show resolved Hide resolved

adsr force-pushed the mbfilter branch 2 times, most recently from 08af816 to b90a89e Compare October 16, 2022 17:00

Accept wide chars in filter input (htop-dev#1038)

5860ecd

Notably this doesn't handle wcwidth!=1 or extended grapheme clusters, but you can filter for your emoji process now.

adsr force-pushed the mbfilter branch from b90a89e to 5860ecd Compare October 17, 2022 01:08

BenBE added this to the 3.3.0 milestone Feb 3, 2023

BenBE linked an issue Feb 3, 2023 that may be closed by this pull request

Filter function does not work with non-ASCII characters #1038

Open

BenBE modified the milestones: 3.3.0, 3.4.0 Dec 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accept wide chars in filter input (#1038) #1105

Accept wide chars in filter input (#1038) #1105

adsr commented Oct 13, 2022 •

edited

BenBE left a comment

adsr commented Oct 14, 2022

Explorer09 Oct 14, 2022

adsr Oct 14, 2022

Explorer09 Oct 14, 2022

adsr Oct 14, 2022

adsr Oct 16, 2022

adsr Oct 18, 2022

Explorer09 Oct 18, 2022

BenBE Oct 18, 2022

Explorer09 Oct 19, 2022 •

edited

BenBE Oct 19, 2022 •

edited

Explorer09 Oct 14, 2022

adsr Oct 14, 2022

Explorer09 Oct 14, 2022 •

edited

Explorer09 Oct 14, 2022

adsr Oct 14, 2022

Explorer09 Oct 15, 2022

adsr Oct 16, 2022

adsr left a comment

Explorer09 commented Oct 17, 2022

adsr commented Oct 18, 2022

BenBE commented Oct 20, 2022

adsr commented Oct 21, 2022

Accept wide chars in filter input (#1038) #1105

Are you sure you want to change the base?

Accept wide chars in filter input (#1038) #1105

Conversation

adsr commented Oct 13, 2022 • edited

BenBE left a comment

Choose a reason for hiding this comment

adsr commented Oct 14, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Explorer09 Oct 19, 2022 • edited

Choose a reason for hiding this comment

BenBE Oct 19, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Explorer09 Oct 14, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adsr left a comment

Choose a reason for hiding this comment

Explorer09 commented Oct 17, 2022

adsr commented Oct 18, 2022

BenBE commented Oct 20, 2022

adsr commented Oct 21, 2022

adsr commented Oct 13, 2022 •

edited

Explorer09 Oct 19, 2022 •

edited

BenBE Oct 19, 2022 •

edited

Explorer09 Oct 14, 2022 •

edited