Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept wide chars in filter input (#1038) #1105

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

adsr
Copy link
Contributor

@adsr adsr commented Oct 13, 2022

(For hacktoberfest)

Notably this doesn't handle wcwidth!=1 or extended grapheme clusters, but you can filter for your emoji process now.

Handles wcwidth!=1 now if the input is validly encoded. See FunctionBar_displayWidth.

Copy link
Member

@BenBE BenBE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please merge the commits into one?

FunctionBar.c Outdated Show resolved Hide resolved
Panel.c Outdated Show resolved Hide resolved
Panel.c Outdated Show resolved Hide resolved
Panel.c Outdated Show resolved Hide resolved
@BenBE BenBE added the enhancement Extension or improvement to existing feature label Oct 13, 2022
@adsr
Copy link
Contributor Author

adsr commented Oct 14, 2022

Added your fixes and squashed. Thanks @BenBE.

IncSet.c Outdated Show resolved Hide resolved
IncSet.c Show resolved Hide resolved
FunctionBar.c Outdated
static size_t FunctionBar_safeLen(const char *s) {
size_t slen = mbstowcs(NULL, s, 0);
if (slen == (size_t)-1) {
slen = strlen(s);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like some explanation on the logic. mbstowcs returns (size_t)-1 when encountered an illegal byte, but why fall back to strlen? How is this function meant to be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function implements "get multi-byte string length, otherwise byte length". It is used below to position the cursor on the filter input field. In the happy path the string will have valid encoding, mbstowcs will succeed, and we'll position the cursor in the right place. If s is garbage, byte length is the only sensible fallback. safeLen is maybe not the best name.

Thanks for the review. I'll add your fixes shortly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adse So this function is meant to be getting a "display width"?
Unfortunately there is a complexity here. Not every Unicode character occupies one column in the terminal. CJK characters would occupy two. So assuming the "display width" of the string is the same as number of code points in the string (i.e. the return value of mbstowcs) is wrong. The best approach as far as I can find is to use wcwidth and wcswidth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. (I wrote this in the PR description.) This was only meant as an incremental improvement with proper handling and rendering as a follow-up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed FunctionBar_safeLen to FunctionBar_displayWidth which sums wcwidth of each wchar and falls back on strlen on invalid encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do need to filter that for display purposes (something that htop didn't implement yet, unfortunately).

I agree that filtering on display is needed, however I disagree with your code snippet (filtering input). As I stated, non-printables and invalid encodings are allowed in process names, so disallowing them as input would prevent users from matching such processes. Of course this is a very narrow edge case and naturally I will defer to the maintainers, but that is my view.

@BenBE or someone else want to weigh in?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that filtering on display is needed, however I disagree with your code snippet (filtering input). As I stated, non-printables and invalid encodings are allowed in process names, so disallowing them as input would prevent users from matching such processes. Of course this is a very narrow edge case and naturally I will defer to the maintainers, but that is my view.

My idea is that the invalid characters are disallowed, in raw, from user input. They may only be input via escape sequences.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Explorer09 Can you elaborate on the "input via escape sequences" part? Currently the filter/search functions don't do any escape sequence processing. If this was to be added, that'd be a different feature.

Also on the part of invalid encodings: As those are possible to occur as part of any process name/information we should not hinder filtering/searching for them. Thus if you limit inputs to valid character sequences only you'll also have to allow the user to specify if they wanted to look for an invalid encoding (intentionally).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenBE An example is embedded newlines in filenames, if you are adding that to the filter string, you should probably type it as \n and not the Enter key as the latter would mean something else. Likewise, you will type \033 for the ASCII ESC character in the filename and not the Escape key. That's the filtering I was talking about.

You will need to filter this from the keyboard input (within the GUI), but not from the -F option argument of htop (the escape sequence of the latter is processed by the shell).

Copy link
Member

@BenBE BenBE Oct 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Explorer09 Exactly. And this handling is currently not implemented. Same for when you need to look for Hällö Wörld in a process that uses ISO-8859-1 instead of UTF-8 …

Panel.c Outdated Show resolved Hide resolved
@adsr adsr force-pushed the mbfilter branch 2 times, most recently from 94b827b to 6dd3125 Compare October 14, 2022 04:08
IncSet.c Outdated
}

wchar_t w[INCMODE_MAX + 1];
size_t wlen = mbstowcs(w, s, slen);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size limit of mbstowcs is number of wide characters (code points), not bytes.
The argument is meant to prevent buffer overflow of w, not the limit the buffer length of s .
Perhaps using mbrtowc with a loop instead would do what you want. The size limit of mbrtowc is indeed in bytes and it limits the length of s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Would it not suffice to change the 3rd param to INCMODE_MAX + 1 and leave the rest? I believe the calling function ensures s is null-terminated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adsr

  1. You need a NUL character, so the parameter should be INCMODE_MAX without +1, and you need this explicit line w[INCMODE_MAX] = 0; to ensure the wide string is null-terminated.

  2. If you just set the 3rd parameter of mbstowcs to INCMODE_MAX, then it would make no sense to pass in slen parameter to the function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adsr
3. I've just read the caller code of IncSet_handleKey:

  } else if (ch == KEY_BACKSPACE || ch == 127) {
      if (mode->index > 0) {
         mode->index--;
         mode->index -= IncSet_lastCharLen(mode->buffer, mode->index);
         mode->buffer[mode->index] = 0;

mode->index is the (byte) index to the last character of the filter string. With mbstowcs you cannot convert part of the multi-byte string with the given number of bytes. To convert part of the string you have a few options:

(a) Use mbtowc or mbrtowc with a loop as mentioned in my last comment.
(b) Duplicate mode->buffer, write a NUL byte in the right position, and pass this buffer to mbstowcs. I don't like this approach as you need to handle when the buffer allocation fails, and you need to free the buffer afterward.
(c) Change mode->index so it becomes code point index rather than byte index. This requires much more work and more risk breaking something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With mbstowcs you cannot convert part of the multi-byte string with the given number of bytes.

As written, I don't think we can count on any of these mb* functions working unless we can guarantee the buffer is validly encoded, which we are not doing (yet). Therefore I don't see a big difference between using mbstowcs or looping with mbtowc. They both can fail and then you have employ some sort of fallback logic.

For example, consider the string "testλ" except let's remove the last byte, so 0x74 65 73 74 ce. In this case the mbstowcs in lastCharLen will return -1, so the function will return 1 leading to use deleting the 0xce byte. This is reasonable. Looping with mbtowc, we'd similarly fail on the last byte of the string. If the invalid encoding was in the middle of the string instead of the end, the behavior could be different, but it's not clear to me what the right thing to do there is. The end-user would have to be trying pretty hard to make that happen.

For the happy path, consider the full string "testλ" 0x74 65 73 74 ce bb. In this case the mbstowcs succeeds and w holds ['t', 'e', 's', 't', 'λ', 0]. We measure the byte length of "λ" using wcrtomb and return 2, leading us to delete the "λ". Also good.

The proper solution is a lot more work as you point out. IMO what's here will work for most cases in what is already an edge case (non-ASCII process names).

Let me know if I've missed something. Thanks for the reviews @Explorer09.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adsr How about this: If there is an invalid byte encoding in the middle of the string, cut the invalid byte and everything after it. You shouldn't trust the string input by the user, so there would be a sanitization routine somewhere. Cutting just the last byte might not be sufficient. Consider a 4-byte UTF-8 character F0 9F 8D 8E (U+1F34E) and one last byte is Iost and the string becomes F0 9F 8D, a backspace should remove all three invalid bytes so that the rest of the sequence is valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that also works. I renamed the function to IncSet_backspaceLen and changed it to loop with mbtowc as you describe.

Copy link
Contributor Author

@adsr adsr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reviews.

@adsr adsr force-pushed the mbfilter branch 2 times, most recently from 8060f93 to e0326d0 Compare October 16, 2022 05:24
Panel.c Outdated Show resolved Hide resolved
Panel.c Outdated Show resolved Hide resolved
@adsr adsr force-pushed the mbfilter branch 2 times, most recently from 08af816 to b90a89e Compare October 16, 2022 17:00
Notably this doesn't handle wcwidth!=1 or extended grapheme clusters, but you
can filter for your emoji process now.
@Explorer09
Copy link
Contributor

I have a concern about Panel_getCh function. The API was designed to return a byte despite us implementing multi-byte support for it. And the function maintains a state (mbbuf, mbbuf_i and mbbuf_len) to make it work with multi-byte. I wonder if it is possible to avoid the state by changing the function to return the multi-byte sequence directly.
I'm thinking of a function prototype like this:

int Panel_readMultiByteChar(Panel* this, char *mbBuf)
// mbBuf must have at least (MB_LEN_MAX + 1) bytes. The function outputs the
// multi-byte sequence in mbBuf and the sequence is terminated by a 0x00 byte.

This API would be more future-proof as it no longer returns a byte at the middle of the sequence, which could lead to a (potential) synchronization problem (I know this synchronization problem won't affect UTF-8 though).

@adsr
Copy link
Contributor Author

adsr commented Oct 18, 2022

@BenBE Could you add the hacktoberfest-accepted label? It takes a few days for it to register and I'd like to get credit for it if you think it's worthy of course. Sorry to bother.

@BenBE
Copy link
Member

BenBE commented Oct 20, 2022

@Explorer09 Are there any other open issues with this PR that you currently see?

AFAICS we have one issue re escaping of non-printables/invalid encodings for passing as a filter (the escape handling is out of scope, the proper filtering should be dealt with).

Also there's the last_wkey global I don't quite like. How much work would it take to avoid it?

@adsr We discussed this in the team and have not reached agreement on this just yet.

@adsr
Copy link
Contributor Author

adsr commented Oct 21, 2022

Also there's the last_wkey global I don't quite like. How much work would it take to avoid it?

I think the main alternative is to modify Panel_getCh and its callers to use wint_t instead of int or char as @Explorer09 suggested. We might be able limit the scope of the change by introducing some proxy functions that call the existing functions in a loop, e.g., InfoScreen_run invokes Panel_getWch and then InfoScreen_run_inner in a loop with the individual bytes of the wide char. IMO this would be uglier than the current implementation, so I'd vote either for changing it all or sticking with the current implementation.

@BenBE BenBE added this to the 3.3.0 milestone Feb 3, 2023
@BenBE BenBE linked an issue Feb 3, 2023 that may be closed by this pull request
@BenBE BenBE modified the milestones: 3.3.0, 3.4.0 Dec 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Extension or improvement to existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filter function does not work with non-ASCII characters
3 participants