Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nonprintable and control characters are not escaped in the prompt #632

Open
Rycieos opened this issue Nov 11, 2020 · 5 comments
Open

Nonprintable and control characters are not escaped in the prompt #632

Rycieos opened this issue Nov 11, 2020 · 5 comments
Assignees
Labels
enhancement Feature request

Comments

@Rycieos
Copy link
Collaborator

Rycieos commented Nov 11, 2020

Shell: all bash and zsh versions (bash 3.2-5.1, zsh 5.5.1)
Liquidprompt version: v_1.11 though v1.21.1

Problem description

Unix filenames have always been a problem for shells, as they can include any character accept the null char, and a filename can't include the forward slash (/) since it is the path delimiter. Other than those two characters, it can include anything, including escape characters. These characters by design either have no visual representation (historically terminal control characters) or their visual representation is whitespace only (newline, tab, etc.).

Currently, in Liquidprompt, if a directory includes a special character, it is treated just like any "normal" character, and printed directly to the terminal. For control characters, this either shows nothing, or adds whitespace to the prompt. This can be annoying, like an extra newline, or confusing and/or break the prompt, like a carriage return or vertical tab. This isn't a problem limited to directory names, as any external data source could include any character except null in Bash or even null in Zsh.

This isn't a problem unique to Liquidprompt. Bash, when expanding the current working directory directive (\w or \W), strips control characters, not displaying them at all (until Bash 5.1, which prints them literally). I can't find any documentation describing this behavior. Zsh, when expanding the current working directory directive (%~), replaces some with a \<char> code (eg \n, \t), and others with a ^<letter> notation (eg ^A, ^H). This is better, but the inconstancy between the two forms can be confusing, and Zsh does nothing to escape a literal backslash or caret (\ or ^), meaning a directory named with ANSI-C notation $'bar\\nope\nfoo' will print as bar\nope\nfoo, making the first \n sequence appear to be a newline character, when it is actually a backslash followed by the letter n.

The lack of bug reports on this topic, and the fact that Bash also chooses to mostly ignore the problem, probably mean that this isn't an issue that many users face. But it isn't always up to the user to name their directories, VCS branch names, and other data sources: often external sources or programs can inject otherwise invisible characters into such strings, and pretending that won't happen won't fix the problem.

Possible Options

We have a few options to resolve this issue:

  1. Do nothing. Printing these characters can never break the terminal, but might make the prompt unreadable.
  2. Strip unprintable characters, like Bash does. This sounds simple, but has unexpected complications. And a user might get confused or stuck on trying to reference a directory or branch that doesn't exist because their prompt changed the string.
  3. Replace unprintable characters with string representations. This is my recommendation. A drawback is that there will rarely be a case where this is useful, but the check will always slow down the prompt.
  4. Default to 3), but allow for users to customize what strings replace unprintables.

Possible Implementations

2).

This is simple to implement, but see below for encoding issues that might come up.

string=${string//[$'\001\002\003\004\005\006\a\b\t\n\v\f\r\016\017\020\021\022\023\024\025\026\027\030\031\032\e\034\035\036\037\177\000']}

Note also that the null character (\0) is last, as in Bash that character will terminate a string (and therefore can't be stripped, but also can't be in the input string in the first place).

EDIT:
Even simpler would be to use the character class:

string=${string//[[:cntrl:]]}

3).

This is a replacement for the current _lp_escape() function that only escapes backslashes on Bash and percents (%) on Zsh so they are printed literally in the prompt. This would still do that, but would also replace control characters.

The difficulty here is that we can't fork out to another tool to help. This function is called multiple times per prompt generation, and would be too slow. The only way we can take this option is if it doesn't add slowdown. printf is a builtin, and by using the -v flag, we can assign the output directly to a variable without using a subshell.

The ls command has a --quoting-style=shell-escape option (which seems to be the default on modern systems) that does this string escaping nicely for directories. We can't call out to ls to do this escaping for us for multiple reasons, but it might provide some tips on how to do it correctly.

_LP_ESCAPES=(
    0   001 002 003 004 005 006 a
    b   t   n   v   f   r   016 017
    020 021 022 023 024 025 026 027
    030 031 032 e   034 035 036 037
)
_LP_ESCAPES[127+_LP_FIRST_INDEX]=177

__lp_escape() {
    local string=${1-}
    # Replace with two backslashes to distinguish from escape codes
    string=${string//'\'/$_LP_BACKSLASH$_LP_BACKSLASH}
    # Replace with two percents if the shell needs it escaped
    string=${string//'%'/$_LP_PERCENT}
    local -i codepoint i
    ret=
    for (( i=0; i<${#string}; i++ )); do
        # The ' literal in the string tells printf to convert to the integer codepoint of the character
        printf -v codepoint '%d' "'${string:$i:1}"
        if (( codepoint < 32 || codepoint == 127 )); then
            ret+=${_LP_BACKSLASH}${_LP_ESCAPES[codepoint+_LP_FIRST_INDEX]}
        else
            ret+=${string:$i:1}
        fi
    done
}

The above implementation displays the characters as ls does: \<char> if it has a shorthand form, otherwise as \0xx where the number is the character's codepoint in octal.

Alternative encodings are still a problem here (see below). The for loop isn't ideal, but it is fast.

EDIT:
Using a check like below could skip the special encoding loop unless one of these characters exists:

if [[ -n "${string//[[:print:]]}" ]]; then
  # Do the encoding

Other complications

Liquidprompt is written in UTF-8, and all text pattern matching is ASCII only. Not every user terminal will be UTF-8 though, so whatever character matching we do needs to match (and maybe replace) control characters in the current encoding. Discovering control characters in the current encoding may be difficult or impossible. All of my suggested implementations are assuming ASCII or UTF-8, and don't deal with any characters outside of the ASCII range.

@Rycieos Rycieos added enhancement Feature request help wanted An issue where ideas or patches are welcome labels Nov 11, 2020
@Rycieos Rycieos added this to the v2.0 milestone Nov 11, 2020
@Rycieos Rycieos self-assigned this Nov 11, 2020
@Rycieos
Copy link
Collaborator Author

Rycieos commented Nov 11, 2020

@dolmen and @nojhan, if you have any thoughts or suggestions I would appreciate them.

@Rycieos Rycieos pinned this issue Dec 21, 2020
Rycieos added a commit that referenced this issue Jan 22, 2021
Unix filenames have always been a problem for shells, as they can
include any character except the null char, and a filename can't include
the forward slash (/) since it is the path delimiter. Other than those
two characters, it can include anything, including escape characters.
These characters by design either have no visual representation
(historically terminal control characters) or their visual
representation is whitespace only (newline, tab, etc.).

Currently, in Liquidprompt, if a directory includes a special character,
it is treated just like any "normal" character, and printed directly to
the terminal. For control characters, this either shows nothing, or adds
whitespace to the prompt. This can be annoying, like an extra newline,
or confusing and/or break the prompt, like a carriage return or vertical
tab. This isn't a problem limited to directory names, as any external
data source could include any character except null in Bash or even null
in Zsh.

This isn't a problem unique to Liquidprompt. Bash, when expanding the
current working directory directive (\w or \W), strips control
characters, not displaying them at all (until Bash 5.1, which prints
them literally). I can't find any documentation describing this
behavior. Zsh, when expanding the current working directory directive
(%~), replaces some with a \<char> code (eg \n, \t), and others with a
^<letter> notation (eg ^A, ^H). This is better, but the inconstancy
between the two forms can be confusing, and Zsh does nothing to escape a
literal backslash or caret (\ or ^), meaning a directory named with
ANSI-C notation $'bar\\nope\nfoo' will print as bar\nope\nfoo, making
the first \n sequence appear to be a newline character, when it is
actually a backslash followed by the letter n.

The lack of bug reports on this topic, and the fact that Bash also
chooses to mostly ignore the problem, probably mean that this isn't an
issue that many users face. But it isn't always up to the user to name
their directories, VCS branch names, and other data sources: often
external sources or programs can inject otherwise invisible characters
into such strings, and pretending that won't happen won't fix the
problem.

This commit replaces unprintable characters with string representations
using backslashes. A drawback is that there will rarely be a case where
this is useful, but the check will always slightly slow down the prompt.

Fixes #632
Rycieos added a commit that referenced this issue Jan 22, 2021
Unix filenames have always been a problem for shells, as they can
include any character except the null char, and a filename can't include
the forward slash (/) since it is the path delimiter. Other than those
two characters, it can include anything, including escape characters.
These characters by design either have no visual representation
(historically terminal control characters) or their visual
representation is whitespace only (newline, tab, etc.).

Currently, in Liquidprompt, if a directory includes a special character,
it is treated just like any "normal" character, and printed directly to
the terminal. For control characters, this either shows nothing, or adds
whitespace to the prompt. This can be annoying, like an extra newline,
or confusing and/or break the prompt, like a carriage return or vertical
tab. This isn't a problem limited to directory names, as any external
data source could include any character except null in Bash or even null
in Zsh.

This isn't a problem unique to Liquidprompt. Bash, when expanding the
current working directory directive (\w or \W), strips control
characters, not displaying them at all (until Bash 5.1, which prints
them literally). I can't find any documentation describing this
behavior. Zsh, when expanding the current working directory directive
(%~), replaces some with a \<char> code (eg \n, \t), and others with a
^<letter> notation (eg ^A, ^H). This is better, but the inconstancy
between the two forms can be confusing, and Zsh does nothing to escape a
literal backslash or caret (\ or ^), meaning a directory named with
ANSI-C notation $'bar\\nope\nfoo' will print as bar\nope\nfoo, making
the first \n sequence appear to be a newline character, when it is
actually a backslash followed by the letter n.

The lack of bug reports on this topic, and the fact that Bash also
chooses to mostly ignore the problem, probably mean that this isn't an
issue that many users face. But it isn't always up to the user to name
their directories, VCS branch names, and other data sources: often
external sources or programs can inject otherwise invisible characters
into such strings, and pretending that won't happen won't fix the
problem.

This commit replaces unprintable characters with string representations
using backslashes. A drawback is that there will rarely be a case where
this is useful, but the check will always slightly slow down the prompt.

Fixes #632
@Rycieos
Copy link
Collaborator Author

Rycieos commented Jan 22, 2021

My recommended solution (3 above) uses printf -v, which according to the Zsh release notes, is only supported since Zsh-5.3. At the moment Liquidprompt supports Zsh-5.1+ (or maybe Zsh-5.0+, there are some bugs right now), so this would be a pretty large loss.

I don't know of any other way to turn a character into its codepoint without make a forking call, which is needed since this would be called for each character printed to the screen. With a fork, it would be much too slow.

As the end of the v2.0 merge window is approaching, I am shoving this to v2.1 at least. Since there is no demand for this right now, this will sit on the backburner until a solution is found that does not have any speed loss.

@Rycieos Rycieos modified the milestones: v2.0, v2.1 Jan 22, 2021
@dolmen
Copy link
Collaborator

dolmen commented May 19, 2021

This is mostly a theorical problem. While the name of the current directory might contain control characters in theory, fixing this problem in LP requires much code and that code would have much runtime impact (if we can't use shell built-ins), while in practice you never have such characters.

When the current directory has bad characters, it's just easier for the user to temporarily disable the not-so-smart-prompt or to switch to another shell which doesn't have LP enabled. angel-PS1 has commands angel off and angel on just for that purpose.

Of course, control characters in directory name can be a security problem if they are not escaped. But I think that because this is in an interactive shell the user is responsible for cd-ing in a directory whose name is controlled by an external malicious entity.

So my opinion is that a big warning in LP documentation is better than the performance impact.

@Rycieos
Copy link
Collaborator Author

Rycieos commented May 19, 2021

You are probably right. I also discovered that Fossil allows for almost any character in branch names. This can cause issues if checking out someone else's repository, where they have branch names containing characters. This is slightly less in the control of the terminal user, but I would agree that it is probably not worth the effort or runtime cost.

I think the most likely scenario for this happening would be if some program created a directory with a newline or similar character in it. I have seen that before, but it is rare, and ls can helpfully show the name in a readable way.

I'll probably leave this as a "would be neat" feature for the backlog, and if it ever does get implemented, unless it has no performance impact, it would be disabled by default.

@Rycieos Rycieos removed this from the v2.1 milestone May 19, 2021
@Rycieos Rycieos removed the help wanted An issue where ideas or patches are welcome label May 19, 2021
@dolmen
Copy link
Collaborator

dolmen commented May 19, 2021

1/ Problems may happen if the current directory (or branch) contains control chars.
2/ The starting directory of LP is usually the home directory of the user, so zero risks here.
3/ For other directories, this requires that the user has cd into that directory. To type the cd command, the user should have seen the control characters to enter them.

Shell security problems exist. But this one is low on the list.

@dolmen dolmen unpinned this issue May 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature request
Projects
None yet
Development

No branches or pull requests

2 participants