Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for full-width characters #41

Open
saihaze opened this issue Jul 29, 2022 · 9 comments
Open

Adding support for full-width characters #41

saihaze opened this issue Jul 29, 2022 · 9 comments

Comments

@saihaze
Copy link

saihaze commented Jul 29, 2022

When the source code full-width characters (e.g. CJK characters), underlining is not proper. They might need too spaces.

@zesterer
Copy link
Owner

Do you have a screenshot + sample text example of this?

@saihaze
Copy link
Author

saihaze commented Aug 2, 2022

Let's take examples/simple.rs as an example. It is supposed to be like:
截图 2022-08-02 15-45-58
But when it includes a Chinese character, it looks like this:
截图 2022-08-02 15-46-37

@zesterer
Copy link
Owner

zesterer commented Aug 2, 2022

How much did you increment the spans by when doing so? And does Rust's .chars() count the character as a single character?

@saihaze
Copy link
Author

saihaze commented Aug 2, 2022

Usually a CJK character (a character used in China, Japan and Korea) takes the space of two ASCII characters. Rust's .chars() does count it as a single character. Perhaps you can use crate unicode_width instead.
I've never seen a wider character, though. But maybe you can take them into account, too.

@saihaze
Copy link
Author

saihaze commented Aug 2, 2022

图片

This is my own diagnostics module. It is simple, but might be an example.

@zesterer
Copy link
Owner

zesterer commented Aug 3, 2022

I'll put it on the list of things to do when I get round to refactoring the crate's internals. Contributions are also welcome of course!

@ratmice
Copy link
Contributor

ratmice commented Sep 14, 2022

Going to break my comment in the above PR out here:

The PR above fixed some of these issues, but probably not enough for CJK.
I imagine what is needed should be fairly simple, I just don't know the right things to do regarding whitespace on CJK.
I think all that is left should be something to the effect of:

  1. Add a flag to Config builder for enabling CJK.
  2. When that flag is enabled call width_cjk instead of width from unicode_width from lib.rs: char_width().
  3. Figure out what the width should be for whitespace where it is currently hard coded to 1?

@TRCYX
Copy link
Contributor

TRCYX commented Jan 25, 2023

As a Chinese, I have not seen "ambiguous characters" rendering wide for years. The width_cjk thing might not be necessary. From the documentation of width_cjk, I can refer to UAX#11, which says that modern practice tend to render ambiguous characters narrow. The UAX recommends that we only render them wide when we reliably know that the context wants them to be wide. At least from what I see, this does not happen a lot nowadays, especially when related to formal languages where an error could be pointed out.

But the hardcoded width of whitespace is a problem, for there is at least a wide space:  . The current approach renders it as a narrow space. I am not sure about why all whitespace converts to \x20 with width 1 in char_width:

ariadne/src/lib.rs

Lines 401 to 413 in 06b1748

// Find the character that should be drawn and the number of times it should be drawn for each char
fn char_width(&self, c: char, col: usize) -> (char, usize) {
match c {
'\t' => {
// Find the column that the tab should end at
let tab_end = (col / self.tab_width + 1) * self.tab_width;
(' ', tab_end - col)
},
c if c.is_whitespace() => (' ', 1),
_ => (c, c.width().unwrap_or(1)),
}
}
}

Regarding this, I come up with two possible solutions.

  1. We could simply remove this case and make a whitespace character represent itself. Its width would be correctly calculated. But in write.rs:

    ariadne/src/write.rs

    Lines 585 to 599 in 06b1748

    for (col, c) in line.chars().enumerate() {
    let color = if let Some(highlight) = get_highlight(col) {
    highlight.color
    } else {
    self.config.unimportant_color()
    };
    let (c, width) = self.config.char_width(c, col);
    if c.is_whitespace() {
    for _ in 0..width {
    write!(w, "{}", c.fg(color, s))?;
    }
    } else {
    write!(w, "{}", c.fg(color, s))?;
    };
    }

    we do not render whitespaces multiple times; instead we could test if the original c before calling char_width was \t, or we could repeat only \x20 instead of all whitespaces.
  2. Or we keep converting whitespaces to \x20, but calculate its width instead of filling in 1:
            c if c.is_whitespace() => (' ', c.width().unwrap_or(1)),

Both solutions should work well with  , but I believe a clarification about the point of converting whitespaces to \x20 is needed. I prefer the former, since it actually keeps that wide space. If the terminal shows, for example, a dot, where there is a space, the user could possibly see some differences.

@ratmice
Copy link
Contributor

ratmice commented Jan 25, 2023

I prefer the former, since it actually keeps that wide space. If the terminal shows, for example, a dot, where there is a space, the user could possibly see some differences.

Might just be useless trivia that won't ocur in practice, but another corner case/point in favor of doing that is U+1680 Ogham space mark, a character with a visible representation that is_whitespace(). https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=cb4c89643c03d3557d336623b242912f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants