Adding support for full-width characters #41

saihaze · 2022-07-29T01:18:12Z

When the source code full-width characters (e.g. CJK characters), underlining is not proper. They might need too spaces.

zesterer · 2022-07-29T07:56:20Z

Do you have a screenshot + sample text example of this?

saihaze · 2022-08-02T07:49:38Z

Let's take examples/simple.rs as an example. It is supposed to be like:

But when it includes a Chinese character, it looks like this:

zesterer · 2022-08-02T09:14:57Z

How much did you increment the spans by when doing so? And does Rust's .chars() count the character as a single character?

saihaze · 2022-08-02T13:49:20Z

Usually a CJK character (a character used in China, Japan and Korea) takes the space of two ASCII characters. Rust's .chars() does count it as a single character. Perhaps you can use crate unicode_width instead.
I've never seen a wider character, though. But maybe you can take them into account, too.

saihaze · 2022-08-02T13:51:21Z

This is my own diagnostics module. It is simple, but might be an example.

zesterer · 2022-08-03T09:21:53Z

I'll put it on the list of things to do when I get round to refactoring the crate's internals. Contributions are also welcome of course!

ratmice · 2022-09-14T16:03:05Z

Going to break my comment in the above PR out here:

The PR above fixed some of these issues, but probably not enough for CJK.
I imagine what is needed should be fairly simple, I just don't know the right things to do regarding whitespace on CJK.
I think all that is left should be something to the effect of:

Add a flag to Config builder for enabling CJK.
When that flag is enabled call width_cjk instead of width from unicode_width from lib.rs: char_width().
Figure out what the width should be for whitespace where it is currently hard coded to 1?

TRCYX · 2023-01-25T19:33:55Z

As a Chinese, I have not seen "ambiguous characters" rendering wide for years. The width_cjk thing might not be necessary. From the documentation of width_cjk, I can refer to UAX#11, which says that modern practice tend to render ambiguous characters narrow. The UAX recommends that we only render them wide when we reliably know that the context wants them to be wide. At least from what I see, this does not happen a lot nowadays, especially when related to formal languages where an error could be pointed out.

But the hardcoded width of whitespace is a problem, for there is at least a wide space: 　. The current approach renders it as a narrow space. I am not sure about why all whitespace converts to \x20 with width 1 in char_width:

ariadne/src/lib.rs

Lines 401 to 413 in 06b1748

    
               // Find the character that should be drawn and the number of times it should be drawn for each char 
        
               fn char_width(&self, c: char, col: usize) -> (char, usize) { 
        
                   match c { 
        
                       '\t' => { 
        
                           // Find the column that the tab should end at 
        
                           let tab_end = (col / self.tab_width + 1) * self.tab_width; 
        
                           (' ',  tab_end - col) 
        
                       }, 
        
                       c if c.is_whitespace() => (' ', 1), 
        
                       _ => (c, c.width().unwrap_or(1)), 
        
                   } 
        
               } 
        
           }

Regarding this, I come up with two possible solutions.

We could simply remove this case and make a whitespace character represent itself. Its width would be correctly calculated. But in write.rs:

ariadne/src/write.rs

Lines 585 to 599 in 06b1748

    
           for (col, c) in line.chars().enumerate() { 
        
               let color = if let Some(highlight) = get_highlight(col) { 
        
                   highlight.color 
        
               } else { 
        
                   self.config.unimportant_color() 
        
               }; 
        
               let (c, width) = self.config.char_width(c, col); 
        
               if c.is_whitespace() { 
        
                   for _ in 0..width { 
        
                       write!(w, "{}", c.fg(color, s))?; 
        
                   } 
        
               } else { 
        
                   write!(w, "{}", c.fg(color, s))?; 
        
               }; 
        
           }

we do not render whitespaces multiple times; instead we could test if the original c before calling char_width was \t, or we could repeat only \x20 instead of all whitespaces.

Or we keep converting whitespaces to \x20, but calculate its width instead of filling in 1:

            c if c.is_whitespace() => (' ', c.width().unwrap_or(1)),

Both solutions should work well with 　, but I believe a clarification about the point of converting whitespaces to \x20 is needed. I prefer the former, since it actually keeps that wide space. If the terminal shows, for example, a dot, where there is a space, the user could possibly see some differences.

ratmice · 2023-01-25T20:05:32Z

I prefer the former, since it actually keeps that wide space. If the terminal shows, for example, a dot, where there is a space, the user could possibly see some differences.

Might just be useless trivia that won't ocur in practice, but another corner case/point in favor of doing that is U+1680 Ogham space mark, a character with a visible representation that is_whitespace(). https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=cb4c89643c03d3557d336623b242912f

ratmice mentioned this issue Sep 9, 2022

Add support for unicode width #46

Merged

brockelmore mentioned this issue Apr 18, 2023

Allow Spans to be byte offset based instead of char offset based #71

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding support for full-width characters #41

Adding support for full-width characters #41

saihaze commented Jul 29, 2022

zesterer commented Jul 29, 2022

saihaze commented Aug 2, 2022

zesterer commented Aug 2, 2022

saihaze commented Aug 2, 2022

saihaze commented Aug 2, 2022 •

edited

zesterer commented Aug 3, 2022

ratmice commented Sep 14, 2022

TRCYX commented Jan 25, 2023

ratmice commented Jan 25, 2023

Adding support for full-width characters #41

Adding support for full-width characters #41

Comments

saihaze commented Jul 29, 2022

zesterer commented Jul 29, 2022

saihaze commented Aug 2, 2022

zesterer commented Aug 2, 2022

saihaze commented Aug 2, 2022

saihaze commented Aug 2, 2022 • edited

zesterer commented Aug 3, 2022

ratmice commented Sep 14, 2022

TRCYX commented Jan 25, 2023

ratmice commented Jan 25, 2023

saihaze commented Aug 2, 2022 •

edited