fix(line): panic in truncated due to byte index not being a char boundary #1055

Thorongil80 · 2024-04-22T08:54:54Z

Hi,

I think this will fix #1032

One needs to iterate chars() instead of slicing into the &str content itself. Obvious Rust thing with relation to UTF-8.

Added some truncation tests truncating UTF-8 strings left- and right-centered, even and odd numbers.

If there is a more performant way for things to do, please adjust accordingly.

Regards,
Thomas

codecov · 2024-04-22T09:14:37Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.4%. Comparing base (f4637d4) to head (a844023).

❗ Current head a844023 differs from pull request most recent head de9cd6a. Consider uploading reports for the commit de9cd6a to get more accurate results

Additional details and impacted files

@@          Coverage Diff          @@
##            main   #1055   +/-   ##
=====================================
  Coverage   89.4%   89.4%           
=====================================
  Files         61      61           
  Lines      15468   15503   +35     
=====================================
+ Hits       13839   13874   +35     
  Misses      1629    1629

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

EdJoPaTo

I like the added test cases and change of the relevant lines.

I added some simplification / improvement hints.

(I think the general code in this method is overcomplicated and should be simplified. Some math stuff should be reused like the check for max length. Adding offset and removing offset directly afterward resulting in pointless operations. And so on. But that's something for its own issue/pull request)

Performance wise… this always clones the content. When there is no need to truncate the span then reuse it directly without any clone. Not sure whether that should be part of this PR or its own PR too.

src/text/line.rs

Co-authored-by: EdJoPaTo <github@edjopato.de>

Thorongil80 · 2024-04-22T09:59:33Z

Suggestions are reasonable and I applied them all. Many thanks!

joshka

I suspect that this will still break on characters that are wider than a single cell (e.g. the crab character is a good one to test with). To fix this will require looking at every character in .chars(), checking the Unicode width of the char, and keeping track of the number of characters so far.

There's a few edge cases:

start position is halfway through a character - skip it
end position is halfway through a character - skip it
unit tests should show that there's no off by one issue on these boundary conditions

+1 to @EdJoPaTo's comment that this code is a bit complicated. Adding the right unit tests that clarify the behavior should help make it easy to refactor.

Thorongil80 · 2024-04-22T16:57:02Z

@joshka I am not sure why you want to inspect the Unicode length of the characters in char(), regardless of the unicode length, they should print as one character on screen. If skipping and taking the desired count of characters in char(), why should it abort? It cannot abort because of cutting into a UTF-8 character. That is taken care of by char() iterator?

joshka · 2024-04-23T03:10:47Z

@joshka I am not sure why you want to inspect the Unicode length of the characters in char(), regardless of the unicode length, they should print as one character on screen. If skipping and taking the desired count of characters in char(), why should it abort? It cannot abort because of cutting into a UTF-8 character. That is taken care of by char() iterator?

The specific code here is used to align the line based on the characters. Characters take up from zero to many columns when displayed. There were two incorrect assumptions made when implementing this:

Characters are a single byte
Characters take up a single cell when displayed.

You fixed the first point, but the second governs how text will be aligned. If there's four double width characters in a right aligned line, rendering into 10 wide area, the line should start at (10-4*2)=2, not (10-4)=6

There's a unit test in #1032 which should cover part of that (the one with the crabs). I'm not sure if the crab is double width or single off the top of my head.

Thorongil80 · 2024-04-23T05:10:04Z

@joshka Sorry I was not aware that there actually are characters that print to more than one cell in the terminal. But yes, it makes sense.

The first wrong assumption causes the application actually to panic, the second assumption causes the output of truncated() to be too long. I do not know if this would lead to a panic.

But fixing the second wrong assumption means at least keeping track of the output string while iterating (if one goes the simple way of cutting the truncated string early). But to do it proper for right-aligned or center-aligned would even require a two-pass approach to find the proper cut position.

I currently do not have an idea how to do this properly and efficient.

Thorongil80 · 2024-04-23T05:19:41Z

I found this

https://docs.rs/unicode-truncate/latest/unicode_truncate/

joshka · 2024-04-24T05:00:31Z

Feel free to close this PR, as may changes are only half of the cake anyways.

I'm fine with keeping this open if you're motivated to continue with it once the truncation crate releases.

Thorongil80 · 2024-04-24T05:20:42Z

Yeah sure 👍🏼

bugnano · 2024-04-24T05:31:01Z

If that crate solves this problem right, then we should definitely use it (though I'd suggest still making sure to have a small set of unit tests that make sure the behavior is right).

Looks like that crate implements the truncating algorithm correctly, and its test suite takes into account characters that have a width other than 1, so I'm all for using that crate

Thorongil80 · 2024-04-24T07:31:29Z

Hm, but as a line contains multiple spans and they must be retained (styling, etc.), I am not sure we can simply use the mentioned crate. We would still need to determine for each span how it must be cut down. As the fn in the crate operates on a UTF-8 string as a whole.

EdJoPaTo · 2024-04-24T08:09:10Z

Yeah, the currently existing logic of handling the spans is still required. The crate only handles what’s inside a span.

Neat. I expected that given the age of the crate, a PR might stagnate (crates without releases in the past ~6-12 months tend to suffer that problem), but that was surprisingly quick!

yeah, happens often but in this case the author did something similar with the last PR so I was hopeful there. (Thank you @Aetf :) )
And when the PR isn’t merged the code isn’t lost and can be integrated so there isn’t any work lost.

Feel free to close this PR, as may changes are only half of the cake anyways.

This PR isn’t useless and shouldn’t be closed. It includes the relevant tests and for its implementation we noticed it isn’t optimal yet so we keep on improving it until we are fine with it. That’s part of the PR. They aren’t always perfect from the start. So thank you @Thorongil80 for providing the PR and improving it :)

We will get there :)

FedericoBruzzone · 2024-04-25T12:40:20Z

Hello guys.
I'm working on a tui for telegram (link), and I came across this problem and so I was preparing a PR.
Now, having seen this discussion, I am pointing to @Thorongil80 's repository from cargo, waiting for a new release with this problem solved.

I can confirm that unicode characters of length 1 and 2 work.

EdJoPaTo · 2024-04-27T05:30:06Z

I played around with the mentioned crate and added a center alignment truncation which should be able to solve this. We could also integrate this into ratatui, but I think having this as a dedicated crate is also a good idea. Also, gitui seems to depend on it already.

Ratatui would only need to apply this to each of the spans then.

~~I am also interested in reviews over there at my PR when anyone is interested.~~ Got merged already.

and it got released with unicode-truncate 1.0.0: https://github.com/Aetf/unicode-truncate/releases/tag/v1.0.0

Thorongil80 · 2024-04-29T09:07:25Z

Mother of christ, pushing runs the whole test suite.. Nice ;-)

Thorongil80 · 2024-04-29T09:48:01Z

Hi folks,

I exchanged the lines with calls to unicode_truncate() and unicode_truncate_start() from the unicode_truncate crate.

I note there is no function unicode_truncate_range() or something that cuts at the start and the end. So in the last loop iteration, where our offset and span length traverse output_width two calls are necessary. But this is only in one loop iteration, it breaks the loop afterwards.

The convenient functions @EdJoPaTo added over there for truncating in alignment we cannot really use as we still have to deal with spans.

The code determining the offset relative to alignment is using widths calculated already by unicode_width functions.

I don't know if some comments would help in this code, it takes some time to understand.

Cheers,
Thomas

joshka

Without more context, it's difficult to understand what the tests are actually testing. What is even / odd? Why is it relevant to the tests? What specifically about utf8 is being tested here? I sort of understand the intent from the PR / bug (take a string that has characters that don't fall on byte indexes and truncate it correctly). I'm not sure I follow how the tests test that though (there's a missing link that's not obvious). I'm certain someone looking at this code in a few years time wouldn't be able to discern the intent.

I'd suggest making the multi-byte characters more obvious / documented in the test - it's reasonable to test this with a single byte numeric characters for the positions that don't matter (12345x6789) and use a specifically known multi-byte character only in the test position (like the crab emoji in #1032).

Additionally, it's worth remembering that there's two things to check here that will both cause a panic:

If we're truncating 10 chars to fit in a buffer that is 4 characters wide, and one of those characters takes up two cells, what happens?
If we're truncating 10 characters to fit into a buffer that is 4 characters wide and the 4th character takes up two bytes, what happens?

Additionally, a technique that's really really useful (and encouraged) when fixing a bug in released software is to make sure that the the specific bug is tested (or a minimal repro of it). Here the bug is that the truncation causes a panic due to rendering to a buffer area that is smaller than the truncated value. So the bug is felt by the user calling the render function, not by calling the truncate function. Testing truncate is good as a unit test, but make sure to include regression test(s) like the one in #1032 (Add a link to the Issue as well to highlight that the test is a regression test).

joshka · 2024-04-29T10:07:48Z

I don't know if some comments would help in this code, it takes some time to understand.

At 30 lines, it was above my taste for methods without explanation. At 40, it's well above that level. This is mainly because with every if statement / calculation comes a design decision. When the intent of those decisions isn't captured all we know is that the code acts a certain way because it acts a certain way. The code is now the spec, and understanding how the code is suppose to work requires reading every line of the code, keeping the state of the code in your head when reading it This is one of the general problems with iterative imperative code like this, it doesn't allow you to abstract checkpoint a partial state. </rant>

I haven't spent time reading this yet, but I suspect there's some refactoring / reorganizing that could make it much clearer. That might be variable names, smaller named partial calculations, functions, etc. Perhaps start with the tests to ensure that you've got enough confidence that changing this doesn't break things, and then take a crack at addressing that. I'll take a look at doing this in the next couple of days if you're not able to simplify.

Thorongil80 · 2024-04-29T10:08:40Z

Without more context, it's difficult to understand what the tests are actually testing. What is even / odd? Why is it relevant to the tests? What specifically about utf8 is being tested here? I sort of understand the intent from the PR / bug (take a string that has characters that don't fall on byte indexes and truncate it correctly). I'm not sure I follow how the tests test that though (there's a missing link that's not obvious). I'm certain someone looking at this code in a few years time wouldn't be able to discern the intent.

I'd suggest making the multi-byte characters more obvious / documented in the test - it's reasonable to test this with a single byte numeric characters for the positions that don't matter (12345x6789) and use a specifically known multi-byte character only in the test position (like the crab emoji in #1032).

Additionally, it's worth remembering that there's two things to check here that will both cause a panic:

If we're truncating 10 chars to fit in a buffer that is 4 characters wide, and one of those characters takes up two cells, what happens?

If we're truncating 10 characters to fit into a buffer that is 4 characters wide and the 4th character takes up two bytes, what happens?

Additionally, a technique that's really really useful (and encouraged) when fixing a bug in released software is to make sure that the the specific bug is tested (or a minimal repro of it). Here the bug is that the truncation causes a panic due to rendering to a buffer area that is smaller than the truncated value. So the bug is felt by the user calling the render function, not by calling the truncate function. Testing truncate is good as a unit test, but make sure to include regression test(s) like the one in #1032 (Add a link to the Issue as well to highlight that the test is a regression test).

I agree, will work out the tests better.

The panic in #1032 was truncating the &str at a "forbidden position", not during rendering. The "even" and "odd" cutting means that in a str composed only of UTF-8 symbols encoded with two characters, guaranteed one of the both functions would surely hit a UTF-8 symbol at a position not allowed for cutting if implemented in the way it was before. The panic was not caused by rendering to a too small buffer.

joshka · 2024-04-29T10:24:32Z

The panic in #1032 was truncating the &str at a "forbidden position", not during rendering.

The truncate method is only called from the render function. It's not a public function.

The panic was not caused by rendering to a too small buffer.

Yes, you're right, however it's really easy to fix one part of this problem and break the other, as multi-byte characters tend to have a bit of overlap with mutlli-width characters.

The "even" and "odd" cutting means that in a str composed only of UTF-8 symbols encoded with two characters, guaranteed one of the both functions would surely hit a UTF-8 symbol at a position not allowed for cutting if implemented in the way it was before.

Got it. Now I understand why you did that, but definitely think using a single multi-byte character is a simplification of the test. I'd be inclined to test that this fix works before, inside and after the character, not just before / inside.

Thorongil80 · 2024-04-29T10:37:19Z

Is Multi-Width.-Character an issue here? All widths are calculated by unicode_width crate, as I understand. Maybe we need to devise a test specifically for this. But I guess it is solved.

joshka · 2024-04-29T11:57:21Z

Is Multi-Width.-Character an issue here? All widths are calculated by unicode_width crate, as I understand. Maybe we need to devise a test specifically for this. But I guess it is solved.

The truncation feature was added in 0.26.2 without considering either multi-byte or multi-width characters. I'm not confident that either case is correctly handled, and there are no tests for either that would provide that confidence. When truncating a string with multi-width characters, the truncation has to result in a string that can be rendered into the space that it will fit. If it doesn't, then it's possible that this will cause a panic by writing beyond the extent of the buffer (specifically on the right side, but I'd guess it's possible to trigger a panic with right aligned / centered text too).

Thorongil80 · 2024-04-30T07:02:32Z

Indeed I found issues the code misbehaves. I added crabs.

Truncating right-aligned with the crab in the right-most-position: the crab disappears. Should not be as the text is supposed to be right-aligned.

Truncating right-aligned with the crab in the left-most-position: the crab is still rendered and the character in the right-most-position is omitted instead.

So this code above pretends to truncate right-aligned or centered, but it seems to do this only for strings that do not contain double-width characters. As one can see from the code, of course, the operations on the spans are always left-aligned truncations, so we do not get any benefits from using the external crate that can do this properly.

The algorithm needs to be reversed for right-alignment, so if one wants to retain the algorithm that works on the spans, they must be processed "from the right" or "from the left" accordingly and the truncation operations on the spans must cut from the left or from the right.

If Centering, we then must decide on a preference. Are we "centering and omitting double-width-overflow on the right" or "centering and omitting double-width-overflow on the left" or are we "centering and omitting doible-width-overflow accordingly on both sides with integer division tending to the left" or something.

This is ... disturbing ;-)

joshka · 2024-04-30T12:36:02Z

For which way to center, there's an existing convention that likely shows up in tests somewhere that should be extended (or maintained). I recall having a conversation about it in an issue a while back.

Thorongil80 · 2024-05-02T06:23:18Z

I am at a loss. I tried to refactor. Introduce helper functions for the three types of truncation.

   /// For truncating, broken down into three different modes, result_width is the display buffer
    /// width, i.e. double-width characters that take two cells in the display are considered to be
    /// of length two
    fn truncate_left(&self, result_width: usize) -> Line {
        let mut truncated_line = Line::default();
        let mut current_width: usize = 0;
        for span in &self.spans {
            if current_width + span.width() <= result_width {
                truncated_line.push_span(span.clone());
                current_width += span.width();
            } else {
                let (cut, len) = span.content.unicode_truncate(result_width - current_width);
                truncated_line.push_span(Span::styled(cut, span.style));
                current_width += len;
                break;
            }
        }
        truncated_line
    }

    /// For truncating, broken down into three different modes, result_width is the display buffer
    /// width, i.e. double-width characters that take two cells in the display are considered to be
    /// of length two
    fn truncate_right(&self, result_width: usize) -> Line {
        let mut truncated_line = Line::default();
        let mut current_width: usize = 0;
        for span in self.spans.iter().rev() {
            if current_width + span.width() <= result_width {
                truncated_line.spans.insert(0, span.clone());
                current_width += span.width();
            } else {
                let (cut, len) = span
                    .content
                    .unicode_truncate_aligned(result_width - current_width, Right);
                truncated_line
                    .spans
                    .insert(0, Span::styled(cut, span.style));
                current_width += len;
                break;
            }
        }
        truncated_line
    }

Both cut the line (w single span) "Hello world" when called with "5" to "Hello". But I do not see why, the truncate_right does everything from the right end of the line.

How can I get println!() to actually show up in the "cargo make test"?

EdJoPaTo · 2024-05-02T14:30:19Z

How can I get println!() to actually show up in the "cargo make test"?

Tests only print their stdout when they fail so adding a todo!() or something like that in the end shows their output

kdheepak · 2024-05-02T19:25:40Z

You can also add -- --nocapture and you'll be able to stdout.

I like to use this incantation when I'm running any test function name that contains the string test_render:

cargo test --lib test_render -- --nocapture

--lib only runs tests that are part of the ratatui library, so no tests from the tests folder
test_render is the string matches for in the function names of all the tests
-- --nocapture prints everything to stdout

joshka · 2024-05-06T07:35:35Z

I took a stab at a different implementation of this in #1089

I removed the truncate method, instead rendering the spans one by one, and added a Span::split_at_width() method to help split left/right spans.

My priority is on fixing the panics to help the downstream apps not fail here.

FedericoBruzzone · 2024-05-19T19:33:15Z

@joshka awesome!

I was considering proceeding with this work but...
Can you confirm that it makes sense to close this pull request and that you will fix this problem in the next release?

joshka · 2024-05-19T23:31:50Z

Yep. This should be out later today

joshka · 2024-05-21T06:31:58Z

Fixed in #1089

Thorongil80 added 2 commits April 22, 2024 10:30

Update line.rs

aa72aa5

Add UTF-8 truncation tests

f3a5333

Thorongil80 requested review from orhun, mindoodoo, sayanarijit, joshka, kdheepak, Valentin271 and EdJoPaTo as code owners April 22, 2024 08:54

Thorongil80 changed the title ~~Potentially fix~~ Potentially fix issue #1032, "panicked in Line::truncated method due to byte index not being a char boundary" Apr 22, 2024

Thorongil80 changed the title ~~Potentially fix issue #1032, "panicked in Line::truncated method due to byte index not being a char boundary"~~ fix: issue #1032, "panicked in Line::truncated method due to byte index not being a char boundary" Apr 22, 2024

Thorongil80 added 3 commits April 22, 2024 11:04

Formatting fixes

635e742

formatting

9fe01d9

formatting

ec16e21

EdJoPaTo requested changes Apr 22, 2024

View reviewed changes

src/text/line.rs Outdated Show resolved Hide resolved

src/text/line.rs Outdated Show resolved Hide resolved

src/text/line.rs Outdated Show resolved Hide resolved

src/text/line.rs Outdated Show resolved Hide resolved

EdJoPaTo changed the title ~~fix: issue #1032, "panicked in Line::truncated method due to byte index not being a char boundary"~~ fix(line): panic in truncated due to byte index not being a char boundary Apr 22, 2024

Thorongil80 and others added 4 commits April 22, 2024 11:57

Commit suggestion

3290bdc

Co-authored-by: EdJoPaTo <github@edjopato.de>

Update src/text/line.rs

d285120

Co-authored-by: EdJoPaTo <github@edjopato.de>

Update src/text/line.rs

1769292

Co-authored-by: EdJoPaTo <github@edjopato.de>

Update src/text/line.rs

0a4269f

Co-authored-by: EdJoPaTo <github@edjopato.de>

Thorongil80 requested a review from EdJoPaTo April 22, 2024 10:01

Thorongil80 added 2 commits April 22, 2024 12:06

typo

634e6f9

format

4324737

joshka reviewed Apr 22, 2024

View reviewed changes

Merge branch 'main' into main

2d0bc94

Thorongil80 and others added 2 commits April 29, 2024 09:09

Merge branch 'main' into main

4000273

use unicode_truncate for truncating

e12cc2e

formatting

a844023

joshka reviewed Apr 29, 2024

View reviewed changes

formatting

de9cd6a

joshka closed this May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(line): panic in truncated due to byte index not being a char boundary #1055

fix(line): panic in truncated due to byte index not being a char boundary #1055

Thorongil80 commented Apr 22, 2024

codecov bot commented Apr 22, 2024 •

edited

EdJoPaTo left a comment

Thorongil80 commented Apr 22, 2024

joshka left a comment

Thorongil80 commented Apr 22, 2024

joshka commented Apr 23, 2024

Thorongil80 commented Apr 23, 2024 •

edited

Thorongil80 commented Apr 23, 2024

joshka commented Apr 24, 2024

Thorongil80 commented Apr 24, 2024

bugnano commented Apr 24, 2024

Thorongil80 commented Apr 24, 2024

EdJoPaTo commented Apr 24, 2024

FedericoBruzzone commented Apr 25, 2024 •

edited

EdJoPaTo commented Apr 27, 2024

Thorongil80 commented Apr 29, 2024

Thorongil80 commented Apr 29, 2024 •

edited

joshka left a comment

joshka commented Apr 29, 2024

Thorongil80 commented Apr 29, 2024

joshka commented Apr 29, 2024 •

edited

Thorongil80 commented Apr 29, 2024

joshka commented Apr 29, 2024

Thorongil80 commented Apr 30, 2024 •

edited

joshka commented Apr 30, 2024

Thorongil80 commented May 2, 2024 •

edited

EdJoPaTo commented May 2, 2024

kdheepak commented May 2, 2024

joshka commented May 6, 2024 •

edited

FedericoBruzzone commented May 19, 2024

joshka commented May 19, 2024

joshka commented May 21, 2024

fix(line): panic in truncated due to byte index not being a char boundary #1055

fix(line): panic in truncated due to byte index not being a char boundary #1055

Conversation

Thorongil80 commented Apr 22, 2024

codecov bot commented Apr 22, 2024 • edited

Codecov Report

EdJoPaTo left a comment

Choose a reason for hiding this comment

Thorongil80 commented Apr 22, 2024

joshka left a comment

Choose a reason for hiding this comment

Thorongil80 commented Apr 22, 2024

joshka commented Apr 23, 2024

Thorongil80 commented Apr 23, 2024 • edited

Thorongil80 commented Apr 23, 2024

joshka commented Apr 24, 2024

Thorongil80 commented Apr 24, 2024

bugnano commented Apr 24, 2024

Thorongil80 commented Apr 24, 2024

EdJoPaTo commented Apr 24, 2024

FedericoBruzzone commented Apr 25, 2024 • edited

EdJoPaTo commented Apr 27, 2024

Thorongil80 commented Apr 29, 2024

Thorongil80 commented Apr 29, 2024 • edited

joshka left a comment

Choose a reason for hiding this comment

joshka commented Apr 29, 2024

Thorongil80 commented Apr 29, 2024

joshka commented Apr 29, 2024 • edited

Thorongil80 commented Apr 29, 2024

joshka commented Apr 29, 2024

Thorongil80 commented Apr 30, 2024 • edited

joshka commented Apr 30, 2024

Thorongil80 commented May 2, 2024 • edited

EdJoPaTo commented May 2, 2024

kdheepak commented May 2, 2024

joshka commented May 6, 2024 • edited

FedericoBruzzone commented May 19, 2024

joshka commented May 19, 2024

joshka commented May 21, 2024

codecov bot commented Apr 22, 2024 •

edited

Thorongil80 commented Apr 23, 2024 •

edited

FedericoBruzzone commented Apr 25, 2024 •

edited

Thorongil80 commented Apr 29, 2024 •

edited

joshka commented Apr 29, 2024 •

edited

Thorongil80 commented Apr 30, 2024 •

edited

Thorongil80 commented May 2, 2024 •

edited

joshka commented May 6, 2024 •

edited