Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(line): panic in truncated due to byte index not being a char boundary #1055

Closed
wants to merge 16 commits into from

Conversation

Thorongil80
Copy link

Hi,

I think this will fix #1032

One needs to iterate chars() instead of slicing into the &str content itself. Obvious Rust thing with relation to UTF-8.

Added some truncation tests truncating UTF-8 strings left- and right-centered, even and odd numbers.

If there is a more performant way for things to do, please adjust accordingly.

Regards,
Thomas

@Thorongil80 Thorongil80 changed the title Potentially fix Potentially fix issue #1032, "panicked in Line::truncated method due to byte index not being a char boundary" Apr 22, 2024
@Thorongil80 Thorongil80 changed the title Potentially fix issue #1032, "panicked in Line::truncated method due to byte index not being a char boundary" fix: issue #1032, "panicked in Line::truncated method due to byte index not being a char boundary" Apr 22, 2024
Copy link

codecov bot commented Apr 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.4%. Comparing base (f4637d4) to head (a844023).

❗ Current head a844023 differs from pull request most recent head de9cd6a. Consider uploading reports for the commit de9cd6a to get more accurate results

Additional details and impacted files
@@          Coverage Diff          @@
##            main   #1055   +/-   ##
=====================================
  Coverage   89.4%   89.4%           
=====================================
  Files         61      61           
  Lines      15468   15503   +35     
=====================================
+ Hits       13839   13874   +35     
  Misses      1629    1629           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@EdJoPaTo EdJoPaTo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the added test cases and change of the relevant lines.

I added some simplification / improvement hints.

(I think the general code in this method is overcomplicated and should be simplified. Some math stuff should be reused like the check for max length. Adding offset and removing offset directly afterward resulting in pointless operations. And so on. But that's something for its own issue/pull request)

Performance wise… this always clones the content. When there is no need to truncate the span then reuse it directly without any clone. Not sure whether that should be part of this PR or its own PR too.

src/text/line.rs Outdated Show resolved Hide resolved
src/text/line.rs Outdated Show resolved Hide resolved
src/text/line.rs Outdated Show resolved Hide resolved
src/text/line.rs Outdated Show resolved Hide resolved
@EdJoPaTo EdJoPaTo changed the title fix: issue #1032, "panicked in Line::truncated method due to byte index not being a char boundary" fix(line): panic in truncated due to byte index not being a char boundary Apr 22, 2024
Thorongil80 and others added 4 commits April 22, 2024 11:57
Co-authored-by: EdJoPaTo <github@edjopato.de>
Co-authored-by: EdJoPaTo <github@edjopato.de>
Co-authored-by: EdJoPaTo <github@edjopato.de>
Co-authored-by: EdJoPaTo <github@edjopato.de>
@Thorongil80
Copy link
Author

Suggestions are reasonable and I applied them all. Many thanks!

Copy link
Member

@joshka joshka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that this will still break on characters that are wider than a single cell (e.g. the crab character is a good one to test with). To fix this will require looking at every character in .chars(), checking the Unicode width of the char, and keeping track of the number of characters so far.

There's a few edge cases:

  • start position is halfway through a character - skip it
  • end position is halfway through a character - skip it
  • unit tests should show that there's no off by one issue on these boundary conditions

+1 to @EdJoPaTo's comment that this code is a bit complicated. Adding the right unit tests that clarify the behavior should help make it easy to refactor.

@Thorongil80
Copy link
Author

@joshka I am not sure why you want to inspect the Unicode length of the characters in char(), regardless of the unicode length, they should print as one character on screen. If skipping and taking the desired count of characters in char(), why should it abort? It cannot abort because of cutting into a UTF-8 character. That is taken care of by char() iterator?

@joshka
Copy link
Member

joshka commented Apr 23, 2024

@joshka I am not sure why you want to inspect the Unicode length of the characters in char(), regardless of the unicode length, they should print as one character on screen. If skipping and taking the desired count of characters in char(), why should it abort? It cannot abort because of cutting into a UTF-8 character. That is taken care of by char() iterator?

The specific code here is used to align the line based on the characters. Characters take up from zero to many columns when displayed. There were two incorrect assumptions made when implementing this:

  1. Characters are a single byte
  2. Characters take up a single cell when displayed.

You fixed the first point, but the second governs how text will be aligned. If there's four double width characters in a right aligned line, rendering into 10 wide area, the line should start at (10-4*2)=2, not (10-4)=6

There's a unit test in #1032 which should cover part of that (the one with the crabs). I'm not sure if the crab is double width or single off the top of my head.

@Thorongil80
Copy link
Author

Thorongil80 commented Apr 23, 2024

@joshka Sorry I was not aware that there actually are characters that print to more than one cell in the terminal. But yes, it makes sense.

The first wrong assumption causes the application actually to panic, the second assumption causes the output of truncated() to be too long. I do not know if this would lead to a panic.

But fixing the second wrong assumption means at least keeping track of the output string while iterating (if one goes the simple way of cutting the truncated string early). But to do it proper for right-aligned or center-aligned would even require a two-pass approach to find the proper cut position.

I currently do not have an idea how to do this properly and efficient.

@Thorongil80
Copy link
Author

@joshka
Copy link
Member

joshka commented Apr 24, 2024

Feel free to close this PR, as may changes are only half of the cake anyways.

I'm fine with keeping this open if you're motivated to continue with it once the truncation crate releases.

@Thorongil80
Copy link
Author

Yeah sure 👍🏼

@bugnano
Copy link

bugnano commented Apr 24, 2024

If that crate solves this problem right, then we should definitely use it (though I'd suggest still making sure to have a small set of unit tests that make sure the behavior is right).

Looks like that crate implements the truncating algorithm correctly, and its test suite takes into account characters that have a width other than 1, so I'm all for using that crate

@Thorongil80
Copy link
Author

Hm, but as a line contains multiple spans and they must be retained (styling, etc.), I am not sure we can simply use the mentioned crate. We would still need to determine for each span how it must be cut down. As the fn in the crate operates on a UTF-8 string as a whole.

@EdJoPaTo
Copy link
Member

Yeah, the currently existing logic of handling the spans is still required. The crate only handles what’s inside a span.

Neat. I expected that given the age of the crate, a PR might stagnate (crates without releases in the past ~6-12 months tend to suffer that problem), but that was surprisingly quick!

yeah, happens often but in this case the author did something similar with the last PR so I was hopeful there. (Thank you @Aetf :) )
And when the PR isn’t merged the code isn’t lost and can be integrated so there isn’t any work lost.

Feel free to close this PR, as may changes are only half of the cake anyways.

This PR isn’t useless and shouldn’t be closed. It includes the relevant tests and for its implementation we noticed it isn’t optimal yet so we keep on improving it until we are fine with it. That’s part of the PR. They aren’t always perfect from the start. So thank you @Thorongil80 for providing the PR and improving it :)

We will get there :)

@FedericoBruzzone
Copy link

FedericoBruzzone commented Apr 25, 2024

Hello guys.
I'm working on a tui for telegram (link), and I came across this problem and so I was preparing a PR.
Now, having seen this discussion, I am pointing to @Thorongil80 's repository from cargo, waiting for a new release with this problem solved.

I can confirm that unicode characters of length 1 and 2 work.

@EdJoPaTo
Copy link
Member

I played around with the mentioned crate and added a center alignment truncation which should be able to solve this. We could also integrate this into ratatui, but I think having this as a dedicated crate is also a good idea. Also, gitui seems to depend on it already.

Ratatui would only need to apply this to each of the spans then.

I am also interested in reviews over there at my PR when anyone is interested. Got merged already.

and it got released with unicode-truncate 1.0.0: https://github.com/Aetf/unicode-truncate/releases/tag/v1.0.0

@Thorongil80
Copy link
Author

Mother of christ, pushing runs the whole test suite.. Nice ;-)

@Thorongil80
Copy link
Author

Thorongil80 commented Apr 29, 2024

Hi folks,

I exchanged the lines with calls to unicode_truncate() and unicode_truncate_start() from the unicode_truncate crate.

I note there is no function unicode_truncate_range() or something that cuts at the start and the end. So in the last loop iteration, where our offset and span length traverse output_width two calls are necessary. But this is only in one loop iteration, it breaks the loop afterwards.

The convenient functions @EdJoPaTo added over there for truncating in alignment we cannot really use as we still have to deal with spans.

The code determining the offset relative to alignment is using widths calculated already by unicode_width functions.

I don't know if some comments would help in this code, it takes some time to understand.

Cheers,
Thomas

Copy link
Member

@joshka joshka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without more context, it's difficult to understand what the tests are actually testing. What is even / odd? Why is it relevant to the tests? What specifically about utf8 is being tested here? I sort of understand the intent from the PR / bug (take a string that has characters that don't fall on byte indexes and truncate it correctly). I'm not sure I follow how the tests test that though (there's a missing link that's not obvious). I'm certain someone looking at this code in a few years time wouldn't be able to discern the intent.

I'd suggest making the multi-byte characters more obvious / documented in the test - it's reasonable to test this with a single byte numeric characters for the positions that don't matter (12345x6789) and use a specifically known multi-byte character only in the test position (like the crab emoji in #1032).

Additionally, it's worth remembering that there's two things to check here that will both cause a panic:

  • If we're truncating 10 chars to fit in a buffer that is 4 characters wide, and one of those characters takes up two cells, what happens?
  • If we're truncating 10 characters to fit into a buffer that is 4 characters wide and the 4th character takes up two bytes, what happens?

Additionally, a technique that's really really useful (and encouraged) when fixing a bug in released software is to make sure that the the specific bug is tested (or a minimal repro of it). Here the bug is that the truncation causes a panic due to rendering to a buffer area that is smaller than the truncated value. So the bug is felt by the user calling the render function, not by calling the truncate function. Testing truncate is good as a unit test, but make sure to include regression test(s) like the one in #1032 (Add a link to the Issue as well to highlight that the test is a regression test).

@joshka
Copy link
Member

joshka commented Apr 29, 2024

I don't know if some comments would help in this code, it takes some time to understand.

At 30 lines, it was above my taste for methods without explanation. At 40, it's well above that level. This is mainly because with every if statement / calculation comes a design decision. When the intent of those decisions isn't captured all we know is that the code acts a certain way because it acts a certain way. The code is now the spec, and understanding how the code is suppose to work requires reading every line of the code, keeping the state of the code in your head when reading it This is one of the general problems with iterative imperative code like this, it doesn't allow you to abstract checkpoint a partial state. </rant>

I haven't spent time reading this yet, but I suspect there's some refactoring / reorganizing that could make it much clearer. That might be variable names, smaller named partial calculations, functions, etc. Perhaps start with the tests to ensure that you've got enough confidence that changing this doesn't break things, and then take a crack at addressing that. I'll take a look at doing this in the next couple of days if you're not able to simplify.

@Thorongil80
Copy link
Author

Without more context, it's difficult to understand what the tests are actually testing. What is even / odd? Why is it relevant to the tests? What specifically about utf8 is being tested here? I sort of understand the intent from the PR / bug (take a string that has characters that don't fall on byte indexes and truncate it correctly). I'm not sure I follow how the tests test that though (there's a missing link that's not obvious). I'm certain someone looking at this code in a few years time wouldn't be able to discern the intent.

I'd suggest making the multi-byte characters more obvious / documented in the test - it's reasonable to test this with a single byte numeric characters for the positions that don't matter (12345x6789) and use a specifically known multi-byte character only in the test position (like the crab emoji in #1032).

Additionally, it's worth remembering that there's two things to check here that will both cause a panic:

  • If we're truncating 10 chars to fit in a buffer that is 4 characters wide, and one of those characters takes up two cells, what happens?
  • If we're truncating 10 characters to fit into a buffer that is 4 characters wide and the 4th character takes up two bytes, what happens?

Additionally, a technique that's really really useful (and encouraged) when fixing a bug in released software is to make sure that the the specific bug is tested (or a minimal repro of it). Here the bug is that the truncation causes a panic due to rendering to a buffer area that is smaller than the truncated value. So the bug is felt by the user calling the render function, not by calling the truncate function. Testing truncate is good as a unit test, but make sure to include regression test(s) like the one in #1032 (Add a link to the Issue as well to highlight that the test is a regression test).

I agree, will work out the tests better.

The panic in #1032 was truncating the &str at a "forbidden position", not during rendering. The "even" and "odd" cutting means that in a str composed only of UTF-8 symbols encoded with two characters, guaranteed one of the both functions would surely hit a UTF-8 symbol at a position not allowed for cutting if implemented in the way it was before. The panic was not caused by rendering to a too small buffer.

@joshka
Copy link
Member

joshka commented Apr 29, 2024

The panic in #1032 was truncating the &str at a "forbidden position", not during rendering.

The truncate method is only called from the render function. It's not a public function.

The panic was not caused by rendering to a too small buffer.

Yes, you're right, however it's really easy to fix one part of this problem and break the other, as multi-byte characters tend to have a bit of overlap with mutlli-width characters.

The "even" and "odd" cutting means that in a str composed only of UTF-8 symbols encoded with two characters, guaranteed one of the both functions would surely hit a UTF-8 symbol at a position not allowed for cutting if implemented in the way it was before.

Got it. Now I understand why you did that, but definitely think using a single multi-byte character is a simplification of the test. I'd be inclined to test that this fix works before, inside and after the character, not just before / inside.

@Thorongil80
Copy link
Author

Is Multi-Width.-Character an issue here? All widths are calculated by unicode_width crate, as I understand. Maybe we need to devise a test specifically for this. But I guess it is solved.

@joshka
Copy link
Member

joshka commented Apr 29, 2024

Is Multi-Width.-Character an issue here? All widths are calculated by unicode_width crate, as I understand. Maybe we need to devise a test specifically for this. But I guess it is solved.

The truncation feature was added in 0.26.2 without considering either multi-byte or multi-width characters. I'm not confident that either case is correctly handled, and there are no tests for either that would provide that confidence. When truncating a string with multi-width characters, the truncation has to result in a string that can be rendered into the space that it will fit. If it doesn't, then it's possible that this will cause a panic by writing beyond the extent of the buffer (specifically on the right side, but I'd guess it's possible to trigger a panic with right aligned / centered text too).

@Thorongil80
Copy link
Author

Thorongil80 commented Apr 30, 2024

Indeed I found issues the code misbehaves. I added crabs.

Truncating right-aligned with the crab in the right-most-position: the crab disappears. Should not be as the text is supposed to be right-aligned.

Truncating right-aligned with the crab in the left-most-position: the crab is still rendered and the character in the right-most-position is omitted instead.

So this code above pretends to truncate right-aligned or centered, but it seems to do this only for strings that do not contain double-width characters. As one can see from the code, of course, the operations on the spans are always left-aligned truncations, so we do not get any benefits from using the external crate that can do this properly.

The algorithm needs to be reversed for right-alignment, so if one wants to retain the algorithm that works on the spans, they must be processed "from the right" or "from the left" accordingly and the truncation operations on the spans must cut from the left or from the right.

If Centering, we then must decide on a preference. Are we "centering and omitting double-width-overflow on the right" or "centering and omitting double-width-overflow on the left" or are we "centering and omitting doible-width-overflow accordingly on both sides with integer division tending to the left" or something.

This is ... disturbing ;-)

@joshka
Copy link
Member

joshka commented Apr 30, 2024

For which way to center, there's an existing convention that likely shows up in tests somewhere that should be extended (or maintained). I recall having a conversation about it in an issue a while back.

@Thorongil80
Copy link
Author

Thorongil80 commented May 2, 2024

I am at a loss. I tried to refactor. Introduce helper functions for the three types of truncation.

   /// For truncating, broken down into three different modes, result_width is the display buffer
    /// width, i.e. double-width characters that take two cells in the display are considered to be
    /// of length two
    fn truncate_left(&self, result_width: usize) -> Line {
        let mut truncated_line = Line::default();
        let mut current_width: usize = 0;
        for span in &self.spans {
            if current_width + span.width() <= result_width {
                truncated_line.push_span(span.clone());
                current_width += span.width();
            } else {
                let (cut, len) = span.content.unicode_truncate(result_width - current_width);
                truncated_line.push_span(Span::styled(cut, span.style));
                current_width += len;
                break;
            }
        }
        truncated_line
    }

    /// For truncating, broken down into three different modes, result_width is the display buffer
    /// width, i.e. double-width characters that take two cells in the display are considered to be
    /// of length two
    fn truncate_right(&self, result_width: usize) -> Line {
        let mut truncated_line = Line::default();
        let mut current_width: usize = 0;
        for span in self.spans.iter().rev() {
            if current_width + span.width() <= result_width {
                truncated_line.spans.insert(0, span.clone());
                current_width += span.width();
            } else {
                let (cut, len) = span
                    .content
                    .unicode_truncate_aligned(result_width - current_width, Right);
                truncated_line
                    .spans
                    .insert(0, Span::styled(cut, span.style));
                current_width += len;
                break;
            }
        }
        truncated_line
    }

Both cut the line (w single span) "Hello world" when called with "5" to "Hello". But I do not see why, the truncate_right does everything from the right end of the line.

How can I get println!() to actually show up in the "cargo make test"?

@EdJoPaTo
Copy link
Member

EdJoPaTo commented May 2, 2024

How can I get println!() to actually show up in the "cargo make test"?

Tests only print their stdout when they fail so adding a todo!() or something like that in the end shows their output

@kdheepak
Copy link
Collaborator

kdheepak commented May 2, 2024

You can also add -- --nocapture and you'll be able to stdout.

I like to use this incantation when I'm running any test function name that contains the string test_render:

cargo test --lib test_render -- --nocapture
  • --lib only runs tests that are part of the ratatui library, so no tests from the tests folder
  • test_render is the string matches for in the function names of all the tests
  • -- --nocapture prints everything to stdout

@joshka
Copy link
Member

joshka commented May 6, 2024

I took a stab at a different implementation of this in #1089

I removed the truncate method, instead rendering the spans one by one, and added a Span::split_at_width() method to help split left/right spans.

My priority is on fixing the panics to help the downstream apps not fail here.

@FedericoBruzzone
Copy link

@joshka awesome!

I was considering proceeding with this work but...
Can you confirm that it makes sense to close this pull request and that you will fix this problem in the next release?

@joshka
Copy link
Member

joshka commented May 19, 2024

Yep. This should be out later today

@joshka
Copy link
Member

joshka commented May 21, 2024

Fixed in #1089

@joshka joshka closed this May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

panicked in Line::truncated method due to byte index not being a char boundary
6 participants