New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(line): panic in truncated due to byte index not being a char boundary #1055
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1055 +/- ##
=====================================
Coverage 89.4% 89.4%
=====================================
Files 61 61
Lines 15468 15503 +35
=====================================
+ Hits 13839 13874 +35
Misses 1629 1629 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the added test cases and change of the relevant lines.
I added some simplification / improvement hints.
(I think the general code in this method is overcomplicated and should be simplified. Some math stuff should be reused like the check for max length. Adding offset and removing offset directly afterward resulting in pointless operations. And so on. But that's something for its own issue/pull request)
Performance wise… this always clones the content. When there is no need to truncate the span then reuse it directly without any clone. Not sure whether that should be part of this PR or its own PR too.
Co-authored-by: EdJoPaTo <github@edjopato.de>
Co-authored-by: EdJoPaTo <github@edjopato.de>
Co-authored-by: EdJoPaTo <github@edjopato.de>
Co-authored-by: EdJoPaTo <github@edjopato.de>
Suggestions are reasonable and I applied them all. Many thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that this will still break on characters that are wider than a single cell (e.g. the crab character is a good one to test with). To fix this will require looking at every character in .chars(), checking the Unicode width of the char, and keeping track of the number of characters so far.
There's a few edge cases:
- start position is halfway through a character - skip it
- end position is halfway through a character - skip it
- unit tests should show that there's no off by one issue on these boundary conditions
+1 to @EdJoPaTo's comment that this code is a bit complicated. Adding the right unit tests that clarify the behavior should help make it easy to refactor.
@joshka I am not sure why you want to inspect the Unicode length of the characters in char(), regardless of the unicode length, they should print as one character on screen. If skipping and taking the desired count of characters in char(), why should it abort? It cannot abort because of cutting into a UTF-8 character. That is taken care of by char() iterator? |
The specific code here is used to align the line based on the characters. Characters take up from zero to many columns when displayed. There were two incorrect assumptions made when implementing this:
You fixed the first point, but the second governs how text will be aligned. If there's four double width characters in a right aligned line, rendering into 10 wide area, the line should start at (10-4*2)=2, not (10-4)=6 There's a unit test in #1032 which should cover part of that (the one with the crabs). I'm not sure if the crab is double width or single off the top of my head. |
@joshka Sorry I was not aware that there actually are characters that print to more than one cell in the terminal. But yes, it makes sense. The first wrong assumption causes the application actually to panic, the second assumption causes the output of truncated() to be too long. I do not know if this would lead to a panic. But fixing the second wrong assumption means at least keeping track of the output string while iterating (if one goes the simple way of cutting the truncated string early). But to do it proper for right-aligned or center-aligned would even require a two-pass approach to find the proper cut position. I currently do not have an idea how to do this properly and efficient. |
I'm fine with keeping this open if you're motivated to continue with it once the truncation crate releases. |
Yeah sure 👍🏼 |
Looks like that crate implements the truncating algorithm correctly, and its test suite takes into account characters that have a width other than 1, so I'm all for using that crate |
Hm, but as a line contains multiple spans and they must be retained (styling, etc.), I am not sure we can simply use the mentioned crate. We would still need to determine for each span how it must be cut down. As the fn in the crate operates on a UTF-8 string as a whole. |
Yeah, the currently existing logic of handling the spans is still required. The crate only handles what’s inside a span.
yeah, happens often but in this case the author did something similar with the last PR so I was hopeful there. (Thank you @Aetf :) )
This PR isn’t useless and shouldn’t be closed. It includes the relevant tests and for its implementation we noticed it isn’t optimal yet so we keep on improving it until we are fine with it. That’s part of the PR. They aren’t always perfect from the start. So thank you @Thorongil80 for providing the PR and improving it :) We will get there :) |
Hello guys. I can confirm that unicode characters of length 1 and 2 work. |
and it got released with unicode-truncate 1.0.0: https://github.com/Aetf/unicode-truncate/releases/tag/v1.0.0 |
Mother of christ, pushing runs the whole test suite.. Nice ;-) |
Hi folks, I exchanged the lines with calls to unicode_truncate() and unicode_truncate_start() from the unicode_truncate crate. I note there is no function unicode_truncate_range() or something that cuts at the start and the end. So in the last loop iteration, where our offset and span length traverse output_width two calls are necessary. But this is only in one loop iteration, it breaks the loop afterwards. The convenient functions @EdJoPaTo added over there for truncating in alignment we cannot really use as we still have to deal with spans. The code determining the offset relative to alignment is using widths calculated already by unicode_width functions. I don't know if some comments would help in this code, it takes some time to understand. Cheers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without more context, it's difficult to understand what the tests are actually testing. What is even / odd? Why is it relevant to the tests? What specifically about utf8 is being tested here? I sort of understand the intent from the PR / bug (take a string that has characters that don't fall on byte indexes and truncate it correctly). I'm not sure I follow how the tests test that though (there's a missing link that's not obvious). I'm certain someone looking at this code in a few years time wouldn't be able to discern the intent.
I'd suggest making the multi-byte characters more obvious / documented in the test - it's reasonable to test this with a single byte numeric characters for the positions that don't matter (12345x6789) and use a specifically known multi-byte character only in the test position (like the crab emoji in #1032).
Additionally, it's worth remembering that there's two things to check here that will both cause a panic:
- If we're truncating 10 chars to fit in a buffer that is 4 characters wide, and one of those characters takes up two cells, what happens?
- If we're truncating 10 characters to fit into a buffer that is 4 characters wide and the 4th character takes up two bytes, what happens?
Additionally, a technique that's really really useful (and encouraged) when fixing a bug in released software is to make sure that the the specific bug is tested (or a minimal repro of it). Here the bug is that the truncation causes a panic due to rendering to a buffer area that is smaller than the truncated value. So the bug is felt by the user calling the render function, not by calling the truncate function. Testing truncate is good as a unit test, but make sure to include regression test(s) like the one in #1032 (Add a link to the Issue as well to highlight that the test is a regression test).
At 30 lines, it was above my taste for methods without explanation. At 40, it's well above that level. This is mainly because with every if statement / calculation comes a design decision. When the intent of those decisions isn't captured all we know is that the code acts a certain way because it acts a certain way. The code is now the spec, and understanding how the code is suppose to work requires reading every line of the code, keeping the state of the code in your head when reading it This is one of the general problems with iterative imperative code like this, it doesn't allow you to abstract checkpoint a partial state. I haven't spent time reading this yet, but I suspect there's some refactoring / reorganizing that could make it much clearer. That might be variable names, smaller named partial calculations, functions, etc. Perhaps start with the tests to ensure that you've got enough confidence that changing this doesn't break things, and then take a crack at addressing that. I'll take a look at doing this in the next couple of days if you're not able to simplify. |
I agree, will work out the tests better. The panic in #1032 was truncating the &str at a "forbidden position", not during rendering. The "even" and "odd" cutting means that in a str composed only of UTF-8 symbols encoded with two characters, guaranteed one of the both functions would surely hit a UTF-8 symbol at a position not allowed for cutting if implemented in the way it was before. The panic was not caused by rendering to a too small buffer. |
The truncate method is only called from the render function. It's not a public function.
Yes, you're right, however it's really easy to fix one part of this problem and break the other, as multi-byte characters tend to have a bit of overlap with mutlli-width characters.
Got it. Now I understand why you did that, but definitely think using a single multi-byte character is a simplification of the test. I'd be inclined to test that this fix works before, inside and after the character, not just before / inside. |
Is Multi-Width.-Character an issue here? All widths are calculated by unicode_width crate, as I understand. Maybe we need to devise a test specifically for this. But I guess it is solved. |
The truncation feature was added in 0.26.2 without considering either multi-byte or multi-width characters. I'm not confident that either case is correctly handled, and there are no tests for either that would provide that confidence. When truncating a string with multi-width characters, the truncation has to result in a string that can be rendered into the space that it will fit. If it doesn't, then it's possible that this will cause a panic by writing beyond the extent of the buffer (specifically on the right side, but I'd guess it's possible to trigger a panic with right aligned / centered text too). |
Indeed I found issues the code misbehaves. I added crabs. Truncating right-aligned with the crab in the right-most-position: the crab disappears. Should not be as the text is supposed to be right-aligned. Truncating right-aligned with the crab in the left-most-position: the crab is still rendered and the character in the right-most-position is omitted instead. So this code above pretends to truncate right-aligned or centered, but it seems to do this only for strings that do not contain double-width characters. As one can see from the code, of course, the operations on the spans are always left-aligned truncations, so we do not get any benefits from using the external crate that can do this properly. The algorithm needs to be reversed for right-alignment, so if one wants to retain the algorithm that works on the spans, they must be processed "from the right" or "from the left" accordingly and the truncation operations on the spans must cut from the left or from the right. If Centering, we then must decide on a preference. Are we "centering and omitting double-width-overflow on the right" or "centering and omitting double-width-overflow on the left" or are we "centering and omitting doible-width-overflow accordingly on both sides with integer division tending to the left" or something. This is ... disturbing ;-) |
For which way to center, there's an existing convention that likely shows up in tests somewhere that should be extended (or maintained). I recall having a conversation about it in an issue a while back. |
I am at a loss. I tried to refactor. Introduce helper functions for the three types of truncation.
Both cut the line (w single span) "Hello world" when called with "5" to "Hello". But I do not see why, the truncate_right does everything from the right end of the line. How can I get println!() to actually show up in the "cargo make test"? |
Tests only print their stdout when they fail so adding a |
You can also add I like to use this incantation when I'm running any test function name that contains the string
|
I took a stab at a different implementation of this in #1089 I removed the truncate method, instead rendering the spans one by one, and added a Span::split_at_width() method to help split left/right spans. My priority is on fixing the panics to help the downstream apps not fail here. |
@joshka awesome! I was considering proceeding with this work but... |
Yep. This should be out later today |
Fixed in #1089 |
Hi,
I think this will fix #1032
One needs to iterate chars() instead of slicing into the &str content itself. Obvious Rust thing with relation to UTF-8.
Added some truncation tests truncating UTF-8 strings left- and right-centered, even and odd numbers.
If there is a more performant way for things to do, please adjust accordingly.
Regards,
Thomas