Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Word differ returns out-of-bounds column numbers #699

Open
sesse opened this issue Apr 9, 2024 · 1 comment
Open

Word differ returns out-of-bounds column numbers #699

sesse opened this issue Apr 9, 2024 · 1 comment

Comments

@sesse
Copy link
Contributor

sesse commented Apr 9, 2024

It seems that when a large token is split into multiple Syntax atoms, and that token spans multiple lines, column numbers that are out-of-bounds can be created (this may be a bug in the line-numbers crate).

To reproduce, run difftastic on old.xml and new.xml (they are from the Chromium repository). Files in files.tar.gz

You will get:
image

The badness happens somewhere in split_atom_words(), and I would assume that it is related to the comment “TODO: don't assume this atom is on a single line”. For instance, we have a single atom '.' at some point, at offset 67. We have pos that looks like this (though we only use the first element, pos[0]):

[
  SingleLineSpan { line: LineNumber: 1102 (zero-indexed: 1101), start_col: 0, end_col: 80 },
  SingleLineSpan { line: LineNumber: 1103 (zero-indexed: 1102), start_col: 0, end_col: 64 },
  SingleLineSpan { line: LineNumber: 1104 (zero-indexed: 1103), start_col: 0, end_col: 79 },
  SingleLineSpan { line: LineNumber: 1105 (zero-indexed: 1104), start_col: 0, end_col: 15 },
  SingleLineSpan { line: LineNumber: 1106 (zero-indexed: 1105), start_col: 0, end_col: 2 }
] 

What we get back, and put in word_pos, is:

[SingleLineSpan { line: LineNumber: 1103 (zero-indexed: 1102), start_col: 66, end_col: 67 }]

This is nonsensical; line 1103 has only 64 columns, so column “66 to 67” doesn't exist and the diff gets confused. But on column 67 on line 1102, there is indeed a period. Why it ends up with line 1103 is a bit beyond me, because it's not entirely clear to me exactly what from_region_relative_to is supposed to be doing.

@sesse
Copy link
Contributor Author

sesse commented Apr 9, 2024

Aha:

        // Don't create a MatchedPos for empty positions at the start                                                                                 
        // or end. We still want empty positions in the middle of                                                                                     
        // multiline atoms, as a multiline string literal may include                                                                                 
        // empty lines.                                                                                                                               
        let pos = filter_empty_ends(pos);  

This ends badly, since the content of the tag starts with a newline and there is no corresponding filtering of that. So we take the byte offset of the . character, find its line:col within \nLINE 1\nLINE 2 and then use that line:col to offset from the start of LINE 1\nLINE 2 (no leading newline) in the document.

I guess a possible fix (which also would save some RAM) would be to just work with byte offsets everywhere instead of line:col, but that would probably be hard. I guess you also have alternatives like this hack (which is mostly for illustration purposes):

diff --git a/src/parse/syntax.rs b/src/parse/syntax.rs
index 4d9e22fa6..d25d28058 100644
--- a/src/parse/syntax.rs
+++ b/src/parse/syntax.rs
@@ -4,6 +4,7 @@
 
 use std::{cell::Cell, env, fmt, hash::Hash, num::NonZeroU32};
 
+use line_numbers::LineNumber;
 use line_numbers::LinePositions;
 use line_numbers::SingleLineSpan;
 use typed_arena::Arena;
@@ -673,6 +674,7 @@ pub(crate) struct MatchedPos {
 fn split_atom_words(
     content: &str,
     pos: &[SingleLineSpan],
+    skipped_first: bool,
     opposite_content: &str,
     opposite_pos: &[SingleLineSpan],
     kind: AtomKind,
@@ -700,6 +702,15 @@ fn split_atom_words(
 
     let content_newlines = LinePositions::from(content);
     let opposite_content_newlines = LinePositions::from(opposite_content);
+    let root_pos = if skipped_first {
+        SingleLineSpan {
+            line: LineNumber::from(pos[0].line.0 - 1),
+            start_col: pos[0].start_col,
+            end_col: pos[0].end_col,
+        }
+    } else {
+        pos[0]
+    };
 
     let mut offset = 0;
     let mut opposite_offset = 0;
@@ -716,7 +727,7 @@ fn split_atom_words(
                         },
                         pos: content_newlines.from_region_relative_to(
                             // TODO: don't assume a single line atom.
-                            pos[0],
+                            root_pos,
                             offset,
                             offset + word.len(),
                         )[0],
@@ -728,7 +739,7 @@ fn split_atom_words(
                 // This word is present on both sides.
                 // TODO: don't assume this atom is on a single line.
                 let word_pos =
-                    content_newlines.from_region_relative_to(pos[0], offset, offset + word.len())
+                    content_newlines.from_region_relative_to(root_pos, offset, offset + word.len())
                         [0];
                 let opposite_word_pos = opposite_content_newlines.from_region_relative_to(
                     opposite_pos[0],
@@ -787,18 +798,22 @@ fn has_common_words(word_diffs: &Vec<myers_diff::DiffResult<&&str>>) -> bool {
 }
 
 /// Skip line spans at the beginning or end that have zero width.
-fn filter_empty_ends(line_spans: &[SingleLineSpan]) -> Vec<SingleLineSpan> {
+fn filter_empty_ends(line_spans: &[SingleLineSpan]) -> (Vec<SingleLineSpan>, bool) {
     let mut spans: Vec<SingleLineSpan> = vec![];
+    let mut skipped_first = false;
 
     for (i, span) in line_spans.iter().enumerate() {
         if (i == 0 || i == line_spans.len() - 1) && span.start_col == span.end_col {
+            if i == 0 {
+                skipped_first = true;
+            }
             continue;
         }
 
         spans.push(*span);
     }
 
-    spans
+    (spans, skipped_first)
 }
 
 impl MatchedPos {
@@ -812,7 +827,7 @@ impl MatchedPos {
         // or end. We still want empty positions in the middle of
         // multiline atoms, as a multiline string literal may include
         // empty lines.
-        let pos = filter_empty_ends(pos);
+        let (pos, skipped_first) = filter_empty_ends(pos);
 
         match ck {
             ReplacedComment(this, opposite) | ReplacedString(this, opposite) => {
@@ -839,7 +854,14 @@ impl MatchedPos {
                     AtomKind::Comment
                 };
 
-                split_atom_words(this_content, &pos, opposite_content, opposite_pos, kind)
+                split_atom_words(
+                    this_content,
+                    &pos,
+                    skipped_first,
+                    opposite_content,
+                    opposite_pos,
+                    kind,
+                )
             }
             Unchanged(opposite) => {
                 let opposite_pos = match opposite {

@sesse sesse changed the title Word differ returns out-of-bounds line numbers Word differ returns out-of-bounds column numbers Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant