How diffsitter works, How performant it is and What cases are exemplatory good and bad #149

matu3ba · 2021-07-13T08:53:20Z

Difftastic has a complete section dedicated what methods it uses, but has the performance not written in the repo but in the inspired projects:
https://github.com/Wilfred/difftastic#how-it-works
Could you elaborate what algorithm you use for the diff? A* and dijkstra for graph difference are the most prominent ones, where A* is much more performant on multiple changes (in a function) but has overall worse results.

Performance of examples (ideally with benchmarks), Known problems and Non-Goals would complete the first impression.

I am asking due for an evaluation of neovim what to use, but I am not affiliated or core member (only neovim user) and would like to see things moving forward.

afnanenayet · 2021-07-15T06:28:53Z

I can write up how it works. I started out with a very naive min-edit-distance algorithm, but I was thinking of switching to Myers since you're effectively comparing an in-order traversal of the graph and Myers is extremely efficient in terms of space (profiling showed that the current biggest bottleneck in diffsitter is malloc). Diffsitter itself has lots of logging timers and I've done some profiling with dtrace so I know where the current hotspots are.

I can also try operating directly on the graph differences and see which approach is the fastest, I've been meaning to improve the performance for a while now.

I'll add documentation with the known problems and non-goals soon (TM).

These are great suggestions, thanks :)

krishnakumarg1984 · 2022-05-26T12:21:40Z

What do the devs here think about considering various aspects mentioned here in a future release?

0xdevalias · 2024-01-30T07:45:05Z

@afnanenayet Curious what the state of this is in 2024?

Searching the repo, it looks like the Myers algorithm may have been implemented in September 2021, eg.:

perf: Use the Myer's diff algorithm #192

I was just trying to diff a large bundled/minified JS file (~8.4MB; ~249,128 lines long), and noticed that majority of the time was spent in the diff::compute_edit_script() step (737.685108096s, aka: ~12.29min); whereas parsing/processing the AST only took less than 2s each. I also think it might have silently crashed somewhere along the way, but I haven't had a chance to try it again to confirm that yet. You can see more detailed info/specifics in this comment:

Explore AST based diff tools 0xdevalias/chatgpt-source-watch#3 (comment)

Minimising the debug output to just the relevant timing events:

// Start
2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish

As a point of comparison, difftastic was seemingly able to diff this file in ~6.746sec total:

⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   2.63s user 0.46s system 47% cpu 6.494 total
subl  0.01s user 0.02s system 0% cpu 6.746 total

Though with a lot of this printed among the diff output:

_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)

Edit: It seems when DFT_BYTE_LIMIT is exceeded difftastic falls back to a text diff, so that's not really a fair time comparison:

https://github.com/Wilfred/difftastic/blob/master/CHANGELOG.md#020-released-20th-february-2022
- difftastic will now use a text diff for large files that are too big to parse in a reasonable amount of time. This threshold is configurable with --byte-limit and DFT_BYTE_LIMIT.

I tried overriding that in my .gitconfig:

# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
  cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"

And then running it again, but then I just got a different set of errors:

 ⇒ time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   12.42s user 1.10s system 79% cpu 17.043 total
subl  0.01s user 0.02s system 0% cpu 17.248 total

_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)

afnanenayet · 2024-01-30T13:23:18Z

Could you send me the file you were trying to diff? Would love to do some profiling and see if there's anything I can do to make diffsitter faster. Alternatively we could have some timeout for computing the edit script and fallback to a simpler diff algo or error out.

I'm actually working on ditching the myers diff algo in favor of computing a diff on the tree directly

afnanenayet · 2024-01-30T13:24:53Z

I also think it might have silently crashed somewhere along the way, but I haven't had a chance to try it again to confirm that yet.

Also very intrigued by this. What did you see that led you to suspect this? If you get a chance to investigate feel free to throw me whatever you got.

Thanks for all the work you've put in to raising issues and helping with debugging, it's much appreciated!

0xdevalias · 2024-01-30T21:21:00Z

Could you send me the file you were trying to diff?

@afnanenayet Was trying to diff these 2 (basically to see the changes between the 2 builds/commits):

Alternatively we could have some timeout for computing the edit script and fallback to a simpler diff algo or error out.

@afnanenayet For my specific use case / interest here, timing out would be suboptimal, as ideally I would like it to run even if it takes a longer time to do it (assuming the output ends up being good). Though a timeout/error for others in more normal use cases might make sense.

I'm actually working on ditching the myers diff algo in favor of computing a diff on the tree directly

@afnanenayet Oo, interesting! Is that for speed reasons, or better outcomes, or?

I also think it might have silently crashed somewhere along the way

What did you see that led you to suspect this? If you get a chance to investigate feel free to throw me whatever you got.

@afnanenayet See below for more details, but it was basically a gut feel based on not seeing the diff output (I might have closed the file it was piping to by accident though), and not seeing the 'end hunk' line in the debug logs.

You can see the full output/etc in this comment, look for this part:

We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long)
diffsitter debug output from large example
We can see that ~12.29 minutes was spent in diff::compute_edit_script(), and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the final DEBUG libdiffsitter::render::unified > End hunk line that I would have expected to see.

Note: I was piping this command to subl (Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.
Explore AST based diff tools 0xdevalias/chatgpt-source-watch#3 (comment)

Thanks for all the work you've put in to raising issues and helping with debugging, it's much appreciated!

@afnanenayet Thanks for making an awesome tool! It's exciting to see such quick/positive responses; I had somewhat assumed that it might have been abandonware.

afnanenayet · 2024-02-02T03:50:27Z

Nope! Not abandonware, I'm just busy and have been working on these things really slowly

0xdevalias · 2024-02-02T09:53:09Z

Haha, totally understand that feel.

afnanenayet added documentation Improvements or additions to documentation question Further information is requested and removed question Further information is requested labels Jul 21, 2021

0xdevalias mentioned this issue Jan 30, 2024

Explore AST based diff tools 0xdevalias/chatgpt-source-watch#3

Open

0xdevalias mentioned this issue Jan 30, 2024

Support ignoring differences that only consist of variable/function name changes (eg. within minified JavaScript) Wilfred/difftastic#631

Open

0xdevalias mentioned this issue Jan 31, 2024

Support ignoring differences that only consist of variable/function name changes (eg. within minified JavaScript) #819

Open

0xdevalias mentioned this issue Apr 24, 2024

Simple diff analysis based on strings and identifiers 0xdevalias/chatgpt-source-watch#10

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How diffsitter works, How performant it is and What cases are exemplatory good and bad #149

How diffsitter works, How performant it is and What cases are exemplatory good and bad #149

matu3ba commented Jul 13, 2021 •

edited

afnanenayet commented Jul 15, 2021 •

edited

krishnakumarg1984 commented May 26, 2022 •

edited

0xdevalias commented Jan 30, 2024 •

edited

afnanenayet commented Jan 30, 2024

afnanenayet commented Jan 30, 2024

0xdevalias commented Jan 30, 2024 •

edited

afnanenayet commented Feb 2, 2024

0xdevalias commented Feb 2, 2024

How diffsitter works, How performant it is and What cases are exemplatory good and bad #149

How diffsitter works, How performant it is and What cases are exemplatory good and bad #149

Comments

matu3ba commented Jul 13, 2021 • edited

afnanenayet commented Jul 15, 2021 • edited

krishnakumarg1984 commented May 26, 2022 • edited

0xdevalias commented Jan 30, 2024 • edited

afnanenayet commented Jan 30, 2024

afnanenayet commented Jan 30, 2024

0xdevalias commented Jan 30, 2024 • edited

afnanenayet commented Feb 2, 2024

0xdevalias commented Feb 2, 2024

matu3ba commented Jul 13, 2021 •

edited

afnanenayet commented Jul 15, 2021 •

edited

krishnakumarg1984 commented May 26, 2022 •

edited

0xdevalias commented Jan 30, 2024 •

edited

0xdevalias commented Jan 30, 2024 •

edited