Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How diffsitter works, How performant it is and What cases are exemplatory good and bad #149

Open
matu3ba opened this issue Jul 13, 2021 · 8 comments
Labels
documentation Improvements or additions to documentation

Comments

@matu3ba
Copy link

matu3ba commented Jul 13, 2021

Difftastic has a complete section dedicated what methods it uses, but has the performance not written in the repo but in the inspired projects:
https://github.com/Wilfred/difftastic#how-it-works
Could you elaborate what algorithm you use for the diff? A* and dijkstra for graph difference are the most prominent ones, where A* is much more performant on multiple changes (in a function) but has overall worse results.

Performance of examples (ideally with benchmarks), Known problems and Non-Goals would complete the first impression.

I am asking due for an evaluation of neovim what to use, but I am not affiliated or core member (only neovim user) and would like to see things moving forward.

@afnanenayet
Copy link
Owner

afnanenayet commented Jul 15, 2021

I can write up how it works. I started out with a very naive min-edit-distance algorithm, but I was thinking of switching to Myers since you're effectively comparing an in-order traversal of the graph and Myers is extremely efficient in terms of space (profiling showed that the current biggest bottleneck in diffsitter is malloc). Diffsitter itself has lots of logging timers and I've done some profiling with dtrace so I know where the current hotspots are.

I can also try operating directly on the graph differences and see which approach is the fastest, I've been meaning to improve the performance for a while now.

I'll add documentation with the known problems and non-goals soon (TM).

These are great suggestions, thanks :)

@afnanenayet afnanenayet added documentation Improvements or additions to documentation question Further information is requested and removed question Further information is requested labels Jul 21, 2021
@krishnakumarg1984
Copy link

krishnakumarg1984 commented May 26, 2022

What do the devs here think about considering various aspects mentioned here in a future release?

@0xdevalias
Copy link
Contributor

0xdevalias commented Jan 30, 2024

@afnanenayet Curious what the state of this is in 2024?

Searching the repo, it looks like the Myers algorithm may have been implemented in September 2021, eg.:

I was just trying to diff a large bundled/minified JS file (~8.4MB; ~249,128 lines long), and noticed that majority of the time was spent in the diff::compute_edit_script() step (737.685108096s, aka: ~12.29min); whereas parsing/processing the AST only took less than 2s each. I also think it might have silently crashed somewhere along the way, but I haven't had a chance to try it again to confirm that yet. You can see more detailed info/specifics in this comment:

Minimising the debug output to just the relevant timing events:

// Start
2024-01-30T07:07:58.282Z INFO  libdiffsitter::parse > Deduced language "typescript" from extension "js" provided from user mappings
// ..snip..
2024-01-30T07:07:59.791Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.50364822s
// ..snip..
2024-01-30T07:08:01.160Z INFO  TimerFinished        > parse::parse_file(), Elapsed=1.364231801s
2024-01-30T07:08:01.985Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=825.767557ms
2024-01-30T07:08:02.972Z INFO  TimerFinished        > ast::process(), Elapsed=1.812168007s
2024-01-30T07:08:03.814Z INFO  TimerFinished        > ast::from_ts_tree(), Elapsed=841.781925ms
2024-01-30T07:08:04.787Z INFO  TimerFinished        > ast::process(), Elapsed=1.815188386s
2024-01-30T07:20:22.478Z INFO  TimerFinished        > diff::compute_edit_script(), Elapsed=737.685108096s
// ..snip..
2024-01-30T07:20:22.822Z DEBUG libdiffsitter::render::unified > Printing line 96301
// Finish

As a point of comparison, difftastic was seemingly able to diff this file in ~6.746sec total:

time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   2.63s user 0.46s system 47% cpu 6.494 total
subl  0.01s user 0.02s system 0% cpu 6.746 total

Though with a lot of this printed among the diff output:

_app.js --- 1/674 --- Text (8.39 MiB exceeded DFT_BYTE_LIMIT)

Edit: It seems when DFT_BYTE_LIMIT is exceeded difftastic falls back to a text diff, so that's not really a fair time comparison:

I tried overriding that in my .gitconfig:

# https://github.com/Wilfred/difftastic
[difftool "difftastic"]
  cmd = difft --byte-limit 20971520 "$LOCAL" "$REMOTE"

And then running it again, but then I just got a different set of errors:

time git difftool --tool difftastic HEAD~1 HEAD -- unpacked/_next/static/chunks/pages/_app.js | subl
git difftool --tool difftastic HEAD~1 HEAD --   12.42s user 1.10s system 79% cpu 17.043 total
subl  0.01s user 0.02s system 0% cpu 17.248 total
_app.js --- 1/674 --- Text (2 JavaScript parse errors, exceeded DFT_PARSE_ERROR_LIMIT)

@afnanenayet
Copy link
Owner

Could you send me the file you were trying to diff? Would love to do some profiling and see if there's anything I can do to make diffsitter faster. Alternatively we could have some timeout for computing the edit script and fallback to a simpler diff algo or error out.

I'm actually working on ditching the myers diff algo in favor of computing a diff on the tree directly

@afnanenayet
Copy link
Owner

I also think it might have silently crashed somewhere along the way, but I haven't had a chance to try it again to confirm that yet.

Also very intrigued by this. What did you see that led you to suspect this? If you get a chance to investigate feel free to throw me whatever you got.

Thanks for all the work you've put in to raising issues and helping with debugging, it's much appreciated!

@0xdevalias
Copy link
Contributor

0xdevalias commented Jan 30, 2024

Could you send me the file you were trying to diff?

@afnanenayet Was trying to diff these 2 (basically to see the changes between the 2 builds/commits):


Alternatively we could have some timeout for computing the edit script and fallback to a simpler diff algo or error out.

@afnanenayet For my specific use case / interest here, timing out would be suboptimal, as ideally I would like it to run even if it takes a longer time to do it (assuming the output ends up being good). Though a timeout/error for others in more normal use cases might make sense.


I'm actually working on ditching the myers diff algo in favor of computing a diff on the tree directly

@afnanenayet Oo, interesting! Is that for speed reasons, or better outcomes, or?


I also think it might have silently crashed somewhere along the way

What did you see that led you to suspect this? If you get a chance to investigate feel free to throw me whatever you got.

@afnanenayet See below for more details, but it was basically a gut feel based on not seeing the diff output (I might have closed the file it was piping to by accident though), and not seeing the 'end hunk' line in the debug logs.

You can see the full output/etc in this comment, look for this part:

  • We can then see where the real performance hit is when running this against a MUCH larger file/diff (the file being diffed is ~8.4MB; ~249,128 lines long)
    diffsitter debug output from large example

  • We can see that ~12.29 minutes was spent in diff::compute_edit_script(), and then it looks like the script might have silently crashed, as I didn't see the diff output, nor the final DEBUG libdiffsitter::render::unified > End hunk line that I would have expected to see.

    Note: I was piping this command to subl (Sublime Text), despite what the code block above was edited to look like; I haven't tried running this again yet printing directly to STDOUT/a file. I also haven't tried running this without debug mode to see if it somehow doesn't crash that way.

  • Explore AST based diff tools 0xdevalias/chatgpt-source-watch#3 (comment)


Thanks for all the work you've put in to raising issues and helping with debugging, it's much appreciated!

@afnanenayet Thanks for making an awesome tool! It's exciting to see such quick/positive responses; I had somewhat assumed that it might have been abandonware.

@afnanenayet
Copy link
Owner

Nope! Not abandonware, I'm just busy and have been working on these things really slowly

@0xdevalias
Copy link
Contributor

Haha, totally understand that feel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

4 participants