
Some optimizations (3) #401

Open · wants to merge 58 commits into master

Conversation

@QuarticCat commented Oct 7, 2022

A continuation of #393 and #395.

Use the same benchmark and baseline as in #395.

Speed: 383%

Memory: 31%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     292.2 ms ±   4.4 ms    [User: 277.2 ms, System: 14.6 ms]
  Range (min … max):   288.9 ms … 301.9 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            282.10 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,228      page-faults:u             #    4.353 K/sec                  
       961,960,716      cycles:u                  #    3.410 GHz                    
        14,537,438      stalled-cycles-frontend:u #    1.51% frontend cycles idle   
       294,187,883      stalled-cycles-backend:u  #   30.58% backend cycles idle    
     1,441,578,806      instructions:u            #    1.50  insn per cycle         
                                                  #    0.20  stalled cycles per insn
       245,978,361      branches:u                #  871.963 M/sec                  
         8,989,475      branch-misses:u           #    3.65% of all branches        

       0.282490414 seconds time elapsed

       0.272161000 seconds user
       0.010072000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.24s  user 0.02s system 99% cpu 0.261 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                169 MB
page faults from disk:     0
other page faults:         1295

Two major optimizations:

  • (5e5eef2, 7e4e747) Use a bitmap to represent the stack, with the constraint that it holds at most 63 entries (i.e., 63 layers of nested delimiters, which should be rare in practice); the limit can easily be extended. A minimal sketch follows the screenshots below.

  • (10ce859 to edc5516) Prune all ExitDelimiter* nodes and edges, which account for around 15%–25% of all nodes, depending on the input. As a side effect, this also fixes the following diff:

Before:

[screenshot: diff output before this change]

After:

[screenshot: diff output after this change]
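
For illustration, here is a minimal sketch of such a bitmap stack (names and details are illustrative, not the exact PR code). The low bits of a u64 hold the entries and a sentinel 1 bit marks the depth, so the type stays Copy and whole-stack equality is a single integer comparison:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct BitStack(u64);

impl BitStack {
    fn new() -> Self {
        BitStack(1) // just the sentinel: an empty stack
    }

    fn push(&mut self, bit: bool) {
        // The sentinel must not be shifted out of the u64.
        debug_assert!(self.0.leading_zeros() > 0, "stack deeper than 63");
        self.0 = (self.0 << 1) | bit as u64;
    }

    fn pop(&mut self) -> Option<bool> {
        if self.0 == 1 {
            return None; // only the sentinel left
        }
        let bit = self.0 & 1 == 1;
        self.0 >>= 1;
        Some(bit)
    }
}
```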

I think my work deserves a version bump and maybe a reddit announcement XD.

@QuarticCat commented:

Hi, I recently wrote a new crate that helps define tagged pointers. I haven't published it yet. Would you be interested in reviewing and using it?

https://github.com/QuarticCat/enum-ptr

@QuarticCat commented Nov 13, 2022

Speed: 400%
Memory: 29%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     279.8 ms ±   3.1 ms    [User: 262.8 ms, System: 16.6 ms]
  Range (min … max):   276.1 ms … 286.0 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            279.68 msec task-clock:u                     #    0.999 CPUs utilized          
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
             1,224      page-faults:u                    #    4.376 K/sec                  
       948,057,136      cycles:u                         #    3.390 GHz                    
        17,016,497      stalled-cycles-frontend:u        #    1.79% frontend cycles idle   
       259,400,676      stalled-cycles-backend:u         #   27.36% backend cycles idle    
     1,304,835,983      instructions:u                   #    1.38  insn per cycle         
                                                  #    0.20  stalled cycles per insn
       240,075,671      branches:u                       #  858.387 M/sec                  
         9,024,946      branch-misses:u                  #    3.76% of all branches        

       0.280024823 seconds time elapsed

       0.259759000 seconds user
       0.019976000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.24s  user 0.01s system 99% cpu 0.247 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                159 MB
page faults from disk:     0
other page faults:         1286

I discovered another representation of parent stacks: record the parent vertex of the last PopBoth vertex, plus the number of PopEither entries in the two stacks. We can prove that if two vertices are equal and have the same parent stack, then their pop_both_ancestor must be equal. Thus, with this new representation, we can still achieve O(1) whole-stack comparison. This refactoring also relaxes the limit from 63 parents in total to u16::MAX consecutive PopEither entries. It also shaves 8 bytes off Vertex, since the two new u16 fields fit in the padding of Cell<bool>.
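
For concreteness, a sketch of this representation (field names are illustrative, not the exact PR code). Deriving PartialEq gives the O(1) whole-stack comparison directly, since only one ancestor id and two counters are compared:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct VertexId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ParentStack {
    pop_both_ancestor: Option<VertexId>, // parent vertex of the last PopBoth
    pop_lhs_cnt: u16, // consecutive PopEither entries on the LHS stack
    pop_rhs_cnt: u16, // consecutive PopEither entries on the RHS stack
}
```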

@QuarticCat commented:

@Wilfred Hi, I wrote a summary that you might be interested in.

https://blog.quarticcat.com/posts/optimize-difftastic/

Wilfred added a commit that referenced this pull request Dec 28, 2022
@QuarticCat observed that popping delimiters is unnecessary, and saw a
speedup in PR #401. This reduces the number of nodes in typical graphs
by ~20%, reducing runtime and memory usage.

This works because there is only one thing we can do at the end of a
list: pop the delimiter. The syntax node on the other side does not
give us more options, we have at most one. Popping all the delimiters
as soon as possible is equivalent, and produces the same graph route.

This change has also slightly changed the output of
sample_files/slow_after.rs, producing a better (more minimal)
diff. This is probably luck, due to the path-dependent nature of the
route-solving logic, but it's a positive sign.

A huge thanks to @QuarticCat for their contributions; this is a major
speedup.

Co-authored-by: QuarticCat <QuarticCat@pm.me>
@QuarticCat commented:

I just realized I forgot to post the latest progress (afe07ea - 80eae29). These commits have virtually no effect on speed, but memory usage dropped to 125 MB, which is ~23% of the original.

To further optimize memory usage, you can:

  • Remove unused fields from SyntaxInfo, including previous_sibling, prev, num_after, and unique_id (which can be replaced by pointers, since the nodes are pinned). You might want to keep them for experimenting with A* heuristics, though. I haven't found a way to easily remove and re-add these fields.
  • Simplify Syntax, including:
    • Change Vec<T> to Box<[T]>
    • Maybe concatenate {open,close}_position and {open,close}_content
    • Change String to Box<str> or &str
    • Remove children, since head_child + next_sibling is enough

These might give you another ~5 MB reduction, and will probably also yield slightly better performance from improved locality and memory access patterns (syntax.head_child vs syntax.children.get(0)).
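
A rough sketch of what the slimmer layout could look like (assumed shapes; the real Syntax has more fields and variants):

```rust
struct SyntaxInfo<'a> {
    next_sibling: Option<&'a Syntax<'a>>,
    head_child: Option<&'a Syntax<'a>>, // replaces `children: Vec<&Syntax>`
}

enum Syntax<'a> {
    List {
        info: SyntaxInfo<'a>,
        open_content: Box<str>,  // Box<str> drops String's capacity word
        close_content: Box<str>,
    },
    Atom {
        info: SyntaxInfo<'a>,
        content: Box<str>,
    },
}
```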

@QuarticCat commented Feb 8, 2023

I guess you are not willing to adopt some of my optimizations, which is fine. But there are still some harmless yet beneficial ones:

  • Skip visited vertices and remove the neighbor field from Vertex (a rough sketch follows this list)
  • Refactor the parent stack (pop_both_ancestor + pop_lhs_cnt + pop_rhs_cnt)
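
A rough sketch of the first idea (vertex storage shapes assumed): with a per-vertex visited flag, stale priority-queue entries can simply be discarded when popped, so neighbors never need to be cached on the Vertex:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn shortest_path_loop(heap: &mut BinaryHeap<Reverse<(u64, usize)>>, visited: &mut [bool]) {
    while let Some(Reverse((distance, idx))) = heap.pop() {
        if visited[idx] {
            continue; // already settled: skip the stale heap entry
        }
        visited[idx] = true;
        // ...compute the neighbors of `idx` on the fly and push them,
        // rather than storing them in a `neighbors` field...
        let _ = distance;
    }
}
```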

BTW, I found a TODO in your current code:

TODO: experiment with storing SyntaxId only, and have a HashMap from SyntaxId to &Syntax.

I don't think this is a good idea; even a Vec would be better than a HashMap here. And you still have a big problem: how do you implement Hash for Vertex? Unless this Vec / HashMap is a global or thread-local variable, you cannot access it inside the Hash impl. Storing a reference to it in each vertex is even worse, because then you lose the memory advantage.

According to my profiling results, most of the time is spent on hash-map operations. A hash function distributes keys roughly uniformly across slots, so in a hash map larger than your cache, nearly every lookup is a cache miss. I have a benchmark that verifies this idea, and I hope it can bring you some inspiration.
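
A toy version of such a benchmark might look like this (illustrative only, not the actual benchmark; timings are machine-dependent):

```rust
use std::collections::HashMap;
use std::time::Instant;

fn main() {
    const N: u64 = 10_000_000;
    let map: HashMap<u64, u64> = (0..N).map(|i| (i, i)).collect();
    let vec: Vec<u64> = (0..N).collect();

    let start = Instant::now();
    let mut sum = 0u64;
    for i in 0..N {
        sum = sum.wrapping_add(map[&i]); // hashed slot: effectively random access
    }
    println!("hashmap: {:?} (sum = {sum})", start.elapsed());

    let start = Instant::now();
    let mut sum = 0u64;
    for i in 0..N as usize {
        sum = sum.wrapping_add(vec[i]); // sequential: prefetcher-friendly
    }
    println!("vec:     {:?} (sum = {sum})", start.elapsed());
}
```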

@Wilfred commented Feb 9, 2023

Hey @QuarticCat!

Yep, I won't accept all these commits in their current form, but there are so many good ideas here that I still want to look at and apply. For example, in #395 you made the really clever observation that a Vertex only ever has a lhs_syntax or a lhs_parent, but not both, so there's major potential for memory savings there.

I'm not sure about the exact naming of the APIs, and SyntaxRefOrId made me think that I'd be better off treating lhs_syntax and lhs_parent consistently if at all possible.

I've also been thinking about the unsafe usage, and decided that I'd rather take the performance loss than reason about it. I see you've run Miri on your standalone pointer-tagging library, which gives me a bunch more confidence in that implementation :)

I don't think this is a good idea; even a Vec would be better than a HashMap here. And you still have a big problem: how do you implement Hash for Vertex?

This was inspired by your work that made vertices smaller :). I was thinking that the HashMap would not be stored in Vertex; instead, Vertex would only contain a SyntaxId, and I can have a separate global HashMap<SyntaxId, &Syntax> to get the actual syntax nodes. Hash for Vertex would still be fine: I can totally hash a SyntaxId.

It's possible that the additional hashmap lookup would hurt performance, but overall (again based on your observations) it looks like performance is very sensitive to the size of Vertex. It depends on what the benchmarks show, of course.
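
Roughly this shape (types assumed for illustration):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SyntaxId(u32);

struct Syntax {
    // ...the actual syntax node data, elided...
}

#[derive(PartialEq, Eq, Hash)]
struct Vertex {
    lhs_syntax: Option<SyntaxId>, // a small Copy id instead of a reference
    rhs_syntax: Option<SyntaxId>,
}

// Side table built once per file pair; vertices never store it.
fn lookup<'a>(table: &HashMap<SyntaxId, &'a Syntax>, id: SyntaxId) -> Option<&'a Syntax> {
    table.get(&id).copied()
}
```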

@Wilfred commented Feb 9, 2023

(I'm also still hopeful that I'll eventually get an A* implementation that's a performance win too. So many interesting avenues to explore!)
