
Some optimizations (3) #401

Open · wants to merge 58 commits into master

Conversation

@QuarticCat commented Oct 7, 2022

A continuation of #393 and #395.

Use the same benchmark and baseline as in #395.

Speed: 383%

Memory: 31%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     292.2 ms ±   4.4 ms    [User: 277.2 ms, System: 14.6 ms]
  Range (min … max):   288.9 ms … 301.9 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            282.10 msec task-clock:u              #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
             1,228      page-faults:u             #    4.353 K/sec                  
       961,960,716      cycles:u                  #    3.410 GHz                    
        14,537,438      stalled-cycles-frontend:u #    1.51% frontend cycles idle   
       294,187,883      stalled-cycles-backend:u  #   30.58% backend cycles idle    
     1,441,578,806      instructions:u            #    1.50  insn per cycle         
                                                  #    0.20  stalled cycles per insn
       245,978,361      branches:u                #  871.963 M/sec                  
         8,989,475      branch-misses:u           #    3.65% of all branches        

       0.282490414 seconds time elapsed

       0.272161000 seconds user
       0.010072000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.24s  user 0.02s system 99% cpu 0.261 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                169 MB
page faults from disk:     0
other page faults:         1295

Two major optimizations:

  • (5e5eef2, 7e4e747) Use a bitmap to represent the stack, with the constraint that it holds at most 63 entries (i.e., 63 layers of nested delimiters, which should be rare in practice); the limit can easily be extended. A minimal sketch follows the screenshots below.

  • (10ce859 to edc5516) Prune all ExitDelimiter* nodes and edges, which account for around 15%–25% of all nodes, depending on the input. As a side effect, this also fixes the following diff:

Before:

[screenshot: diff output before this change]

After:

[screenshot: diff output after this change]
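
For illustration, here is a minimal sketch of such a bitmap stack (names and details are illustrative, not the exact PR code). The low bits of a u64 hold the entries and a sentinel 1 bit marks the depth, so the type stays Copy and whole-stack equality is a single integer comparison:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct BitStack(u64);

impl BitStack {
    fn new() -> Self {
        BitStack(1) // just the sentinel: an empty stack
    }

    fn push(&mut self, bit: bool) {
        // The sentinel must not be shifted out of the u64.
        debug_assert!(self.0.leading_zeros() > 0, "stack deeper than 63");
        self.0 = (self.0 << 1) | bit as u64;
    }

    fn pop(&mut self) -> Option<bool> {
        if self.0 == 1 {
            return None; // only the sentinel left
        }
        let bit = self.0 & 1 == 1;
        self.0 >>= 1;
        Some(bit)
    }
}
```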

I think my work deserves a version bump and maybe a reddit announcement XD.

@QuarticCat commented:

Hi, I recently wrote a new crate that helps define tagged pointers. I haven't published it yet. Would you be interested in reviewing and using it?

https://github.com/QuarticCat/enum-ptr

@QuarticCat commented Nov 13, 2022

Speed: 400%
Memory: 29%

>>>>> BENCH START
Benchmark 1: /home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs
  Time (mean ± σ):     279.8 ms ±   3.1 ms    [User: 262.8 ms, System: 16.6 ms]
  Range (min … max):   276.1 ms … 286.0 ms    10 runs
 
>>>>> BENCH END
>>>>> BENCH START

 Performance counter stats for '/home/qc/.cargo/target/release/difft sample_files/slow_before.rs sample_files/slow_after.rs':

            279.68 msec task-clock:u                     #    0.999 CPUs utilized          
                 0      context-switches:u               #    0.000 /sec                   
                 0      cpu-migrations:u                 #    0.000 /sec                   
             1,224      page-faults:u                    #    4.376 K/sec                  
       948,057,136      cycles:u                         #    3.390 GHz                    
        17,016,497      stalled-cycles-frontend:u        #    1.79% frontend cycles idle   
       259,400,676      stalled-cycles-backend:u         #   27.36% backend cycles idle    
     1,304,835,983      instructions:u                   #    1.38  insn per cycle         
                                                  #    0.20  stalled cycles per insn
       240,075,671      branches:u                       #  858.387 M/sec                  
         9,024,946      branch-misses:u                  #    3.76% of all branches        

       0.280024823 seconds time elapsed

       0.259759000 seconds user
       0.019976000 seconds sys


>>>>> BENCH END
~/.cargo/target/release/difft sample_files/slow_{before,after}.rs > /dev/null   0.24s  user 0.01s system 99% cpu 0.247 total
avg shared (code):         0 KB
avg unshared (data/stack): 0 KB
total (sum):               0 KB
max memory:                159 MB
page faults from disk:     0
other page faults:         1286

I discovered another representation of parent stacks: record the parent vertex of the last PopBoth vertex, plus the number of PopEither entries in the two stacks. We can prove that if two vertices are equal and have the same parent stack, then their pop_both_ancestor must be equal. Thus, with this new representation, we can still achieve O(1) whole-stack comparison. This refactoring also relaxes the limit from 63 parents in total to u16::MAX consecutive PopEither entries. It also shaves 8 bytes off Vertex, since the two new u16 fields fit in the padding of Cell<bool>.
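
For concreteness, a sketch of this representation (field names are illustrative, not the exact PR code). Deriving PartialEq gives the O(1) whole-stack comparison directly, since only one ancestor id and two counters are compared:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct VertexId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ParentStack {
    pop_both_ancestor: Option<VertexId>, // parent vertex of the last PopBoth
    pop_lhs_cnt: u16, // consecutive PopEither entries on the LHS stack
    pop_rhs_cnt: u16, // consecutive PopEither entries on the RHS stack
}
```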

@QuarticCat commented:

@Wilfred Hi, I wrote a summary that you might be interested in.

https://blog.quarticcat.com/posts/optimize-difftastic/

Wilfred added a commit that referenced this pull request Dec 28, 2022
@QuarticCat observed that popping delimiters is unnecessary, and saw a
speedup in PR #401. This reduces the number of nodes in typical graphs
by ~20%, reducing runtime and memory usage.

This works because there is only one thing we can do at the end of a
list: pop the delimiter. The syntax node on the other side does not
give us more options, we have at most one. Popping all the delimiters
as soon as possible is equivalent, and produces the same graph route.

This change has also slightly changed the output of
sample_files/slow_after.rs, producing a better (more minimal)
diff. This is probably luck, due to the path-dependent nature of the
route-solving logic, but it's a positive sign.

A huge thanks to @QuarticCat for their contributions; this is a major
speedup.

Co-authored-by: QuarticCat <QuarticCat@pm.me>
@QuarticCat commented:

I just realized I forgot to post the latest progress (afe07ea - 80eae29). These commits have virtually no effect on speed, but memory usage dropped to 125 MB, which is ~23% of the original.

To further optimize memory usage, you can:

  • Remove unused fields from SyntaxInfo, including previous_sibling, prev, num_after, and unique_id (which can be replaced by pointers, since the nodes are pinned). You might want to keep them for experimenting with A* heuristics, though. I haven't found a way to easily remove and re-add these fields.
  • Simplify Syntax, including:
    • Change Vec<T> to Box<[T]>
    • Maybe concatenate {open,close}_position and {open,close}_content
    • Change String to Box<str> or &str
    • Remove children, since head_child + next_sibling is enough

These might give you another ~5 MB reduction, and will probably also yield slightly better performance from improved locality and memory access patterns (syntax.head_child vs syntax.children.get(0)).
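
A rough sketch of what the slimmer layout could look like (assumed shapes; the real Syntax has more fields and variants):

```rust
struct SyntaxInfo<'a> {
    next_sibling: Option<&'a Syntax<'a>>,
    head_child: Option<&'a Syntax<'a>>, // replaces `children: Vec<&Syntax>`
}

enum Syntax<'a> {
    List {
        info: SyntaxInfo<'a>,
        open_content: Box<str>,  // Box<str> drops String's capacity word
        close_content: Box<str>,
    },
    Atom {
        info: SyntaxInfo<'a>,
        content: Box<str>,
    },
}
```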

@QuarticCat commented Feb 8, 2023

I guess you are not willing to adopt some of my optimizations, which is fine. But there are still some harmless yet beneficial ones:

  • Skip visited vertices and remove the neighbor field from Vertex (a rough sketch follows this list)
  • Refactor the parent stack (pop_both_ancestor + pop_lhs_cnt + pop_rhs_cnt)
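
A rough sketch of the first idea (vertex storage shapes assumed): with a per-vertex visited flag, stale priority-queue entries can simply be discarded when popped, so neighbors never need to be cached on the Vertex:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn shortest_path_loop(heap: &mut BinaryHeap<Reverse<(u64, usize)>>, visited: &mut [bool]) {
    while let Some(Reverse((distance, idx))) = heap.pop() {
        if visited[idx] {
            continue; // already settled: skip the stale heap entry
        }
        visited[idx] = true;
        // ...compute the neighbors of `idx` on the fly and push them,
        // rather than storing them in a `neighbors` field...
        let _ = distance;
    }
}
```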

BTW, I found a TODO in your current code:

TODO: experiment with storing SyntaxId only, and have a HashMap from SyntaxId to &Syntax.

I don't think this is a good idea; even a Vec would be better than a HashMap here. And you still have a big problem: how do you implement Hash for Vertex? Unless this Vec / HashMap is a global or thread-local variable, you cannot access it inside the Hash impl. Storing a reference to it in each vertex is even worse, because then you lose the memory advantage.

According to my profiling results, most of the time is spent on hash-map operations. A hash function distributes keys roughly uniformly across slots, so in a hash map larger than your cache, nearly every lookup is a cache miss. I have a benchmark that verifies this idea, and I hope it can bring you some inspiration.
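
A toy version of such a benchmark might look like this (illustrative only, not the actual benchmark; timings are machine-dependent):

```rust
use std::collections::HashMap;
use std::time::Instant;

fn main() {
    const N: u64 = 10_000_000;
    let map: HashMap<u64, u64> = (0..N).map(|i| (i, i)).collect();
    let vec: Vec<u64> = (0..N).collect();

    let start = Instant::now();
    let mut sum = 0u64;
    for i in 0..N {
        sum = sum.wrapping_add(map[&i]); // hashed slot: effectively random access
    }
    println!("hashmap: {:?} (sum = {sum})", start.elapsed());

    let start = Instant::now();
    let mut sum = 0u64;
    for i in 0..N as usize {
        sum = sum.wrapping_add(vec[i]); // sequential: prefetcher-friendly
    }
    println!("vec:     {:?} (sum = {sum})", start.elapsed());
}
```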

@Wilfred commented Feb 9, 2023

Hey @QuarticCat!

Yep, I won't accept all these commits in their current form, but there are so many good ideas here that I still want to look at and apply. For example, in #395 you made the really clever observation that a Vertex only ever has a lhs_syntax or a lhs_parent, but not both, so there's major potential for memory savings there.

I'm not sure about the exact naming of the APIs, and SyntaxRefOrId made me think that I'd be better off treating lhs_syntax and lhs_parent consistently if at all possible.

I've also been thinking about the unsafe usage, and decided that I'd rather take the performance loss than reason about it. I see you've run Miri on your standalone pointer-tagging library, which gives me a bunch more confidence in that implementation :)

I don't think this is a good idea; even a Vec would be better than a HashMap here. And you still have a big problem: how do you implement Hash for Vertex?

This was inspired by your work that made vertices smaller :). I was thinking that the HashMap would not be stored in Vertex; instead, Vertex would only contain a SyntaxId, and I can have a separate global HashMap<SyntaxId, &Syntax> to get the actual syntax nodes. Hash for Vertex would still be fine: I can totally hash a SyntaxId.

It's possible that the additional hashmap lookup would hurt performance, but overall (again based on your observations) it looks like performance is very sensitive to the size of Vertex. It depends on what the benchmarks show, of course.
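
Roughly this shape (types assumed for illustration):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SyntaxId(u32);

struct Syntax {
    // ...the actual syntax node data, elided...
}

#[derive(PartialEq, Eq, Hash)]
struct Vertex {
    lhs_syntax: Option<SyntaxId>, // a small Copy id instead of a reference
    rhs_syntax: Option<SyntaxId>,
}

// Side table built once per file pair; vertices never store it.
fn lookup<'a>(table: &HashMap<SyntaxId, &'a Syntax>, id: SyntaxId) -> Option<&'a Syntax> {
    table.get(&id).copied()
}
```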

@Wilfred commented Feb 9, 2023

(I'm also still hopeful that I'll eventually get an A* implementation that's a performance win too. So many interesting avenues to explore!)
