chore: Optimize HashMap: bucketed approach #4603

Draft: jfecher wants to merge 2 commits into base: master
Conversation

jfecher commented Mar 20, 2024

Description

Problem*

Resolves #4584

Summary*

HashMap on master is quite slow. Notably:

  • get/insert iterate over the entire HashMap, since the map is flat and we can't use break in constrained code.
  • Since the map uses quadratic probing, the element is rehashed each time a collision occurs.

I'm making this draft PR to test out a few alternatives in terms of proving time. So far I have two alternative implementations in mind:

  • A bucketed hash map (this PR). By giving each hash index its own bucket, get/insert only need to iterate through one bucket instead of the entire map. The downside is that we can now have at most BUCKET_ITEMS collisions per bucket before we must assert/error. There is also a memory tradeoff, since the total number of slots becomes N * BUCKET_ITEMS. For reference, many probing hashmaps tend to have a capacity of roughly N * 2, similar to Vectors. A minimal sketch of this layout is shown after this list.
  • Using unconstrained (PR here: chore: Optimize HashMap: unconstrained approach #4605). By using unconstrained code to retrieve an element we can use break, which gives obvious speedups. The downside of this approach is mostly the danger of using unconstrained, and that it requires the library to be vigilant about under-constrained code. This is also a bit unspecific, since "using unconstrained" is rather general: there will be large differences depending on the specific HashMap architecture chosen around the unconstrained functions. The current unconstrained code is more of a proof of concept as well. It isn't yet abstracted over arbitrary key and value types, for example. Its accessor methods are also different in that they assert a value is present instead of returning an optional value. Another notable difference is that the pedersen hasher is used for the other hash map types, whereas the unconstrained approach does not hash the keys at all.
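For illustration, here is a minimal sketch of the bucketed layout described in the first bullet. This is not the PR's implementation: the sizes are fixed constants instead of generics, keys and values are hard-coded to Field, and bucket_index is a hypothetical stand-in for the real hasher (the actual map hashes keys, e.g. with pedersen).

global NUM_BUCKETS: u32 = 8;
global BUCKET_ITEMS: u32 = 4;

struct BucketedSketch {
    // NUM_BUCKETS * BUCKET_ITEMS = 32 slots in total
    slots: [Option<(Field, Field)>; 32],
}

// Hypothetical stand-in for the real hasher
fn bucket_index(key: Field) -> u32 {
    (key as u32) % NUM_BUCKETS
}

impl BucketedSketch {
    fn empty() -> Self {
        BucketedSketch { slots: [Option::none(); 32] }
    }

    fn get(self, key: Field) -> Option<Field> {
        let start = bucket_index(key) * BUCKET_ITEMS;
        let mut result: Option<Field> = Option::none();
        // Only the key's bucket is scanned, never the whole map
        for i in 0..BUCKET_ITEMS {
            let slot = self.slots[start + i];
            let (k, v) = slot.unwrap_or((0, 0));
            if slot.is_some() & (k == key) {
                result = Option::some(v);
            }
        }
        result
    }

    fn insert(&mut self, key: Field, value: Field) {
        let start = bucket_index(key) * BUCKET_ITEMS;
        let mut inserted = false;
        // Take the first free slot in the key's bucket
        // (the real map also has to replace an existing entry for the same key)
        for i in 0..BUCKET_ITEMS {
            let slot = self.slots[start + i];
            if (!inserted) & slot.is_none() {
                self.slots[start + i] = Option::some((key, value));
                inserted = true;
            }
        }
        // At most BUCKET_ITEMS colliding keys fit per bucket before this fires
        assert(inserted, "Bucket in HashMap exceeded the maximum capacity!");
    }
}

Both loops run BUCKET_ITEMS iterations rather than N, which is where the speedup over master comes from; the cost is the NUM_BUCKETS * BUCKET_ITEMS slot count mentioned above.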

All approaches are works in progress so any suggestions on further optimizations are appreciated.

The different approaches used to be separate PRs, but they got too spammy. They're now just branches:

Tests

Simple Insert and Get

fn main(x: Field) {
    let mut map: HashMap<Field, Field, N, BuildHasherDefault<PedersenHasher>> = HashMap::default();
    map.insert(1, x);
    println(map.get(1));
}

Timings:

N = 100:
    master: 6.762s
    master (max iters = 1): 0.362s
    master (max iters = 4): 0.516s
    master (max iters = 8): 0.757s
    bucketed (bucket size = 2): 0.324s
    bucketed (bucket size = 4): 0.344s
    linear (max iters = 1): 0.364s
    linear (max iters = 4): 0.497s
    linear (max iters = 8): 0.710s
    unconstrained: 0.356s

N = 200:
    master: 35.304s
    master (max iters = 1): 0.431s
    master (max iters = 4): 0.849s
    master (max iters = 8): 1.462s
    bucketed (bucket size = 2): 0.355s
    bucketed (bucket size = 4): 0.350s
    linear (max iters = 1): 0.416s
    linear (max iters = 4): 0.753s
    linear (max iters = 8): 1.263s
    unconstrained: 0.386s

master & linear are both getting noticeably slower as the map gets larger, which is odd and concerning.

75% full, 50% get

fn main(x: Field) {
    let mut map: HashMap<Field, u64, N, BuildHasherDefault<PedersenHasher>> = HashMap::default();

    // Insert into map until 75% full with pseudo-random keys
    // - Also perform a get on each even key
    for i in 0 .. N / 4 * 3 {
        // Add x to prevent anything from being optimized away before runtime
        let key = pedersen_hash([i as Field + x]);
        map.insert(key, i);

        if key as u64 % 2 == 0 {
            let value = map.get(key);
            // Ensure get isn't optimized out
            assert(value.is_some());
        }
    }
}

Timings (all use nargo compiled in debug, 1 warmup run, median of just 3 tries):

N = 100:
    master: OOM killed after 48.0s
    master (max iters = 1): error: Assertion failed: 'HashMap::insert: too many collisions, out of space!'
    master (max iters = 4): error: Assertion failed: 'HashMap::insert: too many collisions, out of space!'
    master (max iters = 8): error: Assertion failed: 'HashMap::insert: too many collisions, out of space!'
    bucketed (bucket size = 2): error: Assertion failed: 'Bucket in HashMap exceeded the maximum capacity of 2!'
    bucketed (bucket size = 4): 1.886s
    bucketed (bucket size = 5): 2.102s
    linear (max iters = 1): error: Assertion failed: 'HashMap::insert: too many collisions, out of space!'
    linear (max iters = 4): error: Assertion failed: 'HashMap::insert: too many collisions, out of space!'
    linear (max iters = 8): 42.3s
    unconstrained: fails after 2.87s with `path_check != 1`. More specifically:
        is_first_entry = false, insert_at_start = true, insert_at_end = false, insert_between_two_entries = true, found_collision = false


N = 200:
    master: OOM killed after 30.6s
    master (max iters = 1): error: Assertion failed: 'HashMap::insert: too many collisions, out of space!'
    master (max iters = 4): OOM killed after 65.3s
    master (max iters = 8): OOM killed after 63.0s
    bucketed (bucket size = 2): error: Assertion failed: 'Bucket in HashMap exceeded the maximum capacity of 2!'
    bucketed (bucket size = 4): error: Assertion failed: 'Bucket in HashMap exceeded the maximum capacity of 4!'
    bucketed (bucket size = 5): 8.459s
    linear (max iters = 1): error: Assertion failed: 'HashMap::insert: too many collisions, out of space!'
    linear (max iters = 4): error: Assertion failed: 'HashMap::insert: too many collisions, out of space!'
    linear (max iters = 8): OOM killed after 53.03s
    unconstrained: fails after 11.615s with `path_check != 1`. More specifically:
        is_first_entry = false, insert_at_start = true, insert_at_end = false, insert_between_two_entries = true, found_collision = false
  • There isn't much to say about master. It takes over half a minute before it is eventually OOM killed.
  • For the bucketed approach, more and more buckets fill up as the map gets fuller. With pseudo-random keys we get 5 collisions when a map with 200 buckets is filled with 150 elements. Increasing the bucket size to 5 works, but means we're now using 200 * 5 = 1000 total slots for elements.
  • For unconstrained, it looks like there is a logic error causing an assertion to fail currently. Even before the assertion triggers in the N = 200 case though, it is still slower than the bucketed approach with a bucket size of 5.

Documentation*

Check one:

  • No documentation needed.
  • Documentation included in this PR.
  • [Exceptional Case] Documentation to be submitted in a separate PR.

PR Checklist*

  • I have tested the changes locally.
  • I have formatted the changes with Prettier and/or cargo fmt on default settings.

jfecher changed the title from "chore: Optimize HashMap" to "chore: Optimize HashMap: bucketed approach" on Mar 20, 2024
jfecher commented Mar 21, 2024

I've been doing some testing on varying the bucket size along with the overall size of the hashmap for the 75% filled test. I've found that increasing the bucket size actually tends to be more space-efficient than increasing the size of the overall hashmap.

For example, the 75% test passes with:

  • 5-len buckets, map of len 200 (1000 total slots)
  • 4-len buckets, map of len 400 (1600 total slots)
  • 2-len buckets, map of len 1000 (2000 total slots)

(tested in increments of 200 map length). Each of these is minimal, so decreasing the bucket or map len (by the 200 increment at least) in any of these cases would cause the test to fail.

Also note that this is very dependent on how many hash collisions there are. This example uses the pedersen_hash function to generate a key from a loop index, then the inner hash map hashes this again via pedersen. As a rule, the fewer collisions there are (mod the hashmap length) the smaller the hashmap can be.

jfecher commented Mar 21, 2024

This unpredictability in what bucket size is needed adds to the general unpredictability of what size is needed for the overall hashmap. Increasing the bucket length further makes it more likely that a hashmap of len N can actually store close to N items, but increasing the bucket len also multiplies the total number of slots.

Other approaches like linear probing hashmaps automatically store elements with conflicting hashes in the next slot. This isn't practical for this hashmap, since we would never be able to break out of the loop when probing to the next slot. It could be worth trying this approach with a bounded maximum on the probing loop though. Theoretically, a linear probing hashmap with a bounded maximum pseudo-bucket size of, say, 5 would have much more sharing than a bucketed approach. This is because in the bucketed approach each bucket only stores elements of the same hash (mod len). With linear probing, however, a random slice of say 5 slots in the hashmap may begin with 2 elements which had the same hash (mod len), followed by an empty slot, followed by two more elements with different hashes. So a linear probing map with a maximum loop of say 10 elements could still be more space efficient than a bucketed hashmap with a bucket size of 5.

Note that for a linear probing map to be implemented in a flat fashion with a maximum loop bound like this, we also need to append that many empty slots to the end of the map to ensure we can still probe past the last slot that many times. This small constant factor shouldn't be an issue though.
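To make this concrete, here is a rough sketch of the bounded probing just described, under the same simplifications as the bucketed sketch earlier (fixed sizes, Field keys and values, and a hypothetical slot_index helper standing in for the real hash). Note the MAX_PROBES padding slots appended after the LEN real slots.

global LEN: u32 = 8;
global MAX_PROBES: u32 = 4;

struct LinearProbingSketch {
    // LEN real slots plus MAX_PROBES padding slots at the end (8 + 4 = 12),
    // so the probe loop can always run MAX_PROBES times past the starting slot
    slots: [Option<(Field, Field)>; 12],
}

// Hypothetical stand-in for the real hasher
fn slot_index(key: Field) -> u32 {
    (key as u32) % LEN
}

impl LinearProbingSketch {
    fn get(self, key: Field) -> Option<Field> {
        let start = slot_index(key);
        let mut result: Option<Field> = Option::none();
        // No break in constrained code, so every lookup probes exactly MAX_PROBES slots
        for i in 0..MAX_PROBES {
            let slot = self.slots[start + i];
            let (k, v) = slot.unwrap_or((0, 0));
            if slot.is_some() & (k == key) {
                result = Option::some(v);
            }
        }
        result
    }
}

Unlike a bucket, the MAX_PROBES slots scanned for one key can also hold entries whose hashes landed on neighbouring indices, which is the sharing argument above.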

jfecher commented Mar 21, 2024

So far the linear probing hashmap has been performing much worse than expected. Putting aside the size failures, I'm not sure why a linear probing map with 4 max iterations would be performing so much worse time-wise than a bucketed map with a bucket length of 4, and thus the same 4 iterations through each element of a bucket. When you then consider that the linear probing map has an actual length and slot count of N instead of N * 4 for the bucketed map with 4-length buckets, this difference gets even weirder.


Changes to circuit sizes

Generated at commit: 6e58e74e49a6d22bd254370483dde0c6387c6f3e, compared to commit: ea3bb7f0673b9aa9b290ce263e6ce36475e0bc1a

Summary (10% most significant diffs):

Program   ACIR opcodes (+/-)     %          Circuit size (+/-)     %
hashmap   -188,852               -86.13%    -422,525               -85.92%

Full diff report:

Program   ACIR opcodes (+/-)     %          Circuit size (+/-)     %
hashmap   30,419 (-188,852)      -86.13%    69,239 (-422,525)      -85.92%

@jfecher
Copy link
Contributor Author

jfecher commented Mar 22, 2024

After speaking with @guipublic, I've determined the somewhat odd performance patterns here to be the result of arrays being copied in ACIR if they are used again afterward. All of these approaches write to the main array of the hashmap, but I think the bucketed approach may write to it less often, since most writes go to a local bucket until that bucket is written back to the bucket array at the end.

I've made an issue for this here: #4621

I think adding a copy-on-write optimization for these arrays, similar to the one brillig has, would be useful. This is partly evidenced by the speedup these hashmaps see when main is changed to unconstrained. After doing this with no other changes, the 75% test for the bucketed map with bucket size 5 goes from 8.5s to 0.5s.
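For reference, the only source change for that last comparison is marking main as unconstrained, so the whole program executes as brillig; the body below is the 75% test from above, reproduced unchanged just to show where the keyword goes (same N, hasher, and imports as before).

unconstrained fn main(x: Field) {
    let mut map: HashMap<Field, u64, N, BuildHasherDefault<PedersenHasher>> = HashMap::default();

    // Insert into map until 75% full with pseudo-random keys
    // - Also perform a get on each even key
    for i in 0 .. N / 4 * 3 {
        // Add x to prevent anything from being optimized away before runtime
        let key = pedersen_hash([i as Field + x]);
        map.insert(key, i);

        if key as u64 % 2 == 0 {
            let value = map.get(key);
            // Ensure get isn't optimized out
            assert(value.is_some());
        }
    }
}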

jfecher commented Mar 26, 2024

I've looked into this some more and found that every array_set is actually already mutated rather than copied. The existing last-use analysis applies to 100% of array sets on master, for example. Manually setting ACIR gen to always mutate also does not improve performance. So I'm back to square one on why master & linear are so much slower than the bucketed approach. I wonder if it could be related to #4629.
