Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use ahash for our own HashMap and HashSet objects #10502

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mqudsi
Copy link
Contributor

@mqudsi mqudsi commented May 18, 2024

We already transitively take a dependency on the ahash crate, so might as well use it. While not all of our HashMap/HashSet objects are performance-sensitive, some of them are and there's no cost to migrating them all over.

By default, rust uses a DoS-resistant, cryptographic hash (SipHash) to hash the keys for HashMap/HashSet entries into buckets, which is slower than what we used to use in the C++ world. We don't use any of it in a security-sensitive context or based off of unvalidated/untrusted/remote input, so there are no security reasons to continue to use SipHash.

Some of fish's HashMap/HashSet instances could benefit from being faster, e.g. the functions cache, various completion caches, environment variable tables, and history lookups, so this isn't just a "use it because it's there" kind of thing.

We already transitively take a dependency on the ahash crate, so might as well
use it. While not all of our HashMap/HashSet objects are performance-sensitive,
some of them are and there's no cost to migrating them all over.

By default, rust uses a DoS-resistant, cryptographic hash (SipHash) to hash the
keys for HashMap/HashSet entries into buckets, which is slower than what we used
to use in the C++ world. We don't use any of it in a security-sensitive context
or based off of unvalidated/untrusted/remote input, so there are no security
reasons to continue to use SipHash.

Some of fish's HashMap/HashSet instances could benefit from being faster, e.g.
the functions cache, various completion caches, environment variable tables, and
history lookups, so this isn't just a "use it because it's there" kind of thing.
@mqudsi mqudsi added performance Purely performance-related enhancement without any changes in black box output rust labels May 18, 2024
@ridiculousfish
Copy link
Member

I don't mind the implications of the change, but it is noisy - it touches every HashMap and HashSet. If this could be set globally I'd be in favor, but I don't like cluttering up the code without a clear, measurable benefit.

We should just use the default hash algorithm generally, and switch to a faster algorithm in any cases where profiling shows hashing to be a bottleneck.

@Thomas-airmatrix
Copy link

Just some thoughts by a passer-by.

I have found either always using std or always using ahash is nice because you don't have spend effort on making a choice each time. Although, a potential issue of choosing a non-std map globally is libraries that non-generically consume/produce std ones however I doubt that will be an issue for fish.

Would be cool to see some benchmarks for the global change though.

A potentially less noisy approach might be to use the provided aliases in the ahash crate.
https://docs.rs/ahash/latest/ahash/type.HashMap.html

The std hashmap already has to be imported so the diff is mainly the import path. (Will still need the new() to default() as new() hardcodes the hasher).

- use std::collections::{HashMap, HashSet};
+ use ahash::{HashMap, HashSet};

Although, you then have to remember to auto import the right HashMap.

Note: Using use ahash::{AHashMap as HashMap} would allow using the new() method but that would cause 3 levels of newtype wrapping of generic structs:
AHashMap -> std::_::HashMap -> hashbrown::HashMap

With the current structure of the code base you could also define an alias somewhere like common and then import from there. That would allow a global change in the future. Also since rust-analyzer's auto-import sorting prefers crate level imports by default it'll import the correct one with an alias defined in common. However, if y'all ever split fish up into smaller crates in a workspace that might be less helpful.

@mqudsi
Copy link
Contributor Author

mqudsi commented May 23, 2024

I mainly feel the same way about this as @ridiculousfish does (despite being the one that opened the issue). Since the runtime/build cost of the change is zero, I initially went for a blanket find-and-replace approach (not literally, since that actually doesn't compile) but need to be convinced by benchmarks that the churn is worth it because of the degraded contribution experience. It's "just" HashMap and HashSet right now, but we can easily end up in a case like the poor folk at Google or Meta (née Facebook) that basically can't compile a single line of code without using custom types for everything. Given that the most "one place to change things" solution in lieu of a genuine crate-level option is to pub(crate) use ahash::HashMap as HashMap in src/lib.rs then using use crate::HashMap instead of use std::collections::HashMap everywhere, it might indeed make more sense to see which maps/sets truly benefit from the change (though this is the extra work on behalf of the maintainers that I was hoping to avoid).

@Thomas-airmatrix:

I don't like the ahash::HashMap because it's not a type alias but a type wrapper (as you end up mentioning), and feel like for a type that offers an almost identical API with the standard library it's not a great option when, thanks to rust type inference, the solution was just to add a second/third generic type parameter to each HashMap/HashSet decaration and then the rest of the uses sort of just work with type inference.

I think the crate-level use-indirection I mentioned above would be a more elegant solution than use ahash::HashMap as HashMap in file, but it's certainly still less than ideal.

One thing about this PR that I did not enjoy: while in most cases changing struct Foo { map: HashMap<K,V> } to struct Foo { map: HashMap<K,V,AH> and HashMap::new() to HashMap::default() did the trick (thankfully we no longer have to manually use std::default::Default in each file!), the fact that I couldn't change HashMap::from(...) to HashMap::<_, _, AH>::from(..) was not fun. I originally included ahash in Cargo.toml with use-default-features = false because I wasn't using the ahash::HashMap "alias" but ended up having to re-add the default features just to be able to use ahash::HashMap::from(..).into() instead (again necessitating the .into() call just because it is a newtype and not an alias).

@faho
Copy link
Member

faho commented May 23, 2024

So do we have any benchmarks here?

I don't think most of these are very hot, except for maybe the function set and the wildcard sets.

(of course the difference with e.g. the highlight maps is that you don't need throughput - it's plenty fast already because it's a purely interactive thing, you want reduced resource use)

@faho
Copy link
Member

faho commented May 26, 2024

Okay, our benchmark suite (plus some additions I've been working on) doesn't think this changes a lot.

Even if I try to specifically exercise this, like I run builtin history search a bunch, or make a completion with a huge list of conditions, some duplicated and some not, I get a result that's barely above system noise (1%-2%).

That means, because this complicates the code slightly, I would be disinclined to merge it.

Of course if you have any benchmarks to the contrary (in particular I haven't found a good way to benchmark highlighting or the reader history search), that's something to consider.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Purely performance-related enhancement without any changes in black box output rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants