
Change bloom_filter implementation of hash #26

Open
chris-ha458 opened this issue Aug 20, 2023 · 11 comments

@chris-ha458
Contributor

Currently, bloom_filter.rs uses ahash as its internal hasher.

This is problematic since ahash has an unstable representation; its documentation states:

> different computers or computers on different versions of the code will observe different hash values. As such, aHash is not recommended for use other than in-memory maps. Specifically, aHash is not intended for network use or in applications which persist hashed values.

I would love to learn whether the dolma developers have found a way to serialize it while maintaining some kind of portability, but that is not a supported use case, and I feel there is benefit in moving to a stable hash.

Recommendations

  • Rust's default hasher is guaranteed to be reasonably fast and resistant to HashDoS attacks. It is currently SipHash 1-3, and it supports keyed hashing (which can be used as a seeded hash).
  • Blake3 is one of the fastest, if not the fastest, cryptographic hashes. It also supports keyed hashing.
  • xxhash (more specifically its xxh3 variant) is one of the fastest, if not the fastest, hashers that pass SMHasher.
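To make the "stable representation" point concrete, here is a minimal sketch of a portable, seedable hash. It uses FNV-1a purely for brevity (it is not one of the candidates above, and the seeded variant shown is an illustration, not a standard construction): the output depends only on the input bytes and the seed, so it is identical across machines, architectures, and compiler versions, which is exactly the property ahash does not guarantee.

```rust
// FNV-1a sketch: fully deterministic, so hashes can be persisted or
// shared across machines -- unlike ahash's unstable representation.
fn fnv1a_64(seed: u64, data: &[u8]) -> u64 {
    const FNV_PRIME: u64 = 0x0000_0100_0000_01B3;
    // Fold the seed into the standard offset basis to get a keyed variant
    // (illustrative only; real keyed hashing should use SipHash/Blake3/xxh3).
    let mut hash = 0xCBF2_9CE4_8422_2325u64 ^ seed;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

fn main() {
    // Same input + same seed => same value, on every platform.
    assert_eq!(fnv1a_64(0, b"hello"), fnv1a_64(0, b"hello"));
    // Different seeds behave like independent hash functions.
    assert_ne!(fnv1a_64(0, b"hello"), fnv1a_64(1, b"hello"));
    println!("{:#018x}", fnv1a_64(0, b"hello"));
}
```

Any of the recommended hashes above provides this determinism with far better distribution and speed; the sketch only shows the contract a stable hash has to satisfy.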
@soldni
Member

soldni commented Aug 20, 2023

Tagging @dirkgr since bloom filter is his baby.

@dirkgr
Member

dirkgr commented Aug 21, 2023

I did not know this about ahash. This bloom filter is the first program I ever wrote in Rust, and it probably shows. I would love to move to a stable hash.

@chris-ha458
Contributor Author

If that is the case, please take a look at my previous PRs (#24 and especially #23) and take your time integrating them in whatever manner you see fit.

Most of my commits in #23 come from the output of cargo clippy.
I could easily have done it in one fell swoop with cargo clippy --fix, but I changed them manually, one by one.
This was because I felt there was a reason the code was arranged the way it was (in this case, the Python background of one of the developers).
I recognized this because that is how I, and many others, write Rust when first approaching it, especially from a Python background.

In my experience, the output of cargo clippy is more often than not what I wanted, while also being idiomatic and performant.

Going back to the subject: please review the changes in PR #23. If they are found acceptable,
I will prepare further PRs along those lines, eventually leading up to addressing hash alternatives.

@chris-ha458
Contributor Author

#24 was merged, but I feel a bit more is necessary, so I opened #31.
#23 is under review as well.
When #23 is merged, I will do some more general Rust-related fixes, including adding more unit tests.

By adding tests to bloom_filter.rs we can achieve several goals: not breaking anything, finding out how tightly the specific hash choice is coupled to the current code, and identifying ways to separate the code into units for both testing and replacement.

To share specific ideas on the implementation itself: I plan to integrate either xxhash or highwayhash, which are both widely used, high-speed, DoS-resistant, well-maintained hashes. Depending on the development effort involved, I might also add blake3, which is a secure hash.
I also plan to use only two hash functions, following "Less Hashing, Same Performance: Building a Better Bloom Filter"
(in fact, we only need one hashing algorithm if it is keyed, which all of the above are).
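The double-hashing trick from that paper can be sketched in a few lines (hypothetical helper name, not dolma's actual code): the k bloom filter probe positions are derived from just two base hash values h1 and h2, instead of running k independent hash functions. With a keyed hash, h1 and h2 can come from one algorithm under two seeds.

```rust
// Kirsch & Mitzenmacher double hashing: g_i(x) = h1(x) + i * h2(x),
// giving k probe indices from only two 64-bit hash values.
fn probe_indices(h1: u64, h2: u64, k: usize, num_bits: u64) -> Vec<u64> {
    (0..k as u64)
        .map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % num_bits)
        .collect()
}

fn main() {
    // Four probe positions in a 1024-bit filter, from h1 = 3 and h2 = 5.
    println!("{:?}", probe_indices(3, 5, 4, 1024)); // [3, 8, 13, 18]
}
```

The paper shows this preserves the asymptotic false-positive rate of k independent hashes, which is why two hash values suffice.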

Some more decisions need to be made regarding scalable bloom filters, but this isn't something I can just hack away at alone since it would require architectural decisions; let's cross that bridge when we arrive there.

@dirkgr
Member

dirkgr commented Sep 13, 2023

A bunch of these have been merged, but this one is still open. What's next for this PR?

@dirkgr
Member

dirkgr commented Sep 13, 2023

All of these improvements should also be merged into the standalone tool at https://github.com/allenai/bff.

@chris-ha458
Contributor Author

> A bunch of these have been merged, but this one is still open. What's next for this PR?

To be honest, I don't know. Some of my assumptions and opinions have changed since the original issue and PRs, and I am no longer sure how I would approach the bloom filter / exact hash question as is.
Fundamentally, this is because I am not sure what kind of datasets we are specifically dealing with.

Another blocker is #33. I might be wrong about the potential race condition, or I might be right but allenai considers the current behavior "good enough" given the already fuzzy nature of bloom filters.
I can only provide further input if I can try it out on my own and extract firm, clear data.

So I have been trying to replicate some pipeline work with CC release slices in order to test the pipeline on my own, but that has been stalled for a while.
Further work on this part will follow once I have actual data on which I can test the current and potential filters.

> All of these improvements should also be merged into the standalone tool at https://github.com/allenai/bff.

This is the first time I have heard of this; I will look into that codebase as well.

@soldni
Member

soldni commented Sep 14, 2023

I am okay with merging in #39, but yeah the goal is to eventually merge back into the BFF repo and release that as a crate.

@dirkgr given your schedule, do you have thoughts on letting implementations (dolma vs bff) diverge for a while?

@chris-ha458 are you blocked on #39 not being merged and/or other issues with the rust code?

@chris-ha458
Contributor Author

I am not blocked on #39; that can wait.
I am blocked on #33 and on the fact that I cannot replicate the intended workflow of dolma at the moment.
(This may not be dolma's fault but rather my failure to understand the intended process; in that case, further documentation would be helpful.)

I've given a cursory look at bff, and it seems to follow an implementation very similar to the one found here.
If it would help, I'd be happy to work on the bloom filter part there as well.
However, from the brief analysis I could do, #33 still applies:
it seems to me that there is a race condition there as well (because the code is similar to the code here).

If this is known and accepted (I disagree, but there are precedents such as Guava accepting the possibility of a race), I would want a testing suite (built by either allenai or me) to validate when it happens and by how much; that is the only way to tell whether it is "acceptable".

This goes back to my attempts at replicating the pipeline on my device.

@soldni
Member

soldni commented Sep 14, 2023

If you don't mind fixing the race condition, I'll be happy to merge a PR. I don't think we have strong opinions about which solution to use, so I'm happy for you to call it as you see fit :)

For an end-to-end example, there are two guides I can point you to at the moment. They use an older, internal codebase (allenai/LLM) that we are slowly partitioning out into more user-friendly public codebases. The CLI and configuration syntax have some major differences, but they are a good starting point for figuring out how to get around.

  1. Cookbook for OLMo ablations (Section 2 is relevant for tagging, deduping, and mixing)
  2. Wikipedia ablation example

I'm happy to chat on discord if you have questions. The plan is to refresh them for dolma over the weekend or next week.

@dirkgr
Member

dirkgr commented Oct 17, 2023

This is now an old thread, but I just noticed this, and for @soldni's benefit, I think I have to address the race condition: it is by design, and in fact key to how BFF achieves its performance. The result of the race condition is not a crash or data corruption; it is merely non-deterministic which duplicate gets kept and which gets thrown away. In very rare cases it might keep one duplicate that should have been thrown out. That is not a big deal for the performance gain you get.
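A minimal sketch of why such a race is benign (hypothetical code, not the actual bff implementation): with a check-then-set pattern over atomic words, two threads can both observe a bit as unset and both report "not seen"; the only consequence is that a duplicate is occasionally kept, never a crash or memory corruption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Check-then-set on a shared bit vector. The window between load and
// store is the race: a concurrent set in that window can be lost, so a
// duplicate may slip through -- the filter only ever errs on the side
// of keeping a document, which is safe for deduplication.
fn check_and_set(bits: &[AtomicU64], bit: u64) -> bool {
    let (word, mask) = ((bit / 64) as usize, 1u64 << (bit % 64));
    let old = bits[word].load(Ordering::Relaxed); // racy window opens here
    bits[word].store(old | mask, Ordering::Relaxed); // may lose a concurrent set
    (old & mask) != 0 // true => bit was already set (seen before)
}

fn main() {
    let bits: Vec<AtomicU64> = (0..16).map(|_| AtomicU64::new(0)).collect();
    assert!(!check_and_set(&bits, 42)); // first sighting
    assert!(check_and_set(&bits, 42)); // duplicate detected
}
```

Making the update lock-free but lossless would be a one-line change to `fetch_or(mask, Ordering::Relaxed)`, at the cost of a read-modify-write on every probe; the sketch above illustrates the trade-off dirkgr describes, where the cheaper racy version is accepted deliberately.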
