
Change bloom_filter implementation of hash #26

Open
chris-ha458 opened this issue Aug 20, 2023 · 11 comments

@chris-ha458
Contributor

Currently, bloom_filter.rs uses ahash as its internal hasher.

This is problematic since ahash has an unstable representation; its documentation states:

> different computers or computers on different versions of the code will observe different hash values. As such, aHash is not recommended for use other than in-memory maps. Specifically, aHash is not intended for network use or in applications which persist hashed values.

I would love to learn whether the dolma developers have found a way to serialize it while maintaining some kind of portability, but that is not a supported use case, and I feel there is benefit in moving to a stable hash.

Recommendations

  • Rust's default hasher is guaranteed to be reasonably fast and resistant to HashDoS attacks. It is currently SipHash 1-3, and it supports keyed hashing (which can be used as a seeded hash).
  • Blake3 is one of the fastest, if not the fastest, cryptographic hashes. It also supports keyed hashing.
  • xxhash (more specifically its xxh3 variant) is one of the fastest, if not the fastest, hashers that pass SMHasher.
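To make the "stable representation" point concrete, here is a minimal sketch of a portable, seedable hash. It uses FNV-1a purely for brevity (it is not one of the candidates above, and the seeded variant shown is an illustration, not a standard construction): the output depends only on the input bytes and the seed, so it is identical across machines, architectures, and compiler versions, which is exactly the property ahash does not guarantee.

```rust
// FNV-1a sketch: fully deterministic, so hashes can be persisted or
// shared across machines -- unlike ahash's unstable representation.
fn fnv1a_64(seed: u64, data: &[u8]) -> u64 {
    const FNV_PRIME: u64 = 0x0000_0100_0000_01B3;
    // Fold the seed into the standard offset basis to get a keyed variant
    // (illustrative only; real keyed hashing should use SipHash/Blake3/xxh3).
    let mut hash = 0xCBF2_9CE4_8422_2325u64 ^ seed;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

fn main() {
    // Same input + same seed => same value, on every platform.
    assert_eq!(fnv1a_64(0, b"hello"), fnv1a_64(0, b"hello"));
    // Different seeds behave like independent hash functions.
    assert_ne!(fnv1a_64(0, b"hello"), fnv1a_64(1, b"hello"));
    println!("{:#018x}", fnv1a_64(0, b"hello"));
}
```

Any of the recommended hashes above provides this determinism with far better distribution and speed; the sketch only shows the contract a stable hash has to satisfy.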
@soldni
Member

soldni commented Aug 20, 2023

Tagging @dirkgr since bloom filter is his baby.

@dirkgr
Member

dirkgr commented Aug 21, 2023

I did not know this about ahash. This bloom filter is the first program I ever wrote in Rust, and it probably shows. I would love to move to a stable hash.

@chris-ha458
Contributor Author

If that is the case, please take a look at my previous PRs (#24 and especially #23) and take your time integrating them in whatever manner you see fit.

Most of my commits in #23 come from the output of cargo clippy.
I could easily have done it in one fell swoop with cargo clippy --fix, but I changed them manually, one by one.
This was because I felt there was a reason the code was arranged the way it was (in this case, the Python background of one of the developers).
I recognized this because that is how I, and many others, write Rust when first approaching it, especially from a Python background.

In my experience, the output of cargo clippy is more often than not what I wanted, while also being idiomatic and performant.

Going back to the subject: please review the changes in PR #23. If they are found acceptable,
I will prepare further PRs along those lines, eventually leading up to addressing hash alternatives.

@chris-ha458
Contributor Author

#24 was merged, but I feel a bit more is necessary, so I opened #31.
#23 is under review as well.
When #23 is merged, I will do some more general Rust-related fixes, including adding more unit tests.

By adding tests to bloom_filter.rs we can achieve several goals: not breaking anything, finding out how tightly the specific hash choice is coupled to the current code, and identifying ways to separate the code into units for both testing and replacement.

To share specific ideas on the implementation itself: I plan to integrate either xxhash or highwayhash, which are both widely used, high-speed, DoS-resistant, well-maintained hashes. Depending on the development effort involved, I might also add blake3, which is a secure hash.
I also plan to use only two hash functions, following "Less Hashing, Same Performance: Building a Better Bloom Filter"
(in fact, we only need one hashing algorithm if it is keyed, which all of the above are).
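The double-hashing trick from that paper can be sketched in a few lines (hypothetical helper name, not dolma's actual code): the k bloom filter probe positions are derived from just two base hash values h1 and h2, instead of running k independent hash functions. With a keyed hash, h1 and h2 can come from one algorithm under two seeds.

```rust
// Kirsch & Mitzenmacher double hashing: g_i(x) = h1(x) + i * h2(x),
// giving k probe indices from only two 64-bit hash values.
fn probe_indices(h1: u64, h2: u64, k: usize, num_bits: u64) -> Vec<u64> {
    (0..k as u64)
        .map(|i| h1.wrapping_add(i.wrapping_mul(h2)) % num_bits)
        .collect()
}

fn main() {
    // Four probe positions in a 1024-bit filter, from h1 = 3 and h2 = 5.
    println!("{:?}", probe_indices(3, 5, 4, 1024)); // [3, 8, 13, 18]
}
```

The paper shows this preserves the asymptotic false-positive rate of k independent hashes, which is why two hash values suffice.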

Some more decisions need to be made regarding scalable bloom filters, but this isn't something I can just hack away at alone since it would require architectural decisions; let's cross that bridge when we arrive there.

@dirkgr
Member

dirkgr commented Sep 13, 2023

A bunch of these have been merged, but this one is still open. What's next for this PR?

@dirkgr
Member

dirkgr commented Sep 13, 2023

All of these improvements should also be merged into the standalone tool at https://github.com/allenai/bff.

@chris-ha458
Contributor Author

> A bunch of these have been merged, but this one is still open. What's next for this PR?

To be honest, I don't know. Some of my assumptions and opinions have changed since the original issue and PRs, and I am no longer sure how I would approach the bloom filter / exact hash question as is.
Fundamentally, this is because I am not sure what kind of datasets we are specifically dealing with.

Another blocker is #33. I might be wrong about the potential race condition, or I might be right but allenai considers the current behavior "good enough" given the already fuzzy nature of bloom filters.
I can only provide further input if I can try it out on my own and extract firm, clear data.

So I have been trying to replicate some pipeline work with CC release slices in order to test the pipeline on my own, but that has been stalled for a while.
Further work on this part will follow once I have actual data on which I can test the current and potential filters.

> All of these improvements should also be merged into the standalone tool at https://github.com/allenai/bff.

This is the first time I have heard of this; I will look into that codebase as well.

@soldni
Member

soldni commented Sep 14, 2023

I am okay with merging in #39, but yeah the goal is to eventually merge back into the BFF repo and release that as a crate.

@dirkgr given your schedule, do you have thoughts on letting implementations (dolma vs bff) diverge for a while?

@chris-ha458 are you blocked on #39 not being merged and/or other issues with the rust code?

@chris-ha458
Contributor Author

I am not blocked on #39; that can wait.
I am blocked on #33 and on the fact that I cannot replicate the intended workflow of dolma at the moment.
(This may not be dolma's fault but rather my failure to understand the intended process; in that case, further documentation would be helpful.)

I've given a cursory look at bff, and it seems to follow an implementation very similar to the one found here.
If it would help, I'd be happy to work on the bloom filter part there as well.
However, from the brief analysis I could do, #33 still applies:
it seems to me that there is a race condition there as well (because the code is similar to the code here).

If this is known and accepted (I disagree, but there are precedents such as Guava accepting the possibility of a race), I would want a testing suite (built by either allenai or me) to validate when it happens and by how much; that is the only way to tell whether it is "acceptable".

This goes back to my attempts at replicating the pipeline on my device.

@soldni
Member

soldni commented Sep 14, 2023

If you don't mind fixing the race condition, I'll be happy to merge a PR. I don't think we have strong opinions about which solution to use, so I'm happy for you to call it as you see fit :)

For an end-to-end example, there are two guides I can point you to at the moment. They use an older, internal codebase (allenai/LLM) that we are slowly partitioning out into more user-friendly public codebases. The CLI and configuration syntax have some major differences, but they are a good starting point for figuring out how to get around.

  1. Cookbook for OLMo ablations (Section 2 is relevant for tagging, deduping, and mixing)
  2. Wikipedia ablation example

I'm happy to chat on discord if you have questions. The plan is to refresh them for dolma over the weekend or next week.

@dirkgr
Member

dirkgr commented Oct 17, 2023

This is now an old thread, but I just noticed this, and for @soldni's benefit, I think I have to address the race condition: it is by design, and in fact key to how BFF achieves its performance. The result of the race condition is not a crash or data corruption; it is merely non-deterministic which duplicate gets kept and which gets thrown away. In very rare cases it might keep one duplicate that should have been thrown out. That is not a big deal for the performance gain you get.
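A minimal sketch of why such a race is benign (hypothetical code, not the actual bff implementation): with a check-then-set pattern over atomic words, two threads can both observe a bit as unset and both report "not seen"; the only consequence is that a duplicate is occasionally kept, never a crash or memory corruption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Check-then-set on a shared bit vector. The window between load and
// store is the race: a concurrent set in that window can be lost, so a
// duplicate may slip through -- the filter only ever errs on the side
// of keeping a document, which is safe for deduplication.
fn check_and_set(bits: &[AtomicU64], bit: u64) -> bool {
    let (word, mask) = ((bit / 64) as usize, 1u64 << (bit % 64));
    let old = bits[word].load(Ordering::Relaxed); // racy window opens here
    bits[word].store(old | mask, Ordering::Relaxed); // may lose a concurrent set
    (old & mask) != 0 // true => bit was already set (seen before)
}

fn main() {
    let bits: Vec<AtomicU64> = (0..16).map(|_| AtomicU64::new(0)).collect();
    assert!(!check_and_set(&bits, 42)); // first sighting
    assert!(check_and_set(&bits, 42)); // duplicate detected
}
```

Making the update lock-free but lossless would be a one-line change to `fetch_or(mask, Ordering::Relaxed)`, at the cost of a read-modify-write on every probe; the sketch above illustrates the trade-off dirkgr describes, where the cheaper racy version is accepted deliberately.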
