Potential race condition #33

Open
chris-ha458 opened this issue Aug 25, 2023 · 5 comments

Comments

@chris-ha458
Contributor

I have been looking into https://github.com/allenai/dolma/blob/main/src/bloom_filter.rs, specifically at how it is thread-safe.

pub fn contains_hashes(&self, hashes: &Vec<u64>) -> bool {
    for hash in hashes {
        let hash = *hash as usize;
        let index = hash / 32 % self.bits.len();
        let bit = hash % 32;
        if self.bits[index].load(Ordering::Relaxed) & (1 << bit) == 0 {
            return false;
        }
    }
    return true;
}

Here, while each access to an individual AtomicU32 in the array is atomic, the method checks a condition across multiple AtomicU32 values without holding a lock on the Vec. Multiple threads can therefore read and write individual AtomicU32 values independently of one another.
If one or more of the values change due to another thread's write during the execution of contains_hashes, the method can return an incorrect result.

Consider the scenario where thread A is executing contains_hashes while thread B is concurrently inserting the same hashes. Thread A may load a bit that thread B has not yet set and return false, even though thread B's insertion completes before thread A returns. Since the check is not atomic across the entire vector, thread A acts on a stale view, and the method can return false when it should return true.
The problem is compounded as more threads modify bits that other threads have already checked, and it is exacerbated when the data contains a high rate of true positives (duplicates).
This is particularly concerning because Bloom filters are not supposed to produce false negatives, and the problem grows worse as we increase the number of threads or as the underlying data contains more duplicates.
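
For reference, here is a minimal sketch of the writer side (hypothetical, modeled on the indexing scheme in contains_hashes above rather than copied from the dolma source). Each bit is set by an independent fetch_or, so a reader iterating over the same hashes can interleave between any two of these writes:

use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical writer mirroring the indexing above: each bit is set by an
// independent atomic fetch_or, so a concurrent contains_hashes call can
// observe some of a document's bits as set while others are still unset.
pub fn insert_hashes(bits: &[AtomicU32], hashes: &[u64]) {
    for hash in hashes {
        let hash = *hash as usize;
        let index = hash / 32 % bits.len();
        let bit = hash % 32;
        bits[index].fetch_or(1 << bit, Ordering::Relaxed);
    }
}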

In summary, the code exhibits a race condition: the result of the contains_hashes method can be affected by concurrent updates to the bits vector. Each individual access to an AtomicU32 is atomic, but the method's logic requires a consistent view across multiple AtomicU32 values, and that consistency is not guaranteed.

To correct this, a higher-level synchronization mechanism (e.g., a read-write lock over the whole Vec) might be required to make the entire check in contains_hashes atomic with respect to updates to the bits vector.

This is the approach taken by another Rust crate, ofilter.
Relevant source:

pub struct SyncBloom<T> {
    inner: Arc<RwLock<Bloom<T>>>,
}
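
Adapted to this code, the fix might look like the following (a sketch under the assumption that the filter type exposes contains_hashes and insert_hashes as discussed above; all names here are illustrative):

use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::{Arc, RwLock};

// Stand-in for the filter type discussed above.
struct BloomFilter {
    bits: Vec<AtomicU32>,
}

impl BloomFilter {
    fn contains_hashes(&self, hashes: &[u64]) -> bool {
        hashes.iter().all(|&h| {
            let h = h as usize;
            self.bits[h / 32 % self.bits.len()].load(Ordering::Relaxed) & (1 << (h % 32)) != 0
        })
    }

    fn insert_hashes(&self, hashes: &[u64]) {
        for &h in hashes {
            let h = h as usize;
            self.bits[h / 32 % self.bits.len()].fetch_or(1 << (h % 32), Ordering::Relaxed);
        }
    }
}

// Readers share the lock while writers take it exclusively, so every
// contains_hashes call sees a consistent snapshot of the whole bit
// vector. The cost is that all insertions are serialized.
struct SyncBloomFilter {
    inner: Arc<RwLock<BloomFilter>>,
}

impl SyncBloomFilter {
    fn contains_hashes(&self, hashes: &[u64]) -> bool {
        self.inner.read().unwrap().contains_hashes(hashes)
    }

    fn insert_hashes(&self, hashes: &[u64]) {
        self.inner.write().unwrap().insert_hashes(hashes);
    }
}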

A different approach is taken by Google's Guava library for Java. Their solution is neither complete nor comprehensive: it accepts many trade-offs in the data structures, in the types and methods operating on them, and relies on tests to validate that those trade-offs do not cause too much error.
(I do not like this approach, but apparently it is good enough for them.)

@chris-ha458
Contributor Author

This has led me to wonder about the choice of a bloom filter for this purpose.
At least for a document-level approach, Python with pandas/numpy is often good enough, even with predominantly Python-based data structures. Refer to: https://github.com/ChenghaoMou/text-dedup
Instead of trying to make minor changes, I think I will try to implement a multithreaded set, such as dashmap, for an exact membership filter; a sketch of what I have in mind follows.
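
Something along these lines (a rough sketch; DashSet is dashmap's sharded, thread-safe set type, and the function name here is illustrative):

use dashmap::DashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Exact membership filter: one 64-bit hash per document, stored in a
// sharded concurrent set. DashSet::insert is atomic, so the
// check-then-insert race discussed above does not arise.
fn is_duplicate(seen: &DashSet<u64>, doc_text: &str) -> bool {
    let mut hasher = DefaultHasher::new();
    doc_text.hash(&mut hasher);
    // insert() returns false if the value was already present.
    !seen.insert(hasher.finish())
}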

Usually the overhead is dozens of bytes per item, and when operating on a per-language basis that is good enough even for web-scale datasets (~1B items).

This is definitely different at the paragraph level, but it seems to me that the current code treats individual sentences (text.split('\n')) as paragraphs.
In that case the number of items to index would increase by at least an order of magnitude, which might not be as easy to deal with.

But at that granularity there are better approaches, such as suffix arrays, or text normalization to increase the chance of collision and therefore removal.

@dirkgr
Member

dirkgr commented Sep 22, 2023

The race condition is acknowledged, and I deemed it a worthwhile trade-off for performance. At typical sizes for the filter (1TB the last time I ran it), the chance of two insertions colliding in a way that matters is astronomically small. And even if a collision happens, the results are not catastrophic: it might make the wrong decision about a particular n-gram. Not a big deal. It would be a big deal if this happened systematically depending on the content, but it does not. The chance of collisions due to a straightforward hash collision, or due to the fuzzy nature of bloom filtering itself, is much higher.

It bothers me that this means the bloom filter is not deterministic, but that ship already sailed when we decided to run multiple threads at the same time. With the bloom filter approach, the order of documents always matters, and with threading, the order is not deterministic. This is going to have a much bigger effect than the occasional race condition.

@dirkgr
Member

dirkgr commented Sep 22, 2023

You're welcome to try other approaches, but consider some numbers before you try: If we're running this on all 2T tokens, we can use a 2TB bloom filter to achieve a false positive rate of about 2%. Close enough. I'd prefer a 4TB filter and get 0.1%, but those machines are hard to get in AWS (and in practice we don't run it over all of Dolma anyways).

However, if you do this with an exact membership filter, and you use only 64-bit hashes, you'd need 16TB just for the hashes alone.
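
These figures are consistent with the standard bloom filter false positive estimate p ≈ (1 - e^(-kn/m))^k; a quick back-of-the-envelope check (a sketch, not code from the repository):

// Standard estimate with k rounded from the optimum (m/n) * ln 2, where
// m is the number of bits and n the number of items.
fn bloom_fp_rate(bits_per_item: f64) -> f64 {
    let k = (bits_per_item * std::f64::consts::LN_2).round();
    (1.0 - (-k / bits_per_item).exp()).powf(k)
}

fn main() {
    // 2TB filter over 2T items = 8 bits per item -> ~2.2% false positives.
    println!("{:.4}", bloom_fp_rate(8.0));
    // 4TB filter = 16 bits per item -> ~0.05% by this estimate.
    println!("{:.5}", bloom_fp_rate(16.0));
    // Exact filter: 2e12 items * 8 bytes per 64-bit hash = 16TB of hashes.
}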

@chris-ha458
Contributor Author

Yes, I think your answer is what I was looking for, and I generally agree.

In this context, that could be addressed by documenting the trade-off somewhere in the code or documentation.
For instance, Google's Guava has a similar issue but also has robust testing implemented.
Such documentation and testing would help communicate design decisions, and would help ensure that future code changes do not introduce unexpected behavior beyond what is known and has been accepted.
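
As a sketch of the kind of test that could document this behavior (hypothetical, not from the dolma test suite): every thread checks the same bit and inserts it if unset; because the check and the insert are two separate atomic operations, more than one thread can observe the bit as unset and believe it inserted first.

use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

fn main() {
    let word = Arc::new(AtomicU32::new(0));
    let first_inserters = Arc::new(AtomicUsize::new(0));
    let barrier = Arc::new(Barrier::new(8));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let word = Arc::clone(&word);
            let first_inserters = Arc::clone(&first_inserters);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                barrier.wait(); // start all threads together to widen the race window
                // Non-atomic check-then-act, as in contains_hashes followed by an insert.
                if word.load(Ordering::Relaxed) & 1 == 0 {
                    word.fetch_or(1, Ordering::Relaxed);
                    first_inserters.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // A count above 1 exhibits the race. A lock-free fix would use the
    // return value of fetch_or (the previous word) as the membership check.
    println!("threads that saw the bit unset: {}", first_inserters.load(Ordering::Relaxed));
}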

For exact membership filters the ones i'm familiar with are not ngram or even paragraph level and more associated with document level. In this case what I had in mind was to use exact hash in per document per language context.

@chris-ha458
Contributor Author

The main reason I opened this issue in the first place was to confirm my suspicion that there was a race, considering how difficult it is to reason about concurrency. I wasn't sure whether it actually existed.

From Discord communications with @soldni and others, it seems it was identified but also considered acceptable.
I agree that, in an intuitive sense, it doesn't seem like a big problem. If and when encapsulation and implementation of bff can be done, hopefully testing investigating the effects can be implemented as well.
