optimize modulo calculation #214

Open
wants to merge 8 commits into master

Conversation

chris-ha458
Contributor

Changing this order will make it much faster, especially for larger arrays.
@chris-ha458
Contributor Author

Re: #212

How do you want me to handle xxhash importing?

@chris-ha458
Contributor Author

I'll focus on general MinHash improvements here instead of alternative hashing schemes for this PR.

@ekzhu ekzhu self-requested a review August 7, 2023 04:27
@ekzhu
Owner

ekzhu commented Aug 7, 2023

Changing this order will make it much faster, especially for larger arrays.

Thanks! Does this change lead to different hash values?

@ekzhu
Owner

ekzhu commented Aug 7, 2023

Re: #212

How do you want me to handle xxhash importing?

See my comment in: #212 (comment).

We let users choose their hash function through MinHash's init parameter.
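For reference, a hash function passed this way just needs to map bytes to an unsigned 32-bit int. A minimal sketch of that shape (the function below mirrors datasketch's default SHA-1-based 32-bit hash; substituting xxhash is what the discussion above suggests, not a committed API):

```python
import hashlib
import struct

def sha1_hash32(data: bytes) -> int:
    # Mirrors datasketch's default 32-bit hash: the first 4 bytes of
    # SHA-1, unpacked as a little-endian unsigned int. A faster hash
    # with the same signature, e.g. xxhash.xxh32(data).intdigest(),
    # could be passed in its place.
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]

# Usage (assuming datasketch is installed):
# m = MinHash(hashfunc=sha1_hash32)
```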

@ekzhu
Owner

ekzhu commented Aug 7, 2023

I'll focus on general MinHash improvements here instead of alternative hashing schemes for this PR.

Thanks! I think as a next step, we can think about a good abstraction for users to specify the permutation calculation.

@chris-ha458
Contributor Author

I have been working on this on a private branch (as well as on another implementation with different project goals), but I am having some difficulty making the changes while respecting the needs and goals of this repo.
My thought was to show you the modified version not as commits to minhash.py but as a different file, minhashX.py or similar.

Do you prefer I still make the commits directly on minhash.py, or should I show it to you separately?
(My goal is still to bring most, if not all, changes into minhash.py eventually.)

@ekzhu
Owner

ekzhu commented Aug 7, 2023

I have been working on this on a private branch (as well as on another implementation with different project goals), but I am having some difficulty making the changes while respecting the needs and goals of this repo. My thought was to show you the modified version not as commits to minhash.py but as a different file, minhashX.py or similar.

Do you prefer I still make the commits directly on minhash.py, or should I show it to you separately? (My goal is still to bring most, if not all, changes into minhash.py eventually.)

Thanks for your contribution. I think whichever way works for you is best. Please take my comments as suggestions only.

These shouldn't diverge, but they do.
Although the intermediate calculations are meaningful to compute with higher dtypes, the final hashes should be stored and operated on with the same number of bits as the original hash (32 bits).
@chris-ha458
Contributor Author

I don't know why I operated on the assumption that dividend mod Mersenne prime == dividend bitwise_and Mersenne prime.
It looks like it might be nearly right, but at least in practice, replacing the mod with a bitwise AND causes tests to fail.

Anyway, even without that assumption there are improvements to be made.

Since we start with sha1_hash32, which only has 32 bits, and ensure that the resulting numbers have 32 bits or fewer (np.bitwise_and with max_hash), there is no need for the numbers to stay np.uint64 after the permutations have been applied.

I think I can also get away with having the a, b permutation arrays as np.uint64, but I put that into separate commits.

Initial value/mask can definitely be np.uint32 as well.
max_hash: appropriately sized datatype.
@chris-ha458
Contributor Author

As you can see, this is very rudimentary. There are some assumptions that are practical yet imperfect.
One is the assumption that datatypes will be 32-bit. You have said that you would like the user to be able to choose.
How would you like me to implement that option?

Owner

@ekzhu ekzhu left a comment


Looks good! I am wondering if we can have a "playground" inside test/ to verify the equivalence of different permutation code, e.g., that the old version and the new one result in the same hash values.

@ekzhu
Owner

ekzhu commented Aug 7, 2023

We can focus on the performance improvement over existing permutation calculation in this PR. We can open a separate PR for customization and decide there.

@chris-ha458
Contributor Author

Focusing on performance improvements sounds nice.

If these changes prove fruitful, we can explore customization options as necessary.

As for playground that sounds like a good idea.

The text-dedup repo uses the pinecone dataset for precision, recall, and F1 accuracy measurements, as well as runtime.

It might be a better idea to pause performance improvements until we have a good benchmark to compare with.

That is a much larger effort though, and I don't know how you would like to proceed. Do you have any thoughts or pointers on how to implement a benchmark here?

@chris-ha458
Contributor Author

So, this needs some explanation.

The reason Mersenne primes are used in the first place is that they have a very useful property.
If M = 2^k - 1 is prime (a Mersenne prime), then n mod M == (n & M) + (n >> k), up to one conditional subtraction of M when the sum reaches M.
When the "Universal classes of hash functions" paper was written, modulo was orders of magnitude slower than bitwise operations, so directly calculating the left-hand side was much slower than computing the right-hand side.

Thanks to modern vectorized instructions and byte-based primitives, this is less true.
Most of my testing showed positive results, but whether that holds in general and under various conditions remains to be seen (the playground/benchmark idea would be useful here).
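The identity is easy to check numerically. A sketch, assuming k = 61 (the Mersenne prime datasketch uses); the conditional subtraction handles the case where the folded sum reaches M:

```python
import numpy as np

k = np.uint64(61)
m = np.uint64((1 << 61) - 1)  # Mersenne prime 2^61 - 1

def mod_mersenne(n):
    # For a Mersenne prime m = 2^k - 1:
    #   n mod m == (n & m) + (n >> k), followed by at most one
    #   conditional subtraction of m (for n < 2^64 the sum is < 2m).
    r = (n & m) + (n >> k)
    return np.where(r >= m, r - m, r)

n = np.random.default_rng(42).integers(0, 1 << 64, size=10_000, dtype=np.uint64)
assert np.array_equal(mod_mersenne(n), n % m)
```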

If you feel that this code is too ugly, unperformant, unmaintainable, etc., that is 100% fine.

Then my argument becomes: there is no reason to use Mersenne primes in that case. We can use larger primes,
(2**64 - 59) for instance. We would get exactly the same performance as before and recover the range lost to the Mersenne modulus, which is smaller than it could be by a factor of 8.
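A correctness-only sketch of the larger-prime alternative. Since (a * hv + b) can exceed 64 bits, this uses Python's arbitrary-precision ints rather than NumPy; a vectorized version would need 128-bit intermediates. 2**64 - 59 is the largest prime below 2**64:

```python
P = (1 << 64) - 59        # largest prime below 2^64
MERSENNE = (1 << 61) - 1  # Mersenne prime currently used
MAX_HASH = (1 << 32) - 1

def perm_hash_large_prime(hv: int, a: int, b: int) -> int:
    # Plain Python ints sidestep the uint64 overflow that
    # (a * hv + b) % P would hit in NumPy.
    return ((a * hv + b) % P) & MAX_HASH

# The modulus gains roughly a factor of 8 in range over 2^61 - 1:
print(P / MERSENNE)  # ≈ 8.0
```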

@ekzhu
Owner

ekzhu commented Aug 9, 2023

Focusing on performance improvements sounds nice.

If these changes prove fruitful, we can explore customization options as necessary.

As for playground that sounds like a good idea.

The text-dedup repo uses the pinecone dataset for precision, recall, and F1 accuracy measurements, as well as runtime.

It might be a better idea to pause performance improvements until we have a good benchmark to compare with.

That is a much larger effort though, and I don't know how you would like to proceed. Do you have any thoughts or pointers on how to implement a benchmark here?

We have a microbenchmark for MinHash in the benchmark directory, which can be further improved with more realistic workloads -- much like those you have been using. https://github.com/ekzhu/datasketch/blob/master/benchmark/sketches/minhash_benchmark.py

@ekzhu
Owner

ekzhu commented Aug 9, 2023

Then my argument becomes: there is no reason to use Mersenne primes in that case. We can use larger primes,
(2**64 - 59) for instance. We would get exactly the same performance as before and recover the range lost to the Mersenne modulus, which is smaller than it could be by a factor of 8.

This is an interesting case to make. I think we can benefit from a simple test module in test/test_hashing.py to play around with various hashing and modulo calculations.

As for the code change, I think if this new trick helps with performance over np.mod, then it can be the default. I am mostly worried about backward compatibility.

@chris-ha458
Contributor Author

As for the code change, I think if this new trick helps with performance over np.mod, then it can be the default. I am mostly worried about backward compatibility.

  1. np.uint32 change -> numerically there is complete backward compatibility, but due to the type change, compatibility is broken for practical usages (especially loading a previously pickled MinHash and using it), except perhaps by also changing the type of max_hash, but that is redundant since we recast back to np.uint64 anyway.
  2. Mersenne trick -> if we decide to keep np.uint64, there is no backward compatibility issue; it is bit-for-bit the same, including the type. The performance gain is marginal, though (this trick was developed before modern CPUs and autovectorization over data structures).
  3. (not implemented yet) keeping np.mod but using higher primes -> although compatible datatype-wise, none of the hash values are preserved, even when we keep the same seed. It would not be backwards compatible.
  4. changing minor ops (things like using np.full instead of np.ones, changing overloaded ops to explicit function calls, etc.) -> bit-for-bit the same, with marginal performance benefits (it skips some Python dispatches, but those are really fast anyway).

I have thought about finding a way to preserve both backward compatibility and performance through selectable options, but it seems very difficult. It is not that the technical side is hard; rather, the code becomes very difficult to read and to understand what is happening.

I think that is why I return to my original proposal: we keep the backwards-compatible minhash.py as it is, and add something like minhash_faster.py (I don't insist on the name; it could be minhash_incompatible.py or anything else) that aggressively applies accelerations in both calculations and datatypes, without backwards compatibility as a goal.

Meanwhile, I will try to implement the test/test_hashing.py you discussed.
It would capture the core computation of the permutation process,
phv = np.bitwise_and(np.mod((a * hv + b), _mersenne_prime), _max_hash)
and vary each computation and datatype, comparing numerical results and speed.

@ekzhu
Owner

ekzhu commented Aug 17, 2023

Thanks for your input. This is very valuable. I think we can offer a simple alternative option in the MinHash constructor (e.g., fast_perm: bool = False); when it is True, it uses all the advanced changes you proposed, including using the larger prime. For the original code path, we simply apply all the backward-compatible changes that improve performance.

This way there are only two code paths: (1) one that aggressively adopts the latest permutation trick, without worrying about backward compatibility; and (2) the classic MinHash permutation algorithm, backward compatible but performance-optimized.

In the documentation we can explain the option as a breaking change for older MinHashes -- "use with care".
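A rough sketch of that two-path dispatch. Only the fast_perm flag name comes from the comment above; the class name, RNG setup, and fast-path details (larger prime, Python-int math, uint32 output) are illustrative assumptions, not datasketch code:

```python
import numpy as np

_MERSENNE = np.uint64((1 << 61) - 1)
_LARGE_P = (1 << 64) - 59  # largest prime below 2^64
_MAX_HASH = np.uint64((1 << 32) - 1)

class MinHashSketch:
    """Illustrative only: dispatches between the classic, backward-
    compatible permutation and an aggressive fast path."""

    def __init__(self, num_perm: int = 128, fast_perm: bool = False, seed: int = 1):
        rng = np.random.default_rng(seed)
        self.fast_perm = fast_perm
        self.a = rng.integers(1, 1 << 61, size=num_perm, dtype=np.uint64)
        self.b = rng.integers(0, 1 << 61, size=num_perm, dtype=np.uint64)

    def _permute(self, hv: int):
        if self.fast_perm:
            # Fast path: larger prime, Python-int math, uint32 output.
            phv = [((int(ai) * hv + int(bi)) % _LARGE_P) & 0xFFFFFFFF
                   for ai, bi in zip(self.a, self.b)]
            return np.array(phv, dtype=np.uint32)
        # Classic path: bit-compatible with the existing uint64 code.
        hv64 = np.uint64(hv)
        return np.bitwise_and((self.a * hv64 + self.b) % _MERSENNE, _MAX_HASH)
```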

@ekzhu
Owner

ekzhu commented Nov 29, 2023

@chris-ha458 are you still interested in this PR?
