Faster kmer hashing #71

Open
ffranr opened this issue Aug 31, 2017 · 14 comments


ffranr commented Aug 31, 2017

This was first mentioned on Slack by Sorina: https://iqbalgroup.slack.com/archives/C5MANUB17/p1504131606000222

This link shows benchmarking of different hash functions for random and sequential integer inserts:
http://incise.org/hash-table-benchmarks.html

Gramtools currently uses the Boost hash function. The benchmarking results show that it is the slowest algorithm once more than 20 million integers are inserted. When handling the human PRG, gramtools attempts to index 60 million kmers.

Several other libraries provide hashing functions which are apparently faster than our current implementation.


sm0179 commented Aug 31, 2017

duplicate of #47

iqbal-lab commented:

The Bob Jenkins hash works fine in Cortex. BooPHF and related libraries use a minimal perfect hash for much bigger sets and are fast.


ffranr commented Jun 5, 2018

Copied over the post from the duplicate. @iqbal-lab said:

> I don't know the exact RAM usage of our kmer hash, but for future reference, we could replace it with https://github.com/rob-p/BooM, which is a minimal perfect hash map based on Rayan's BBHash and uses about 3 bits per element, apparently. So for human, if we want to store 13-mers, we need to store 66 million kmers; this would only take ~25 MB.

Closing duplicate #47.

ffranr removed the duplicate label on Jun 5, 2018
ffranr changed the title from "Performance enhancement for kmer hashing" to "Faster kmer hashing" on Jun 5, 2018

ffranr commented Jun 6, 2018

The hash function we currently use is:

```cpp
boost::hash_range(hash, seq.begin(), seq.end());
```
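
For context, here is a minimal self-contained sketch of how such a call is typically wrapped into a hash functor for std::unordered_map. The Kmer alias and the per-base encoding are illustrative assumptions, not gramtools' actual types:

```cpp
// Minimal sketch (assumed types): hashing a kmer stored as a vector
// of small integers with Boost's range-combining hash, and plugging
// the functor into std::unordered_map.
#include <boost/functional/hash.hpp>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Kmer = std::vector<uint8_t>;  // one element per base (assumption)

struct BoostKmerHash {
    std::size_t operator()(const Kmer& seq) const {
        std::size_t hash = 0;
        boost::hash_range(hash, seq.begin(), seq.end());
        return hash;
    }
};

// Hypothetical index type; gramtools' actual value type will differ.
using KmerIndex = std::unordered_map<Kmer, int, BoostKmerHash>;
```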


leoisl commented Feb 15, 2019

Hello,

Just restarting this discussion, since there is an interesting new related paper, and it is also a nice way to start looking around the gramtools code and issues.

I guess https://github.com/rob-p/BooM is a nice option; it makes BBHash an associative index. The problem is that it stores all the keys in a std::vector<std::pair<KeyT, ValueT>> to filter out false positives, so it can be heavy in memory. However, we do have an FM-index backing this MPHF in gramtools, so we could filter out FPs with one lookup in an associated SearchState.

There is also now the option of BLight (https://github.com/Malfoy/Blight, https://www.biorxiv.org/content/10.1101/546309v2), which wraps BBHash in a more lightweight associative index, but it still increases memory usage to ~28 bits per k-mer. Good to keep in mind, but I guess it might be better to use the FM-index as the backup.

I guess this could also translate into faster queries than the current implementation, which uses std::unordered_map with the Boost hash. With BBHash, a plain std::vector is enough, since there are no collisions.

So, I'd vote for a custom BBHash wrapper, similar to https://github.com/rob-p/BooM, but backed by gramtools' (already constructed) FM-index to filter out FPs. Do you see any issues, or better options than this one?
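
A rough sketch of what that wrapper could look like, assuming BBHash's boomphf::mphf interface (https://github.com/rizkg/BBHash) and with the FM-index membership check left as a stub, since it depends on gramtools internals:

```cpp
// Sketch of the proposal, assuming BBHash's boomphf::mphf interface.
// The FM-index check is a placeholder: in gramtools it would be a
// backward search over the PRG rather than storing the keys (BooM-style).
#include <BooPHF.h>
#include <cstdint>
#include <vector>

using Kmer = uint64_t;  // 2-bit-packed kmer, k <= 32 (assumption)
using Mphf = boomphf::mphf<Kmer, boomphf::SingleHashFunctor<Kmer>>;

template <typename ValueT>
class FmBackedKmerMap {
  public:
    FmBackedKmerMap(const std::vector<Kmer>& kmers,
                    const std::vector<ValueT>& values)
        : mphf_(kmers.size(), kmers, /*num_thread=*/1, /*gamma=*/2.0),
          slots_(kmers.size()) {
        // Reorder values into MPHF slot order, as BooM does, so that
        // lookup(kmer) indexes straight into slots_ with no collisions.
        for (std::size_t i = 0; i < kmers.size(); ++i)
            slots_[mphf_.lookup(kmers[i])] = values[i];
    }

    // Returns nullptr for kmers absent from the index: the MPHF maps
    // alien keys to arbitrary slots, so membership is confirmed with
    // the FM-index instead of storing the keys themselves.
    const ValueT* find(Kmer kmer) {
        uint64_t idx = mphf_.lookup(kmer);
        if (idx >= slots_.size() || !fm_index_contains(kmer))
            return nullptr;
        return &slots_[idx];
    }

  private:
    bool fm_index_contains(Kmer) { return true; }  // hypothetical stub
    Mphf mphf_;
    std::vector<ValueT> slots_;
};
```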


ekg commented Feb 15, 2019

A paired perfect hash function, linking two perfect hashes of the same data built with different base hash functions, might provide a set with membership queries in a fixed number of bits per entry regardless of key length. It would have a theoretically low false-positive rate, assuming random lookups.

The construction begins by making two minimal perfect hash tables, each over a different hash of the keys. We then link the tables together by recording, for each entry in one table, which entry it maps to in the other. The total structure should take N(2b + log N) bits, where b is the number of bits per entry in the perfect hash tables and N is the number of elements. Assuming billions of keys and 2 bits per entry in the perfect hash tables, this would be larger than Blight (I still need to read that paper). It's not clear whether performance would be better or worse. It would really help for large numbers of long keys. If the hash projections are random, the false-positive rate should be something like 1/N.
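
As a rough worked number against Blight's figure: with N ≈ 10^9 keys and b = 2, this is 2·2 + log2(10^9) ≈ 4 + 30 = 34 bits per key, around 4.3 GB in total, which is indeed above Blight's ~28 bits per k-mer.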


ekg commented Feb 15, 2019

Also, definitely test out https://github.com/skarupke/flat_hash_map and https://github.com/greg7mdp/sparsepp.

I'd love to hear about any other alternatives. In my application I am mostly limited by lookup speed and can handle a larger construction cost. That said, I found perfect hashing to be a little slower than I expected. If you find or run any benchmarks, I'd be interested.
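
For reference, both suggested tables keep std::unordered_map's interface, so a benchmark mostly amounts to a type swap. A minimal sketch, reusing the BoostKmerHash functor from the earlier snippet (header paths are the libraries' defaults and may differ per setup):

```cpp
// Sketch: both suggested tables are drop-in replacements for
// std::unordered_map, so benchmarking them is mostly a type swap.
#include "flat_hash_map.hpp"         // https://github.com/skarupke/flat_hash_map
#include <sparsepp/spp.h>            // https://github.com/greg7mdp/sparsepp
#include <boost/functional/hash.hpp>
#include <cstdint>
#include <vector>

using Kmer = std::vector<uint8_t>;   // as in the earlier sketch (assumption)

struct BoostKmerHash {               // same functor as the earlier sketch
    std::size_t operator()(const Kmer& seq) const {
        std::size_t hash = 0;
        boost::hash_range(hash, seq.begin(), seq.end());
        return hash;
    }
};

// The only change relative to std::unordered_map is the template name:
template <typename V>
using FlatIndex = ska::flat_hash_map<Kmer, V, BoostKmerHash>;
template <typename V>
using SparseIndex = spp::sparse_hash_map<Kmer, V, BoostKmerHash>;
```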


leoisl commented Feb 15, 2019

> A paired perfect hash function, linking two perfect hashes of the same data built with different base hash functions, might provide a set with membership queries in a fixed number of bits per entry regardless of key length. It would have a theoretically low false-positive rate, assuming random lookups.
>
> The construction begins by making two minimal perfect hash tables, each over a different hash of the keys. We then link the tables together by recording, for each entry in one table, which entry it maps to in the other. The total structure should take N(2b + log N) bits, where b is the number of bits per entry in the perfect hash tables and N is the number of elements. Assuming billions of keys and 2 bits per entry in the perfect hash tables, this would be larger than Blight (I still need to read that paper). It's not clear whether performance would be better or worse. It would really help for large numbers of long keys. If the hash projections are random, the false-positive rate should be something like 1/N.

Indeed, something interesting to test. I would guess queries would be very fast with this. In real datasets N should be large, so 1/N is fine.

However, in the example discussed above:

> So for human, if we want to store 13-mers, we need to store 66 million kmers

this would take 251 MB (using b = 3 from BBHash), which would be even more than storing the 66M k-mers explicitly in a 2-bit encoding (228 MB). I don't know the right kmer length to index in gramtools (really new to it), but for low values of k or N, https://github.com/pierrepeterlongo/quasi_dictionary seems a good alternative.
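
A quick check on that figure: N(2b + log2 N) = 66·10^6 × (2·3 + 26) bits ≈ 2.1·10^9 bits, roughly 250 MB, consistent with the number above.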


leoisl commented Feb 15, 2019

> Also, definitely test out https://github.com/skarupke/flat_hash_map and https://github.com/greg7mdp/sparsepp.
>
> I'd love to hear about any other alternatives. In my application I am mostly limited by lookup speed and can handle a larger construction cost. That said, I found perfect hashing to be a little slower than I expected. If you find or run any benchmarks, I'd be interested.

Thanks for the suggestions! I shall take a look at these and will keep you posted on any benchmarks. Please feel free to share more about hashing!


leoisl commented Feb 15, 2019

Update: I had in mind that https://github.com/rob-p/BooM was not an associative-index wrapper for BBHash, but it is... I have edited #71 (comment) accordingly.


ekg commented Feb 16, 2019

The pairPMHF wouldn't be too useful for storing 13-mers, because the keys themselves are smaller than the mapping needed between the two PMHFs. But if the keys are very big, the savings would become significant. I don't know what kmer size that would have to be. Maybe all the 71-mers in a read set would be an interesting thing to store in a structure like this.

The log N factor could be tuned to a given false-positive rate. For instance, if you could accept an FP rate of 1/256, you'd need only 8 bits per entry.
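
A rough worked number, reusing b = 2 from the earlier comment: an 8-bit checksum in place of the full log N linkage gives 2·2 + 8 = 12 bits per entry, versus ~34 bits per entry at N ≈ 10^9.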


ekg commented Feb 19, 2019

Ah, so the quasi-dictionary is sufficient. What a silly idea I had.

A checksum or other hash into any small number of bits would provide the same guarantees. This checksum could be made from another PMHF, which is interesting but maybe overly complicated.


leoisl commented Feb 20, 2019

Hello!

Sorry for the delay in answering! The pairMPHF is nice to keep in mind, and a checksum from another MPHF could yield some interesting results; it could be useful in some situations, I guess.

I thought about using a classical Bloom filter (BF) together with an MPHF to get membership queries with low FPs, but the Quasi Dictionary (QD) uses less memory than a classical Bloom filter at the same FP rate, and it is probably faster, since only one hash is needed.

I guess the lightest structure would still be an MPHF backed by the FM-index (which is already constructed). This would add only ~3 bits/kmer. The problem is that filtering out FPs requires k queries to the wavelet tree, which can be costly... fewer queries could be made at the cost of admitting some FPs.

The QD is probably faster than the previous approach, at the cost of more space (~(3+f) bits/kmer, with an FP rate of 1/2^f).
https://github.com/rob-p/BooM is effectively a QD where f = 2k (it stores the full keys), so I think we can skip it.
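
A hedged sketch of the QD layout being described: an MPHF assigns each true key a slot, and an f-bit fingerprint stored in that slot rejects alien keys with probability 1 − 1/2^f. The MPHF is abstracted behind a std::function stand-in here, and f = 12 is just an example:

```cpp
// Sketch of the quasi-dictionary idea: MPHF slot + f-bit fingerprint,
// giving membership queries at ~(3 + f) bits per kmer with FP rate 1/2^f.
#include <cstdint>
#include <functional>
#include <vector>

struct QuasiDictionary {
    static constexpr unsigned f = 12;               // fingerprint width (example)
    std::function<uint64_t(uint64_t)> mphf_lookup;  // stand-in for a BBHash MPHF
    std::vector<uint16_t> fingerprints;             // f bits used per entry

    // splitmix64: a cheap hash, independent of the MPHF, for fingerprints.
    static uint64_t mix(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    bool contains(uint64_t kmer) const {
        uint64_t idx = mphf_lookup(kmer);  // arbitrary slot for alien keys
        return idx < fingerprints.size() &&
               fingerprints[idx] == (mix(kmer) & ((1u << f) - 1));
    }
};
```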

There are other structures to check as well (probably only a benchmark will give us concrete answers).

I shall keep searching for additional alternatives...

iqbal-lab commented:

I look forward to chatting about this when I get back next week. However, we should be driven by profiling: I think enumeration of kmers in the graph is more of a problem than the hash.
