Faster kmer hashing #71

Open
ffranr opened this issue Aug 31, 2017 · 14 comments


ffranr commented Aug 31, 2017

This was first mentioned on Slack by Sorina: https://iqbalgroup.slack.com/archives/C5MANUB17/p1504131606000222

This link shows benchmarking of different hash functions for random and sequential integer inserts:
http://incise.org/hash-table-benchmarks.html

Gramtools currently uses the Boost hash function. The benchmarking results show that it is the slowest algorithm once more than 20 million integers are inserted. When handling the human PRG, gramtools attempts to index 60 million kmers.

Several other libraries provide hashing functions which are apparently faster than our current implementation.


sm0179 commented Aug 31, 2017

duplicate of #47

iqbal-lab commented:

The Bob Jenkins hash works fine in Cortex. BooPHF and related libraries use a minimal perfect hash for much bigger sets and are fast.


ffranr commented Jun 5, 2018

Copied over the post from the duplicate. @iqbal-lab said:

> I don't know the exact RAM usage of our kmer hash, but for future reference, we could replace it with https://github.com/rob-p/BooM, which is a minimal perfect hash map based on Rayan's BBHash and uses about 3 bits per element, apparently. So for human, if we want to store 13-mers, we need to store 66 million kmers; this would only take ~25 MB.

Closing duplicate #47.

ffranr removed the duplicate label on Jun 5, 2018
ffranr changed the title from "Performance enhancement for kmer hashing" to "Faster kmer hashing" on Jun 5, 2018

ffranr commented Jun 6, 2018

The hash function we currently use is:

```cpp
boost::hash_range(hash, seq.begin(), seq.end());
```
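
For context, here is a minimal self-contained sketch of how such a call is typically wrapped into a hash functor for std::unordered_map. The Kmer alias and the per-base encoding are illustrative assumptions, not gramtools' actual types:

```cpp
// Minimal sketch (assumed types): hashing a kmer stored as a vector
// of small integers with Boost's range-combining hash, and plugging
// the functor into std::unordered_map.
#include <boost/functional/hash.hpp>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Kmer = std::vector<uint8_t>;  // one element per base (assumption)

struct BoostKmerHash {
    std::size_t operator()(const Kmer& seq) const {
        std::size_t hash = 0;
        boost::hash_range(hash, seq.begin(), seq.end());
        return hash;
    }
};

// Hypothetical index type; gramtools' actual value type will differ.
using KmerIndex = std::unordered_map<Kmer, int, BoostKmerHash>;
```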


leoisl commented Feb 15, 2019

Hello,

Just restarting this discussion, since there is an interesting new related paper, and it is also a nice way to start looking around the gramtools code and issues.

I guess https://github.com/rob-p/BooM is a nice option; it makes BBHash an associative index. The problem is that it stores all the keys in a std::vector<std::pair<KeyT, ValueT>> to filter out false positives, so it can be heavy in memory. However, we do have an FM-index backing this MPHF in gramtools, so we could filter out FPs with one lookup in an associated SearchState.

There is also now the option of BLight (https://github.com/Malfoy/Blight, https://www.biorxiv.org/content/10.1101/546309v2), which wraps BBHash in a more lightweight associative index, but it still increases memory usage to ~28 bits per k-mer. Good to keep in mind, but I guess it might be better to use the FM-index as the backup.

I guess this could also translate into faster queries than the current implementation, which uses std::unordered_map with the Boost hash. With BBHash, a plain std::vector is enough, since there are no collisions.

So, I'd vote for a custom BBHash wrapper, similar to https://github.com/rob-p/BooM, but backed by gramtools' (already constructed) FM-index to filter out FPs. Do you see any issues, or better options than this one?
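
A rough sketch of what that wrapper could look like, assuming BBHash's boomphf::mphf interface (https://github.com/rizkg/BBHash) and with the FM-index membership check left as a stub, since it depends on gramtools internals:

```cpp
// Sketch of the proposal, assuming BBHash's boomphf::mphf interface.
// The FM-index check is a placeholder: in gramtools it would be a
// backward search over the PRG rather than storing the keys (BooM-style).
#include <BooPHF.h>
#include <cstdint>
#include <vector>

using Kmer = uint64_t;  // 2-bit-packed kmer, k <= 32 (assumption)
using Mphf = boomphf::mphf<Kmer, boomphf::SingleHashFunctor<Kmer>>;

template <typename ValueT>
class FmBackedKmerMap {
  public:
    FmBackedKmerMap(const std::vector<Kmer>& kmers,
                    const std::vector<ValueT>& values)
        : mphf_(kmers.size(), kmers, /*num_thread=*/1, /*gamma=*/2.0),
          slots_(kmers.size()) {
        // Reorder values into MPHF slot order, as BooM does, so that
        // lookup(kmer) indexes straight into slots_ with no collisions.
        for (std::size_t i = 0; i < kmers.size(); ++i)
            slots_[mphf_.lookup(kmers[i])] = values[i];
    }

    // Returns nullptr for kmers absent from the index: the MPHF maps
    // alien keys to arbitrary slots, so membership is confirmed with
    // the FM-index instead of storing the keys themselves.
    const ValueT* find(Kmer kmer) {
        uint64_t idx = mphf_.lookup(kmer);
        if (idx >= slots_.size() || !fm_index_contains(kmer))
            return nullptr;
        return &slots_[idx];
    }

  private:
    bool fm_index_contains(Kmer) { return true; }  // hypothetical stub
    Mphf mphf_;
    std::vector<ValueT> slots_;
};
```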


ekg commented Feb 15, 2019

A paired perfect hash function, linking two perfect hashes of the same data built with different base hash functions, might provide a set with membership queries in a fixed number of bits per entry regardless of key length. It would have a theoretically low false-positive rate, assuming random lookups.

The construction begins by making two minimal perfect hash tables, each over a different hash of the keys. We then link the tables together by recording, for each entry in one table, which entry it maps to in the other. The total structure should take N(2b + log N) bits, where b is the number of bits per entry in the perfect hash tables and N is the number of elements. Assuming billions of keys and 2 bits per entry in the perfect hash tables, this would be larger than Blight (I still need to read that paper). It's not clear whether performance would be better or worse. It would really help for large numbers of long keys. If the hash projections are random, the false-positive rate should be something like 1/N.
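
As a rough worked number against Blight's figure: with N ≈ 10^9 keys and b = 2, this is 2·2 + log2(10^9) ≈ 4 + 30 = 34 bits per key, around 4.3 GB in total, which is indeed above Blight's ~28 bits per k-mer.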


ekg commented Feb 15, 2019

Also, definitely test out https://github.com/skarupke/flat_hash_map and https://github.com/greg7mdp/sparsepp.

I'd love to hear about any other alternatives. In my application I am mostly limited by lookup speed and can handle a larger construction cost. That said, I found perfect hashing to be a little slower than I expected. If you find or run any benchmarks, I'd be interested.
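
For reference, both suggested tables keep std::unordered_map's interface, so a benchmark mostly amounts to a type swap. A minimal sketch, reusing the BoostKmerHash functor from the earlier snippet (header paths are the libraries' defaults and may differ per setup):

```cpp
// Sketch: both suggested tables are drop-in replacements for
// std::unordered_map, so benchmarking them is mostly a type swap.
#include "flat_hash_map.hpp"         // https://github.com/skarupke/flat_hash_map
#include <sparsepp/spp.h>            // https://github.com/greg7mdp/sparsepp
#include <boost/functional/hash.hpp>
#include <cstdint>
#include <vector>

using Kmer = std::vector<uint8_t>;   // as in the earlier sketch (assumption)

struct BoostKmerHash {               // same functor as the earlier sketch
    std::size_t operator()(const Kmer& seq) const {
        std::size_t hash = 0;
        boost::hash_range(hash, seq.begin(), seq.end());
        return hash;
    }
};

// The only change relative to std::unordered_map is the template name:
template <typename V>
using FlatIndex = ska::flat_hash_map<Kmer, V, BoostKmerHash>;
template <typename V>
using SparseIndex = spp::sparse_hash_map<Kmer, V, BoostKmerHash>;
```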


leoisl commented Feb 15, 2019

> A paired perfect hash function, linking two perfect hashes of the same data built with different base hash functions, might provide a set with membership queries in a fixed number of bits per entry regardless of key length. It would have a theoretically low false-positive rate, assuming random lookups.
>
> The construction begins by making two minimal perfect hash tables, each over a different hash of the keys. We then link the tables together by recording, for each entry in one table, which entry it maps to in the other. The total structure should take N(2b + log N) bits, where b is the number of bits per entry in the perfect hash tables and N is the number of elements. Assuming billions of keys and 2 bits per entry in the perfect hash tables, this would be larger than Blight (I still need to read that paper). It's not clear whether performance would be better or worse. It would really help for large numbers of long keys. If the hash projections are random, the false-positive rate should be something like 1/N.

Indeed, something interesting to test. I would guess queries would be very fast with this. In real datasets N should be large, so 1/N is fine.

However, in the example discussed above:

> So for human, if we want to store 13-mers, we need to store 66 million kmers

this would take 251 MB (using b = 3 from BBHash), which would be even more than storing the 66M k-mers explicitly in a 2-bit encoding (228 MB). I don't know the right kmer length to index in gramtools (really new to it), but for low values of k or N, https://github.com/pierrepeterlongo/quasi_dictionary seems a good alternative.
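
A quick check on that figure: N(2b + log2 N) = 66·10^6 × (2·3 + 26) bits ≈ 2.1·10^9 bits, roughly 250 MB, consistent with the number above.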


leoisl commented Feb 15, 2019

> Also, definitely test out https://github.com/skarupke/flat_hash_map and https://github.com/greg7mdp/sparsepp.
>
> I'd love to hear about any other alternatives. In my application I am mostly limited by lookup speed and can handle a larger construction cost. That said, I found perfect hashing to be a little slower than I expected. If you find or run any benchmarks, I'd be interested.

Thanks for the suggestions! I shall take a look at these and will keep you posted on any benchmarks. Please feel free to share more about hashing!


leoisl commented Feb 15, 2019

Update: I had in mind that https://github.com/rob-p/BooM was not an associative-index wrapper for BBHash, but it is... I have edited #71 (comment) accordingly.


ekg commented Feb 16, 2019

The pairPMHF wouldn't be too useful for storing 13-mers, because the keys themselves are smaller than the mapping needed between the two PMHFs. But if the keys are very big, the savings would become significant. I don't know what kmer size that would have to be. Maybe all the 71-mers in a read set would be an interesting thing to store in a structure like this.

The log N factor could be tuned to a given false-positive rate. For instance, if you could accept an FP rate of 1/256, you'd need only 8 bits per entry.
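
A rough worked number, reusing b = 2 from the earlier comment: an 8-bit checksum in place of the full log N linkage gives 2·2 + 8 = 12 bits per entry, versus ~34 bits per entry at N ≈ 10^9.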


ekg commented Feb 19, 2019

Ah, so the quasi-dictionary is sufficient. What a silly idea I had.

A checksum or other hash into any small number of bits would provide the same guarantees. This checksum could be made from another PMHF, which is interesting but maybe overly complicated.


leoisl commented Feb 20, 2019

Hello!

Sorry for the delay in answering! The pairMPHF is nice to keep in mind, and a checksum from another MPHF could yield some interesting results; it could be useful in some situations, I guess.

I thought about using a classical Bloom filter (BF) together with an MPHF to get membership queries with low FPs, but the Quasi Dictionary (QD) uses less memory than a classical Bloom filter at the same FP rate, and it is probably faster, since only one hash is needed.

I guess the lightest structure would still be an MPHF backed by the FM-index (which is already constructed). This would add only ~3 bits/kmer. The problem is that filtering out FPs requires k queries to the wavelet tree, which can be costly... fewer queries could be made at the cost of admitting some FPs.

The QD is probably faster than the previous approach, at the cost of more space (~(3+f) bits/kmer, with an FP rate of 1/2^f).
https://github.com/rob-p/BooM is effectively a QD where f = 2k (it stores the full keys), so I think we can skip it.
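
A hedged sketch of the QD layout being described: an MPHF assigns each true key a slot, and an f-bit fingerprint stored in that slot rejects alien keys with probability 1 − 1/2^f. The MPHF is abstracted behind a std::function stand-in here, and f = 12 is just an example:

```cpp
// Sketch of the quasi-dictionary idea: MPHF slot + f-bit fingerprint,
// giving membership queries at ~(3 + f) bits per kmer with FP rate 1/2^f.
#include <cstdint>
#include <functional>
#include <vector>

struct QuasiDictionary {
    static constexpr unsigned f = 12;               // fingerprint width (example)
    std::function<uint64_t(uint64_t)> mphf_lookup;  // stand-in for a BBHash MPHF
    std::vector<uint16_t> fingerprints;             // f bits used per entry

    // splitmix64: a cheap hash, independent of the MPHF, for fingerprints.
    static uint64_t mix(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    bool contains(uint64_t kmer) const {
        uint64_t idx = mphf_lookup(kmer);  // arbitrary slot for alien keys
        return idx < fingerprints.size() &&
               fingerprints[idx] == (mix(kmer) & ((1u << f) - 1));
    }
};
```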

There are other structures to check as well (probably only a benchmark will give us concrete answers).

I shall keep searching for additional alternatives...

iqbal-lab commented:

I look forward to chatting about this when I get back next week. However, we should be driven by profiling: I think enumeration of kmers in the graph is more of a problem than the hash.
