-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fastmultigather returns different results from sourmash gather in Python, in PR #298 #318
Comments
Looks the like the rocksdb differences are at least partially caused by sourmash-bio/sourmash#3137. |
4 tasks
bluegenes
added a commit
that referenced
this issue
May 8, 2024
bluegenes
added a commit
that referenced
this issue
May 10, 2024
…multigather` (#298) This PR adds utilities for building full gather results file for `fastgather` and non-rocksdb `fastmultigather`, and makes full output default. - Fixes #287 - Fixes #187 - Fixes #254 - includes a local fix for #318, which means that the `fastgather` and **non-rocksdb** `fastmultigather` full output here matches sourmash gather. Issues with rocksdb gather are being handled elsewhere. ## Benchmarking | software/version | command | details | time | max RAM | | -------- | -------- | -------- | -- | -- | | branchwater v0.9.3 | `fastgather` | minimal result | <span style="color:green">**1m 47s**</span> | <span style="color:black">**14 GB**</span> | | branchwater v0.9.3-dev | `fastgather` | full result | <span style="color:green">**1m 57s**</span> | <span style="color:black">**14 GB**</span> | | branchwater v0.9.3 | `fastmultigather` | minimal result | <span style="color:red">**8m 3s**</span> | <span style="color:black">**25 GB**</span> | | branchwater v0.9.3-dev | `fastmultigather` | full result | <span style="color:red">**8m 9s**</span> | <span style="color:black">**25 GB**</span> | | branchwater v0.9.3 | `fastmultigather` | rocksdb full result | <span style="color:green">**24s**</span> | <span style="color:green">**600 MB**</span> | progress/separate PRs: - [x] Fill out `match_filename` in full results (#303; requires new sourmash core release with sourmash-bio/sourmash#3121) - [x] switch to using `KmerMinHashBTree` for hash subtraction +benchmark. Per luiz, `KmerMinHashBTree` are better for any situation where we'll be subtracting/adding hashes to a sketch #310 - [x] sourmash: make getting `Record`.filename public in order to keep match_filename and write it to full results. (sourmash-bio/sourmash#3121) - [x] remove --full-results and make full results default #327 --------- Co-authored-by: C. Titus Brown <titus@idyll.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
tl;dr the rust code for full gather results in #298 uses the current metagenome size, and not the original metagenome size, as the denominator; however,
f_unique_to_query
in Python land is calculated using the original metagenome size.observed in the process of debugging this: #298 (comment)
To reproduce:
Build
top2-matches-k31.sig.zip
:Examine/confirm:
Run sourmash gather/Python version:
Run fastmultigather:
Look at results from fastmultigather:
Compare to results from Python gather:
Note difference in
f_unique_to_query
: 0.017535 vs 0.017151.Debug me
Isolate paraburk:
and nostoc:
Subtract paraburk with gather:
Examine overlap:
Uh-oh, that matches the fastmultigather output!
BUT: if we calculate on the original query, we get the same number as Python gather:
So it looks like fastmultigather is using the current metagenome size, and not the original metagenome size, as the denominator; however,
f_unique_to_query
in Python land is calculated using the original metagenome size.From search.py / Python gather line 585:
How about fastmultigather from rocksdb, and fastgather?
Fastgather:
and rocksdb?
same. OK.
The text was updated successfully, but these errors were encountered: