Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fastmultigather returns different results from sourmash gather in Python, in PR #298 #318

Open
ctb opened this issue May 4, 2024 · 2 comments · Fixed by #319
Open

fastmultigather returns different results from sourmash gather in Python, in PR #298 #318

ctb opened this issue May 4, 2024 · 2 comments · Fixed by #319

Comments

@ctb
Copy link
Collaborator

ctb commented May 4, 2024

tl;dr the rust code for full gather results in #298 uses the current metagenome size, and not the original metagenome size, as the denominator; however, f_unique_to_query in Python land is calculated using the original metagenome size.

observed in the process of debugging this: #298 (comment)

To reproduce:

Build top2-matches-k31.sig.zip:

% sourmash sig grep 'GCF_000013645.1|GCF_000009705.1' combined-matches-k31.sig.zip -o top2matches-k31.sig.zip

Examine/confirm:

% sourmash sig describe top2matches-k31.sig.zip | grep ^signature:
signature: GCF_000009705.1 Nostoc sp. PCC 7120 = FACHB-418 strain=PCC 7120, ASM970v1
signature: GCF_000013645.1 Paraburkholderia xenovorans LB400 strain=LB400, ASM1364v1

Run sourmash gather/Python version:

% sourmash gather SRR606249.trim.k31.sig.zip top2matches-k31.sig.zip \
    -o SRR606249.x.top2matches.python.csv

Run fastmultigather:

% sourmash scripts fastmultigather SRR606249.trim.k31.sig.zip top2matches-k31.sig.zip --full
% mv SRR606249.gather.csv SRR606249.x.top2matches.fmg.csv

Look at results from fastmultigather:

% cut -d, -f10,4 SRR606249.x.top2matches.fmg.csv
f_unique_to_query,match_name
0.021896143934669286,"GCF_000013645.1 Paraburkholderia xenovorans LB400 strain=LB400
0.017535628267779244,"GCF_000009705.1 Nostoc sp. PCC 7120 = FACHB-418 strain=PCC 7120

Compare to results from Python gather:

% cut -d, -f10,4 SRR606249.x.top2matches.python.csv
f_unique_to_query,name
0.021896143934669286,"GCF_000013645.1 Paraburkholderia xenovorans LB400 strain=LB400
0.017151665627243094,"GCF_000009705.1 Nostoc sp. PCC 7120 = FACHB-418 strain=PCC 7120

Note difference in f_unique_to_query: 0.017535 vs 0.017151.

Debug me

Isolate paraburk:

% sourmash sig grep GCF_000013645.1 top2matches-k31.sig.zip  -o paraburk.sig.zip

and nostoc:

% sourmash sig grep GCF_000009705.1 top2matches-k31.sig.zip  -o nostoc.sig.zip 
% sourmash sig cat nostoc.sig.zip -o nostoc.sig.gz

Subtract paraburk with gather:

% sourmash gather SRR606249.trim.k31.sig.zip paraburk.sig.zip --output-unassigned SRR606249-minus-paraburk.k31.sig.gz

Examine overlap:

% sourmash sig overlap SRR606249-minus-paraburk.k31.sig.gz nostoc.sig.gz

== This is sourmash version 4.8.9.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded one signature each from SRR606249-minus-paraburk.k31.sig.gz and nostoc.sig.gz
first signature:
  signature filename: SRR606249-minus-paraburk.k31.sig.gz
  signature: 
  md5: f5295cfa1a0426970c415d702b1ac928
  k=31 molecule=DNA num=0 scaled=1000

second signature:
  signature filename: nostoc.sig.gz
  signature: GCF_000009705.1 Nostoc sp. PCC 7120 = FACHB-418 strain=PCC 7120, ASM970v1
  md5: 264cfdad44548ad96c4a24b6a514a877
  k=31 molecule=DNA num=0 scaled=1000

similarity:                  0.01754
first contained in second:   0.01754
second contained in first:   0.99904

number of hashes in first:   415611
number of hashes in second:  7295

number of hashes in common:  7288
only in first:               408323
only in second:              7
total (union):               415618

Uh-oh, that matches the fastmultigather output!

BUT: if we calculate on the original query, we get the same number as Python gather:

% sourmash sig overlap SRR606249.trim.k31.sig.zip.gz nostoc.sig.gz

first signature:
  signature filename: SRR606249.trim.k31.sig.zip.gz
  signature: SRR606249
  md5: 1b1b1d8e954aa9b36214fcaddf05c178
  k=31 molecule=DNA num=0 scaled=1000

second signature:
  signature filename: nostoc.sig.gz
  signature: GCF_000009705.1 Nostoc sp. PCC 7120 = FACHB-418 strain=PCC 7120, ASM970v1
  md5: 264cfdad44548ad96c4a24b6a514a877
  k=31 molecule=DNA num=0 scaled=1000

similarity:                  0.01715
first contained in second:   0.01715
second contained in first:   0.99904

number of hashes in first:   424915
number of hashes in second:  7295

number of hashes in common:  7288
only in first:               417627
only in second:              7
total (union):               424922

So it looks like fastmultigather is using the current metagenome size, and not the original metagenome size, as the denominator; however, f_unique_to_query in Python land is calculated using the original metagenome size.

From search.py / Python gather line 585:

self.f_unique_to_query = (
    len(self.gather_comparison.intersect_mh) / self.orig_query_len
)

How about fastmultigather from rocksdb, and fastgather?

Fastgather:

% sourmash scripts fastgather SRR606249.trim.k31.sig.zip top2matches-k31.sig.zip --full -o SRR606249.x.top2matches.fg.csv
% cut -d, -f4,10,14 SRR606249.x.top2matches.fg.csv
f_unique_to_query,match_name,gather_result_rank
0.021896143934669286,"GCF_000013645.1 Paraburkholderia xenovorans LB400 strain=LB400,9304000
0.017535628267779244,"GCF_000009705.1 Nostoc sp. PCC 7120 = FACHB-418 strain=PCC 7120,7288000

and rocksdb?

% sourmash scripts index top2matches-k31.sig.zip -o top2matches-k31.sig.zip.rocksdb
% sourmash scripts fastmultigather SRR606249.trim.k31.sig.zip top2matches-k31.sig.zip.rocksdb -o SRR606249.x.top2matches.fmg-rdb.csv
% f_unique_to_query,match_name,gather_result_rank
0.017535628267779244,"GCF_000009705.1 Nostoc sp. PCC 7120 = FACHB-418 strain=PCC 7120,7288000
0.021896143934669286,"GCF_000013645.1 Paraburkholderia xenovorans LB400 strain=LB400,9304000

same. OK.

@ctb
Copy link
Collaborator Author

ctb commented May 4, 2024

Suggest apply fix #319 per #318 :).

@ctb
Copy link
Collaborator Author

ctb commented May 4, 2024

Looks the like the rocksdb differences are at least partially caused by sourmash-bio/sourmash#3137.

bluegenes added a commit that referenced this issue May 8, 2024
**Note:** PR into #298

Fixes #318

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
bluegenes added a commit that referenced this issue May 10, 2024
…multigather` (#298)

This PR adds utilities for building full gather results file for `fastgather` and non-rocksdb `fastmultigather`, and makes full output default.

- Fixes #287 
- Fixes #187
- Fixes #254
- includes a local fix for #318, which means that the `fastgather` and **non-rocksdb** `fastmultigather` full output here matches sourmash gather. Issues with rocksdb gather are being handled elsewhere.

## Benchmarking

| software/version | command | details | time | max RAM |
| -------- | -------- | -------- | -- | -- |
| branchwater v0.9.3 | `fastgather` | minimal result | <span style="color:green">**1m 47s**</span> | <span style="color:black">**14 GB**</span> |
| branchwater v0.9.3-dev | `fastgather` | full result | <span style="color:green">**1m 57s**</span> | <span style="color:black">**14 GB**</span> |
| branchwater v0.9.3 | `fastmultigather` | minimal result | <span style="color:red">**8m 3s**</span> | <span style="color:black">**25 GB**</span> |
| branchwater v0.9.3-dev | `fastmultigather` | full result | <span style="color:red">**8m 9s**</span> | <span style="color:black">**25 GB**</span> |
| branchwater v0.9.3 | `fastmultigather` | rocksdb full result | <span style="color:green">**24s**</span> | <span style="color:green">**600 MB**</span> |

progress/separate PRs:
- [x] Fill out `match_filename` in full results (#303; requires new sourmash core release with sourmash-bio/sourmash#3121)
- [x] switch to using `KmerMinHashBTree` for hash subtraction +benchmark. Per luiz, `KmerMinHashBTree` are better for any situation where we'll be subtracting/adding hashes to a sketch #310
- [x] sourmash: make getting `Record`.filename public in order to keep match_filename and write it to full results. (sourmash-bio/sourmash#3121)
- [x] remove --full-results and make full results default #327 

---------

Co-authored-by: C. Titus Brown <titus@idyll.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant