RocksDB-based fastmultigather is broken. #322

Open
ctb opened this issue May 4, 2024 · 2 comments

ctb commented May 4, 2024

On latest main, the following demonstrates that fastmultigather is yielding incorrect (incomplete) results. (This was discovered as part of debugging differences in #298, but the behavior is present in the main branch.)


Extract leptothrix sig:

sourmash sig grep GCF_000019785.1 combined-matches-k31.sig.zip -o leptothrix.sig.zip

Find all overlaps with that:

sourmash search --containment --threshold=0 -n 0 leptothrix.sig.zip combined-matches-k31.sig.zip -o lepto-matches.csv

Extract to new collection:

sourmash sig cat --picklist lepto-matches.csv:name:ident combined-matches-k31.sig.zip -o lepto-matches.sig.zip

Index:

sourmash scripts index lepto-matches.sig.zip -o lepto-matches.sig.zip.rocksdb

Run fmg with resulting rocksdb:

sourmash scripts fastmultigather SRR606249.trim.k31.sig.zip lepto-matches.sig.zip.rocksdb -o SRR606249.x.lepto-matches.fmg-rdb.csv

Run fastgather with the original:

sourmash scripts fastgather SRR606249.trim.k31.sig.zip lepto-matches.sig.zip -o SRR606249.x.lepto-matches.fg.csv

Observe different output sizes: 10 matches from fmg-rdb, 13 for fg:

% wc -l SRR606249.x.lepto-matches.fg.csv SRR606249.x.lepto-matches.fmg-rdb.csv
      14 SRR606249.x.lepto-matches.fg.csv
      11 SRR606249.x.lepto-matches.fmg-rdb.csv
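Each CSV has one header line, so the `wc -l` counts of 14 and 11 correspond to 13 and 10 matches. As a minimal sketch, assuming only that each output file is a CSV with a single header row, the match count can be computed directly:

```python
import csv


def count_matches(csv_path):
    """Return the number of data rows (matches) in a CSV, excluding the header."""
    with open(csv_path, newline="") as fp:
        # DictReader consumes the first row as the header, so only data rows are counted.
        return sum(1 for _ in csv.DictReader(fp))


# Example usage (file names taken from the commands above):
#   count_matches("SRR606249.x.lepto-matches.fg.csv")
#   count_matches("SRR606249.x.lepto-matches.fmg-rdb.csv")
```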

Inspection of results, ordered by unique_intersect_bp:

fmg other
9304000 9304000 True True GCF_000013645.1 Paraburkholderia xenovorans LB400 strain=LB400, ASM1364v1
5159000 5159000 True True GCF_000195675.1 Bordetella bronchiseptica RB50 strain=RB50, ASM19567v1
5062000 5062000 True True GCF_000016425.1 Salinispora tropica CNB-440 strain=CNB-440, ASM1642v1
5050000 5050000 True True GCF_000018865.1 Chloroflexus aurantiacus J-10-fl strain=J-10-fl, ASM1886v1
4741000 4743000 False True GCF_000019785.1 Leptothrix cholodnii SP-6 strain=SP-6, ASM1978v1
???	 4741000 4743000 2000
	 GCF_000019785.1 Leptothrix cholodnii SP-6 strain=SP-6, ASM1978v1
4529000 4529000 True True GCF_000011965.2 Ruegeria pomeroyi DSS-3 strain=DSS-3, ASM1196v2
4258000 4260000 False True GCF_000373845.1 Salinispora arenicola CNS673 strain=CNS673, ASM37384v1
???	 4258000 4260000 2000
	 GCF_000373845.1 Salinispora arenicola CNS673 strain=CNS673, ASM37384v1
3199000 3202000 False True GCF_020546685.1 Deinococcus radiodurans ATCC 13939 strain=ATCC 13939, ASM2054668v1
???	 3199000 3202000 3000
	 GCF_020546685.1 Deinococcus radiodurans ATCC 13939 strain=ATCC 13939, ASM2054668v1
201000 419000 False False GCF_002799245.1 Stenotrophomonas maltophilia strain=EA1, ASM279924v1
***	 GCF_002799245.1 Stenotrophomonas maltophilia strain=EA1, ASM279924v1
	 GCA_011765705.1 Salinispora arenicola strain=BRA 172, ASM1176570v1
170000 202000 False False GCA_011765705.1 Salinispora arenicola strain=BRA 172, ASM1176570v1
***	 GCA_011765705.1 Salinispora arenicola strain=BRA 172, ASM1176570v1
	 GCF_002799245.1 Stenotrophomonas maltophilia strain=EA1, ASM279924v1
0 157000 False False XXX
***	 XXX
	 GCF_000375005.1 Salinispora arenicola CNS051 strain=CNS051, ASM37500v1
0 78000 False False XXX
***	 XXX
	 GCF_023845025.1 Salinispora arenicola strain=RJA3005, ASM2384502v1
0 51000 False False XXX
***	 XXX
	 GCF_000377605.1 Salinispora arenicola CNT859 strain=CNT859, ASM37760v1

This shows rapidly increasing disagreement further down the ranking: the numbers in the first column are the fastmultigather results, and the numbers in the second are from fastgather.

See notebook compare-fmg-limited for processing & comparison code.
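For readers without access to the notebook, a rank-by-rank comparison like the one above could be sketched as follows. This is not the notebook's code. The column names `match_name` and `unique_intersect_bp` are assumptions about the CSV layout, and `load_rows` and `compare_by_rank` are hypothetical helper names.

```python
import csv
from itertools import zip_longest


# NOTE: the default column names are assumptions, not taken from the issue;
# adjust them to match the real CSV headers.
def load_rows(csv_path, name_col="match_name", bp_col="unique_intersect_bp"):
    """Read (name, unique_intersect_bp) pairs, largest overlap first."""
    with open(csv_path, newline="") as fp:
        rows = [(r[name_col], int(float(r[bp_col]))) for r in csv.DictReader(fp)]
    return sorted(rows, key=lambda r: r[1], reverse=True)


def compare_by_rank(fmg_rows, fg_rows):
    """Pair the i-th fastmultigather row with the i-th fastgather row.

    Returns tuples of (fmg_bp, fg_bp, same_bp, same_name, fmg_name, fg_name).
    A rank missing from one side is filled with name "XXX" and 0 bp.
    """
    missing = ("XXX", 0)
    report = []
    for (fmg_name, fmg_bp), (fg_name, fg_bp) in zip_longest(
        fmg_rows, fg_rows, fillvalue=missing
    ):
        report.append(
            (fmg_bp, fg_bp, fmg_bp == fg_bp, fmg_name == fg_name, fmg_name, fg_name)
        )
    return report
```

Comparing by rank rather than by name makes it easy to see where the two orderings first diverge, which is the pattern the table above highlights.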


ctb commented May 4, 2024

Oh, and here are combined-matches-k31.sig.zip and lepto-matches.sig.zip in case you don't want to have to regenerate those!
lepto-matches.sig.zip
combined-matches-k31.sig.zip


ctb commented May 10, 2024

Related - in sourmash-bio/sourmash#3138 (comment), commit sourmash-bio/sourmash@10d5ee8, I add an analog to the Python test test_gather_metagenome. This demonstrates what is hopefully 😭 the same problem - we're getting only 6 matches in the Rust code (instead of 11), and the 6th match starts to diverge from the values we see in the Python implementation.

bluegenes pushed a commit to sourmash-bio/sourmash that referenced this issue May 10, 2024
This PR fixes an issue introduced in #2943 where we introduced a subtly
broken calculation that uses the _current_ size of the query metagenome
as the denominator for the `f_unique_to_query` calculation.

Fixes #3137

This PR also adds some commented-out test code that demonstrates
#3139 /
sourmash-bio/sourmash_plugin_branchwater#322.
That's something I haven't been able to debug, so I'd suggest fixing
that independently - I'd rather fix _a_ problem _now_, rather than
waiting until we can fix _multiple_ problems at some later indeterminate
time :).

## Notes

- [x] do we need to fix same problem in `linear.rs`? or just rename
things per #3137?
- [x] we should add some tests for this
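The denominator problem described in the commit message can be illustrated with a toy example. This is a Python sketch on plain sets of integers, not the Rust code in sourmash. The function name `fractions_unique_to_query` is hypothetical, and it assumes the intended denominator is the original query size. The commit message itself only states that using the current size was the bug.

```python
def fractions_unique_to_query(query_hashes, matches, use_current_size=False):
    """Compute a per-match fraction as matches are removed from the query in order.

    query_hashes: set of hashes in the original query.
    matches: list of hash sets, processed in the given order.
    use_current_size: if True, divide by the size of the remaining query
        (the behavior described as broken); otherwise divide by the
        original query size (assumed here to be the intended behavior).
    """
    original_size = len(query_hashes)
    remaining = set(query_hashes)
    fractions = []
    for match in matches:
        unique = remaining & match  # hashes not yet claimed by earlier matches
        denominator = len(remaining) if use_current_size else original_size
        fractions.append(len(unique) / denominator)
        remaining -= match  # the query shrinks after each match
    return fractions


query = set(range(10))
matches = [set(range(0, 5)), set(range(5, 8))]
print(fractions_unique_to_query(query, matches))
print(fractions_unique_to_query(query, matches, use_current_size=True))
```

With the original size as denominator, the fractions sum to at most 1 across all matches. With the shrinking denominator, later matches are inflated.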