Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: add abund estimation to manysearch #302

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Apr 19, 2024

Adds abundance columns average_abund, median_abund, std_abund to manysearch output.

Changed behavior:

  • Modifies manysearch to only build & write a result if we find an overlap between the query and search sigs. This was to avoid issues with abundance-related columns (average, median, std deviation calculations), but I think generally this:
  • shortens the results file to only relevant results and
  • speeds things up a tad, since we stop doing any unnecessary calculations.

To Do:

  • check vs sourmash/ mgmanysearch to ensure abund estimation is correct

Unfortunately, I can only estimate abundance info from non-rocksdb manysearch. Abund estimation for rocksdb databases would require us to estimate these values w/in sourmash, since we don't have access to which hashes match.

Note: for mgmanysearch type applications, the default threshold, 0.01` is far too stringent. Recommend setting threshold to 0 for these applications.

@bluegenes
Copy link
Contributor Author

Comparing this PR's abund manysearch vs sourmash_plugin_containment_search (mgmanysearch)

Small test dataset:

  • 100,000 query reads
  • 1000 contigs

built and run by @AnneliektH via: https://github.com/AnneliektH/2024-binning-kmer/blob/main/workflow/notebooks/notes_sourmash.ipynb

(both run with single thread)

software/version command details time max RAM
branchwater v0.9.3-dev manysearch with abundance 2s 103 MB
sourmash_plugin containment_search mgmanysearch with abundance 57s 111 MB

manysearch:

Command being timed: "sourmash scripts manysearch ../sourmash_sketches/sketch_reads/ERR1135178_QC.subset.zip ../sourmash_sketches/ERR11351_1000.zip -k 31 --scaled 100 -o ERR1135178.mn.timed.csv -c 1 -t 0"
        User time (seconds): 1.29
        System time (seconds): 0.69
        Percent of CPU this job got: 91%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 103264
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 9
        Minor (reclaiming a frame) page faults: 26718
        Voluntary context switches: 3122
        Involuntary context switches: 2879
        Swaps: 0
        File system inputs: 0
        File system outputs: 64
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

mgmanysearch:

Command being timed: "sourmash scripts mgmanysearch --against ../sourmash_sketches/sketch_reads/ERR1135178_QC.subset.zip --queries ../sourmash_sketches/ERR11351_1000.zip -k 31 --scaled 100 -o ERR1135178.timed.mgm.csv"        
        User time (seconds): 35.95
        System time (seconds): 5.88
        Percent of CPU this job got: 72%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:57.65
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 111128
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 4026
        Minor (reclaiming a frame) page faults: 312082
        Voluntary context switches: 60369
        Involuntary context switches: 605
        Swaps: 0
        File system inputs: 1021744
        File system outputs: 480
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ctb
Copy link
Collaborator

ctb commented Apr 23, 2024

wow, that's a fantastic speed improvement 😆

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants