running branchwater on large assemblies #14

jmattock5 opened this issue Jan 23, 2024 · 5 comments
@jmattock5

Hello,
Thanks for developing such a great tool. I've been trying to run branchwater on some whole-metagenome assemblies that are quite large (0.3G-1G). When I upload even the smaller ones and submit them, I don't get any output. I've tried leaving a couple with the tab open for ~12 hours, to no avail.
If I leave them for long enough will they eventually complete?
Thanks!
Jenny

@ctb

ctb commented Jan 23, 2024

hi @jmattock5, I don't know offhand what the timeout is before the branchwater backend gives up, but I can give you some insight into why it is taking so long:

you're running into the problem that the runtime for branchwater-web scales linearly with the size of the query: a 10 MB genome takes twice as long to query as a 5 MB genome, so a 300 MB assembly would take on the order of 60x longer. That's why it's slow.

(Interestingly, the database functionality underlying this scaling is what makes branchwater possible; handling large queries against a large database is much harder!)

@jmattock5

Ok, thanks for the explanation. I'll leave it running for longer and see if anything happens.

Would it be possible to run this locally instead? I have sourmash_plugin_branchwater installed; is the database that branchwater uses shareable?

Thanks,
Jenny

@ctb

ctb commented Jan 23, 2024

IIRC, the on-disk index for branchwater is 1-2 TB. The raw data is in the 8-10 TB range (and that's something that we can search using the plugin). So, umm, probably a bit too large for download :).

@luizirber would the branchwater web site work faster if a signature with a higher scaled value were used? e.g. 100,000 rather than 1000? Does that even work? cc @bluegenes

@luizirber

luizirber commented Jan 30, 2024

Couple of notes on this:

  • There is a 5 MB limit on the size of the signature (here) when it is received by the search server. It was mostly set up to avoid abuse, but there is no good feedback in the web frontend when a run fails because of it.
  • The web frontend will still send a larger signature, even though it is guaranteed to fail. Maybe add a check here to see if the signature is > 5 MB? Or maybe before this code block.
  • We can test the idea of running with a higher scaled (100,000) with a very, very ugly hack: create a sig with s=100000, then edit the sig file to use the max_hash value for s=1000, and search should work (the only check is for the same k and scaled as the index, and all the data for s=100000 is contained in the s=1000 index). See the sketch after this list.
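
Roughly, the hack from the last item could look like this (a minimal sketch, assuming sourmash's JSON signature format; big.sig and fake.sig are hypothetical filenames, and the max_hash computation approximates sourmash's FracMinHash convention, max_hash ~= 2**64 / scaled):

import json

FAKE_SCALED = 1000

# big.sig: a signature actually built with scaled=100000
with open("big.sig") as f:
    records = json.load(f)

for record in records:
    for sketch in record["signatures"]:
        # claim the sparse s=100000 sketch was built at s=1000; every hash in
        # it is also in the s=1000 index, so containment is still estimated
        # correctly (just from fewer hashes). If the server compares max_hash
        # exactly, copy the value from a genuine s=1000 signature instead.
        sketch["max_hash"] = 2**64 // FAKE_SCALED

with open("fake.sig", "w") as f:
    json.dump(records, f)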

In fact, I tried this crime from the last item with a rumen metagenome I had around (98M originally, 285k after downsampling to 100,000), and got 171,599 results back, 285 of those above 20% containment.

So yeah, this definitely works. We can change

  • the frontend, to allow selecting a scaled value when doing the sketching,
  • and the backend, to either accept higher scaled values than what the index was built with (because, again, we have the right data for these searches) or do downsampling on the fly (see the sketch below). I prefer the first option =]
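
For context, on-the-fly downsampling is cheap for FracMinHash sketches, because a higher-scaled sketch is just the subset of a lower-scaled sketch's hashes that fall under the smaller max_hash. A minimal sketch of the idea (a hypothetical helper, not actual branchwater backend code):

def downsample_mins(mins, new_scaled):
    # keep only the hashes at or below the new (smaller) max_hash;
    # max_hash ~= 2**64 / scaled under the FracMinHash convention
    new_max_hash = 2**64 // new_scaled
    return [h for h in mins if h <= new_max_hash]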

@luizirber

luizirber commented Jan 31, 2024

More crimes: I got a question about what these matches are, and going through SRA IDs manually is boring. But the search server only returns SRA IDs and containment values, so how can I get the same data the web frontend returns? Like this =]

Prepare a request and send to the web frontend:

$ curl -L -H "Content-Type: application/json" \
    --data-binary @<(echo "{\"signatures\": `jq ".| tostring" fake.sig`}") \
    https://branchwater.sourmash.bio/ > fake.json

(this wraps fake.sig, a k=21, s=100000 signature posing as an s=1000 signature, into the format expected by the branchwater-web API)
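
For anyone who prefers Python over curl/jq string-wrangling, here is a rough equivalent (a sketch assuming the third-party requests library; the URL and payload shape are copied from the curl command above):

import json
import requests

with open("fake.sig") as f:
    sig_text = f.read()

# the API expects the whole signature file serialized as a single
# JSON string under the "signatures" key
resp = requests.post(
    "https://branchwater.sourmash.bio/",
    json={"signatures": sig_text},
    timeout=600,  # large queries can take a while
)
resp.raise_for_status()

with open("fake.json", "w") as f:
    json.dump(resp.json(), f)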

Parsing JSON is also boring, so here is a long one-liner that reads fake.json, sorts by containment, filters to containment >= 0.2, and then counts the organism field to check where the matches are coming from:

$ jq '. | sort_by(.containment) | map(select(.containment >= 0.2)) | .[].organism' fake.json | sort | uniq -c | sort -nr
    134 "bovine gut metagenome"
     94 "gut metagenome"
     39 "Bos taurus"
     17 "metagenome"
     10 "sheep gut metagenome"
      2 "Cervus nippon"
      2 "bovine metagenome"

So yeah, definitely rumen metagenomes. But not only cow: there are also sheep and sika deer matches.

Finally, without filtering (all matches, even those at only 0.3% containment):

$ jq '. | sort_by(.containment) | .[].organism' fake.json | sort | uniq -c | sort -nr
    502 "bovine gut metagenome"
    404 "gut metagenome"
    148 "metagenome"
     77 "Bos taurus"
     26 "sheep gut metagenome"
     17 "bovine metagenome"
     14 "goat gut metagenome"
      5 "Ovis aries"
      5 "Bos indicus"
      3 "Elaphurus davidianus"
      3 "Dama dama"
      3 "Cervus nippon"
      3 "Cervus elaphus"
      3 "Cervus albirostris"
      3 "Capra hircus"
      2 "Rusa unicolor"
      2 "Axis porcinus"
      1 "soil metagenome"
      1 "lichen metagenome"
      1 "bioreactor sludge metagenome"
