Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bypassing ordering step #32

Open
sumit-walia opened this issue Aug 10, 2022 · 4 comments
Open

Bypassing ordering step #32

sumit-walia opened this issue Aug 10, 2022 · 4 comments

Comments

@sumit-walia
Copy link

I am building PanGraph for Sars-CoV2 sequences (around 50k). I observed that the ordering step is a bottleneck here. Is it possible to pass a tree as an argument and bypass the ordering step in PanGraph? Also, I have tried adding "-d mash" as an argument to check the speedup but it throws an error instead.

Any help would be appreciated.

Thanks

@mmolari
Copy link
Collaborator

mmolari commented Aug 11, 2022

Hi @sumit-walia,

at the moment we do not have an option to pass a custom ordering, but it's an interesting suggestion. It's something we might consider adding at a later iteration.

In the meantime I can try to help with the problem with mash, and see if this speeds things up. Would you mind sharing the error message you get? One possibility is that the mash command is missing from you path. A requirement for the -d mash option to work is that you need to have the mash command installed and available in your path. Can be also installed as a conda package.

Aside from that, I never tried building graphs with many (>1k) short (<1Mbp) genomes. I'd be curious to know how pangraph behaves in this regime. Depending on what your question is, it might be important to tune the -a,-b, -l parameters.

Best,
Marco

@sumit-walia
Copy link
Author

sumit-walia commented Aug 11, 2022

Hi @mmolari,

I have installed mash and it is available in my path. -d mash option gives the following error:

ERROR: TaskFailedException Stacktrace: [1] wait @ ./task.jl:334 [inlined] [2] fetch @ ./task.jl:349 [inlined] [3] (::PanGraph.Graphs.Shell.var"#7#12"{NTuple{3987, PanGraph.Graphs.Graph}, PanGraph.Graphs.Shell.var"#compare#10"})(path::String, io::IOStream) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:67 [4] mktemp(fn::PanGraph.Graphs.Shell.var"#7#12"{NTuple{3987, PanGraph.Graphs.Graph}, PanGraph.Graphs.Shell.var"#compare#10"}, parent::String) @ Base.Filesystem ./file.jl:722 [5] mktemp(fn::Function) @ Base.Filesystem ./file.jl:720 [6] mash(::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:57 [7] ordering(::Function, ::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}) @ PanGraph.Graphs.Align ~/tools/pangraph/src/align.jl:313 [8] align(::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}; compare::Function, energy::Function, minblock::Int64, reference::Nothing, sensitivity::String, maxiter::Int64) @ PanGraph.Graphs.Align ~/tools/pangraph/src/align.jl:634 [9] (::PanGraph.var"#1#4")(args::Vector{String}) @ PanGraph ~/tools/pangraph/src/build.jl:126 [10] run(cmd::PanGraph.Commands.Command, args::Vector{String}) @ PanGraph.Commands ~/tools/pangraph/src/args.jl:182 [11] main(args::Vector{String}) @ PanGraph ~/tools/pangraph/src/PanGraph.jl:87 [12] julia_main() @ PanGraph ~/tools/pangraph/src/PanGraph.jl:92 [13] top-level scope @ none:1 nested task error: ArgumentError: input string is empty or only contains whitespace Stacktrace: [1] tryparse_internal(#unused#::Type{Int64}, s::String, startpos::Int64, endpos::Int64, base_::Int64, raise::Bool) @ Base ./parse.jl:109 [2] parse(::Type{Int64}, s::String; base::Nothing) @ Base ./parse.jl:241 [3] parse @ ./parse.jl:241 [inlined] [4] (::PanGraph.Graphs.Shell.var"#compare#10")(path::String) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:43 [5] (::PanGraph.Graphs.Shell.var"#9#14"{String, PanGraph.Graphs.Shell.var"#compare#10"})() @ PanGraph.Graphs.Shell ./task.jl:423

Thanks,
Sumit

@mmolari
Copy link
Collaborator

mmolari commented Aug 12, 2022

Hi @sumit-walia,

from the error message it looks like the code is failing on this line. This is where the output of mash triangle is parsed. In particular here the content of the first line of the output is captured. This should contain simply the total number of sequences, but from the error message it seems like the first line is empty.

I tried reproducing the issue on my end. I downloaded ~3000 sars-cov-2 sequences from NCBI and run pangraph with the command:

pangraph build -d mash sequences.fa

but I could not reproduce the error. Pangraph executed successfully. However when inspecting the output graph I noticed that a single genome was missing from the graph. After some debugging I found out that this was due to an input-output stream not being flushed between a write and a read operation. Now the problem should be fixed. Could you try pulling again the master branch and testing if this solves the problem in your case?

If not, here are some further checks that we could do to try to figure out where the error comes from:

  • could you check that the version of mash you have corresponds to >=2.3?
  • could you try running mash triangle sequences.fa on your set of sequences? In particular could you check that the first line of the output is an integer corresponding to the number of sequences in your dataset?
  • could you share the exact command you are running pangraph with?
  • could you try to run it with a subset of sequences (e.g. 1000) and see if the problem persists?

Thanks for your feedback, it's much appreciated!

Marco

@sumit-walia
Copy link
Author

Hi, I realized that the error was caused by an outdated version of Mash that I was using. Although I can now use -d mash, it still runs slowly.

Thanks,
Sumit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants