Bypassing ordering step #32

sumit-walia · 2022-08-10T20:31:49Z

I am building PanGraph for Sars-CoV2 sequences (around 50k). I observed that the ordering step is a bottleneck here. Is it possible to pass a tree as an argument and bypass the ordering step in PanGraph? Also, I have tried adding "-d mash" as an argument to check the speedup but it throws an error instead.

Any help would be appreciated.

Thanks

mmolari · 2022-08-11T07:44:46Z

Hi @sumit-walia,

at the moment we do not have an option to pass a custom ordering, but it's an interesting suggestion. It's something we might consider adding at a later iteration.

In the meantime I can try to help with the problem with mash, and see if this speeds things up. Would you mind sharing the error message you get? One possibility is that the mash command is missing from you path. A requirement for the -d mash option to work is that you need to have the mash command installed and available in your path. Can be also installed as a conda package.

Aside from that, I never tried building graphs with many (>1k) short (<1Mbp) genomes. I'd be curious to know how pangraph behaves in this regime. Depending on what your question is, it might be important to tune the -a,-b, -l parameters.

Best,
Marco

sumit-walia · 2022-08-11T17:01:29Z

Hi @mmolari,

I have installed mash and it is available in my path. -d mash option gives the following error:

ERROR: TaskFailedException Stacktrace: [1] wait @ ./task.jl:334 [inlined] [2] fetch @ ./task.jl:349 [inlined] [3] (::PanGraph.Graphs.Shell.var"#7#12"{NTuple{3987, PanGraph.Graphs.Graph}, PanGraph.Graphs.Shell.var"#compare#10"})(path::String, io::IOStream) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:67 [4] mktemp(fn::PanGraph.Graphs.Shell.var"#7#12"{NTuple{3987, PanGraph.Graphs.Graph}, PanGraph.Graphs.Shell.var"#compare#10"}, parent::String) @ Base.Filesystem ./file.jl:722 [5] mktemp(fn::Function) @ Base.Filesystem ./file.jl:720 [6] mash(::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:57 [7] ordering(::Function, ::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}) @ PanGraph.Graphs.Align ~/tools/pangraph/src/align.jl:313 [8] align(::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}; compare::Function, energy::Function, minblock::Int64, reference::Nothing, sensitivity::String, maxiter::Int64) @ PanGraph.Graphs.Align ~/tools/pangraph/src/align.jl:634 [9] (::PanGraph.var"#1#4")(args::Vector{String}) @ PanGraph ~/tools/pangraph/src/build.jl:126 [10] run(cmd::PanGraph.Commands.Command, args::Vector{String}) @ PanGraph.Commands ~/tools/pangraph/src/args.jl:182 [11] main(args::Vector{String}) @ PanGraph ~/tools/pangraph/src/PanGraph.jl:87 [12] julia_main() @ PanGraph ~/tools/pangraph/src/PanGraph.jl:92 [13] top-level scope @ none:1 nested task error: ArgumentError: input string is empty or only contains whitespace Stacktrace: [1] tryparse_internal(#unused#::Type{Int64}, s::String, startpos::Int64, endpos::Int64, base_::Int64, raise::Bool) @ Base ./parse.jl:109 [2] parse(::Type{Int64}, s::String; base::Nothing) @ Base ./parse.jl:241 [3] parse @ ./parse.jl:241 [inlined] [4] (::PanGraph.Graphs.Shell.var"#compare#10")(path::String) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:43 [5] (::PanGraph.Graphs.Shell.var"#9#14"{String, PanGraph.Graphs.Shell.var"#compare#10"})() @ PanGraph.Graphs.Shell ./task.jl:423

Thanks,
Sumit

mmolari · 2022-08-12T11:33:49Z

Hi @sumit-walia,

from the error message it looks like the code is failing on this line. This is where the output of mash triangle is parsed. In particular here the content of the first line of the output is captured. This should contain simply the total number of sequences, but from the error message it seems like the first line is empty.

I tried reproducing the issue on my end. I downloaded ~3000 sars-cov-2 sequences from NCBI and run pangraph with the command:

pangraph build -d mash sequences.fa

but I could not reproduce the error. Pangraph executed successfully. However when inspecting the output graph I noticed that a single genome was missing from the graph. After some debugging I found out that this was due to an input-output stream not being flushed between a write and a read operation. Now the problem should be fixed. Could you try pulling again the master branch and testing if this solves the problem in your case?

If not, here are some further checks that we could do to try to figure out where the error comes from:

could you check that the version of mash you have corresponds to >=2.3?
could you try running mash triangle sequences.fa on your set of sequences? In particular could you check that the first line of the output is an integer corresponding to the number of sequences in your dataset?
could you share the exact command you are running pangraph with?
could you try to run it with a subset of sequences (e.g. 1000) and see if the problem persists?

Thanks for your feedback, it's much appreciated!

Marco

sumit-walia · 2023-05-22T22:43:41Z

Hi, I realized that the error was caused by an outdated version of Mash that I was using. Although I can now use -d mash, it still runs slowly.

Thanks,
Sumit

sumit-walia closed this as completed May 22, 2023

sumit-walia reopened this May 22, 2023

conmeehan mentioned this issue Aug 22, 2023

Non-boolean error in pangraph export #59

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bypassing ordering step #32

Bypassing ordering step #32

sumit-walia commented Aug 10, 2022

mmolari commented Aug 11, 2022

sumit-walia commented Aug 11, 2022 •

edited

mmolari commented Aug 12, 2022

sumit-walia commented May 22, 2023

Bypassing ordering step #32

Bypassing ordering step #32

Comments

sumit-walia commented Aug 10, 2022

mmolari commented Aug 11, 2022

sumit-walia commented Aug 11, 2022 • edited

mmolari commented Aug 12, 2022

sumit-walia commented May 22, 2023

sumit-walia commented Aug 11, 2022 •

edited