Can Pypiper generate a DAG to guide the execution of commands that comprise a pipeline? #189

Open
zhangzhen opened this issue Jun 18, 2023 · 4 comments

Comments

@zhangzhen

As far as I know, Pypiper runs the commands of a pipeline sequentially, even when some of them could run concurrently. Do you plan to support concurrent execution in the near future?

Cheers,
Zhen Zhang

@vreuter
Member

vreuter commented Jun 19, 2023

Hey @zhangzhen, thanks for this question and idea. I'm not currently developing on pypiper, so I can't speak to definite development plans, but you're correct: there's currently no support for declaring dependencies among the steps or "stages" of a pipeline. Any concurrency would need to be implemented manually in a pipeline script. If you subclass Pipeline and define the stages, the implicit dependency structure among them is that they're sequential and executed serially.

I'll add, though, that I'd love this feature (DAG-like declaration of the relationships among the pipeline's steps, then automatic concurrent execution where possible, based on that structure) and would definitely use it! If you're interested in prototyping, I think a PR would be welcome, certainly by me and probably by the maintainers, though that's a question for @nsheff.
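
To make the idea concrete, here is a minimal sketch of what "declare the dependencies, then run whatever is ready concurrently" could look like using only the Python standard library (`graphlib`, Python 3.9+, and `concurrent.futures`). It is not pypiper code and does not reflect any existing or planned pypiper API; `run_dag` and the step names in the usage example are hypothetical names chosen for illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps, max_workers=4):
    """Run callables in dependency order, concurrently where the graph allows.

    tasks: dict mapping a step name to a zero-argument callable.
    deps:  dict mapping a step name to the set of step names it depends on.
    Returns the step names in the order in which they finished.
    (Hypothetical helper for illustration; not part of pypiper.)
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose dependencies have all completed.
            for name in sorter.get_ready():
                running[pool.submit(tasks[name])] = name
            # Block until at least one running step finishes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception raised by the step
                finished.append(name)
                sorter.done(name)  # unlocks steps that were waiting on this one
    return finished


if __name__ == "__main__":
    # Hypothetical steps: "align" runs first, "call" and "qc" may overlap,
    # and "report" waits for both of them.
    steps = {n: (lambda n=n: print("running", n)) for n in ("align", "call", "qc", "report")}
    edges = {"call": {"align"}, "qc": {"align"}, "report": {"call", "qc"}}
    print(run_dag(steps, edges))
```

Threads are a reasonable fit if each step mostly waits on an external process; steps that are CPU-bound Python code would need a process pool instead.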

@nsheff
Member

nsheff commented Jun 21, 2023

You are right that pypiper is really intended to run sequentially. Our mode of operating is to parallelize by sample, rather than by task within a pipeline. This has lots of advantages, and a few disadvantages -- but for most of the analysis we're doing, it makes a lot of sense and you won't gain any/much efficiency by parallelizing by task if you're parallelizing effectively by sample. Making your pipeline parallel by task also can add complexity to the pipeline, so it isn't always worth it.

That said, you can still make a pipeline parallelize tasks in pypiper if you need to; it's just not a built-in, recommended thing to do. If you want some guidance on how to do it, let me know and I can show you.

@nsheff
Member

nsheff commented Jun 21, 2023

And to directly answer your question: I am not planning to add task-level parallelization like this. But if you want to add it, I would consider a PR, as long as it was a simple solution that didn't complicate the codebase too much.

@zhangzhen
Author

zhangzhen commented Jun 21, 2023

I've built bioinformatics pipelines for NGS testing in clinical oncology for more than 5 years. Pyflow and Nextflow are the pipeline frameworks I use most of the time. Pyflow is lightweight and does well in sample-level analysis, while Nextflow is heavyweight and does well in batch-level analysis. However, both take a monolithic approach that makes them do more than they should. The modular approach you came up with is a better way to build pipeline frameworks. I love the philosophy behind tools such as looper, pypiper, and bulker, and it inspires me. Moreover, one of your posts helped me form a clearer picture of parallelism in bioinformatics. It's a bit of a pity that I only learned of the work you and your lab have done a few days ago.

> That said, you can still make a pipeline parallelize tasks in pypiper if you need to; it's just not a built-in, recommended thing to do. If you want some guidance on how to do it, let me know and I can show you.

Pipelines in clinical oncology do have such needs. After read mapping, variant calling (SNV/INDEL calling, CNV calling, SV calling, etc.) and QC are often performed simultaneously. Hey @nsheff, could you please show me how to parallelize tasks within a pipeline?

Thanks a lot!
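
For reference, the fan-out described above (mapping first, then SNV/INDEL, CNV, and SV calling plus QC side by side) could be sketched in plain Python roughly as follows. This is only an illustration under assumptions: it uses the standard library (`subprocess`, `concurrent.futures`) rather than any pypiper API, it is not necessarily the approach the maintainers have in mind, and `run_cmd`, `placeholder`, and `run_after_mapping` are hypothetical helpers. The commands are stand-ins that just print a label; real tool invocations would replace them.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_cmd(cmd):
    """Run one command (a list of arguments); raise if it exits with a nonzero code."""
    subprocess.run(cmd, check=True)
    return cmd


def placeholder(label):
    """Stand-in for a real tool invocation: a tiny Python process that prints its label."""
    return [sys.executable, "-c", f"print({label!r})"]


def run_after_mapping(mapping_cmd, downstream_cmds, max_workers=4):
    """Run the mapping step first, then all downstream commands concurrently."""
    run_cmd(mapping_cmd)  # serial step: every downstream command needs its output
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() collects every result, so a failure in any command is re-raised here.
        return list(pool.map(run_cmd, downstream_cmds))


if __name__ == "__main__":
    downstream = [placeholder(name) for name in ("snv_indel", "cnv", "sv", "qc")]
    run_after_mapping(placeholder("mapping"), downstream)
```

Each worker thread only waits on its child process, so the downstream commands overlap in time; `max_workers` caps how many run at once and should be set with the node's cores and memory in mind.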
