Best-practice of cross-workflow specification of files #20

Open
SilasK opened this issue May 30, 2023 · 3 comments

Comments


SilasK commented May 30, 2023

I would like to discuss the best way to specify files so that they can be used across workflows.

Take the example of two workflows, e.g.:

Workflow 1: reads --> assembly

Workflow 2: assembly + reads --> assembly statistics ...

What is the best way to specify the reads and the assembly so that they can be used by different workflows?
Take into account that:
Requirement A: The reads might be used in multiple places in Workflow 2.
Requirement B: The reads will probably be used to infer the total number of samples in the target rule.

With sub-workflows, it would be possible to refer to a file of the other workflow as otherworkflow(file).

But I think the recommended way now is to use modules and to import the rules of Workflow 1 and 2 into a new workflow.
But then I need to know which rules I have to modify to adapt the file specification. This would necessarily have to be documented in the README of each workflow.

I don't see how this can be done without massively modifying many rules of an imported workflow.

Any thoughts?


ning-y commented May 30, 2023

Here's a first attempt:

Workflow 1: the input reads are determined by a YAML configuration file, and the final assembly file is tagged, either in its contents (e.g. header lines) or in its filename, with a hash representing the input reads used to generate it (e.g. a hash of the read hashes).

Workflow 2 also takes its input reads and input assembly from a YAML configuration file. It checks, either on each run or through a dummy output, that the assembly's record of which input reads were used to generate it matches the set of input reads it was given.
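
The "hash of read hashes" tag and the consistency check could be sketched roughly as below. This is only an illustration under assumptions of mine, not something defined in this thread: the function names, the choice of SHA-256, and the `<sample>.<tag>.fasta.gz` filename layout are all hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reads_fingerprint(read_paths):
    """Hash of the sorted per-file hashes, so the order of inputs does not matter."""
    per_file = sorted(file_sha256(p) for p in read_paths)
    return hashlib.sha256("\n".join(per_file).encode()).hexdigest()


def tag_assembly_name(sample, fingerprint, length=12):
    """Hypothetical naming scheme: embed a shortened fingerprint in the filename."""
    return f"{sample}.{fingerprint[:length]}.fasta.gz"


def assembly_matches_reads(assembly_path, read_paths, length=12):
    """Check that the tag in the assembly filename matches the given reads."""
    parts = Path(assembly_path).name.split(".")
    # Assumed layout: <sample>.<tag>.fasta.gz, so the tag is third from the end.
    if len(parts) < 4:
        return False
    return parts[-3] == reads_fingerprint(read_paths)[:length]
```

Workflow 1 would call `tag_assembly_name` when naming its output, and Workflow 2 would call `assembly_matches_reads` before using the assembly. Hashing the file contents means that renaming or moving the reads does not invalidate the tag, at the cost of reading every file once.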

Author

SilasK commented May 31, 2023

So your idea would be to define the paths to the files in the config?

Something like:

config.yaml

read_file_format: "QC/qc_reads/{sample}_{fraction}.fastq.gz"
assembly_file_format: "Assembly/assemblies/{sample}.fasta.gz"
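
To illustrate how such path templates could serve both requirements (building the read paths wherever they are needed, and inferring the sample list from the reads), here is a minimal plain-Python sketch. The helper names and the fraction labels `R1`/`R2` are assumptions for the example. Inside a Snakefile, Snakemake's own `expand` and `glob_wildcards` cover the same ground.

```python
import re

# Mirrors the two config entries above.
config = {
    "read_file_format": "QC/qc_reads/{sample}_{fraction}.fastq.gz",
    "assembly_file_format": "Assembly/assemblies/{sample}.fasta.gz",
}


def pattern_to_regex(pattern):
    """Turn '{name}' placeholders into named regex groups.

    Assumes each placeholder name occurs only once in the pattern.
    """
    # re.split with one capture group alternates literal text and placeholder names.
    pieces = re.split(r"\{(\w+)\}", pattern)
    out = []
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            out.append(re.escape(piece))
        else:
            out.append(f"(?P<{piece}>[^/]+)")
    return re.compile("".join(out))


def infer_samples(pattern, paths):
    """Return the sorted set of sample names found among the given paths."""
    regex = pattern_to_regex(pattern)
    samples = set()
    for path in paths:
        match = regex.fullmatch(path)
        if match:
            samples.add(match.group("sample"))
    return sorted(samples)


def read_paths(sample, fractions=("R1", "R2")):
    """Build the read paths of one sample (fraction labels are an assumption)."""
    template = config["read_file_format"]
    return [template.format(sample=sample, fraction=f) for f in fractions]
```

Because the groups match greedily, a sample name containing underscores (e.g. `s_1_R1`) is split at the last underscore, which gives sample `s_1` and fraction `R1`. Fraction labels that themselves contain underscores would need a stricter pattern.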

Author

SilasK commented May 31, 2023

One could also use a TSV file, with the names of its column headers specified in a config file.

Ideally this would use PEPs (Portable Encapsulated Projects), which Snakemake supports: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#configuring-scientific-experiments-via-peps
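
As a rough sketch of the TSV idea, the config could map logical roles (sample, reads, assembly) to whatever column headers a given sample table happens to use. The role names, headers, and the `load_sample_table` helper below are all hypothetical, and the sketch uses only the Python standard library rather than the PEP tooling.

```python
import csv

# Hypothetical config: logical role -> column header used in the TSV file.
config = {
    "columns": {
        "sample": "sample_id",
        "reads_r1": "fq1",
        "reads_r2": "fq2",
        "assembly": "assembly",
    },
}


def load_sample_table(handle, columns):
    """Read a TSV sample table into {sample: {role: path}} using configured headers."""
    reader = csv.DictReader(handle, delimiter="\t")
    present = reader.fieldnames or []
    missing = [header for header in columns.values() if header not in present]
    if missing:
        raise ValueError(f"sample table lacks columns: {missing}")
    table = {}
    for row in reader:
        sample = row[columns["sample"]]
        # Store every configured role except the sample name itself.
        table[sample] = {
            role: row[header]
            for role, header in columns.items()
            if role != "sample"
        }
    return table
```

With this layout, the number of samples is simply `len(table)`, and each workflow looks up files by role, so only the config has to change when the files live elsewhere.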
