
Workflow output definition #1227

Draft · wants to merge 15 commits into base: dev

Conversation

bentsherman

This PR adds a workflow output definition based on nextflow-io/nextflow#4784. I'm still working through the pipeline, but once I'm done, I will have completely replaced publishDir using the output DSL.

See also nf-core/fetchngs#275 for ongoing discussion
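
Roughly, the idea is to replace the per-process publishDir directives with a single top-level output block along these lines (the exact syntax is still evolving in the linked Nextflow PR):

    output {
        directory params.outdir
        mode      params.publish_dir_mode
    }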


github-actions bot commented Feb 28, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 783ff86

+| ✅ 170 tests passed       |+
#| ❔   7 tests were ignored |#
!| ❗   7 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!

❔ Tests ignored:

  • files_exist - File is ignored: conf/modules.config
  • nextflow_config - Config default ignored: params.ribo_database_manifest
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
  • multiqc_config - 'assets/multiqc_config.yml' not found

✅ Tests passed:

Run details

  • nf-core/tools version 2.13
  • Run at 2024-02-28 22:32:03

@bentsherman changed the title from Add output definition to Workflow output DSL · Feb 28, 2024
@pinin4fjords
Member

Liking this. If we really need the output block (rather than doing something with emit), this is a nice readable way of doing it.

@adamrtalbot
Contributor

This is beginning to look great. All the publishing logic is in one location, easy to review and understand where it's coming from. There are two downsides to this approach:

  1. You need to track back to the channel to find what's in there, which could be a little tricky.
  2. It's quite verbose (there's a lot of text in one place). But then I would prefer explicit and verbose to implicit and concise.

@maxulysse
Member


Agreeing with Adam: it's a bit too implicit, especially regarding what is a path and what is a topic.

@bentsherman
Author

In the Nextflow PR there are some docs which explain the feature in more detail. Unfortunately the deploy preview isn't working, so you'll have to look at the diff.

You need to track back to the channel to find what's in there, which could be a little tricky.

Indeed, this is the downside of selecting channels instead of processes: more flexible, but more layers of indirection. We should be able to alleviate this with IDE tooling, i.e. hover over a selected channel to see its definition.

If we really need the output block (rather than doing something with emit), this is a nice readable way of doing it

Thanks @pinin4fjords, I never responded to your idea about putting everything in the emit section, but basically I think that would be way too cumbersome; imagine trying to fit the rnaseq outputs into the emits 😅

The main question now is how to bring the outputs for PREPARE_GENOME and RNASEQ up to the top-level workflow. I was thinking of some kind of include statement, otherwise we would have to pass a LOT of channels up through emits and/or topics.

@bentsherman
Author

The current prototype simply maps the output channels to the publish directory structure, but we still need to get these outputs to the top level whereas currently they are nested under NFCORE_RNASEQ:...

Before I go off and add a gajillion channels to the emit section, I'd like to see if I can simplify things with topics.

@adamrtalbot @pinin4fjords @maxulysse @ewels Since you guys understand this pipeline better than me, I'm wondering, how would you group all of these outputs if you could group them any way you want? You are no longer restricted to process selectors or directory names, but you could use those if you wanted.

For example, I see the modules config for RNASEQ is grouped with these comments:

  • STAR Salmon alignment
  • General alignment
  • bigwig coverage
  • DESeq2 QC
  • Pseudo-alignment

Would those be good top-level groupings for outputs? Then you might have topics called align-star-salmon, align, bigwig, deseq2, etc. Or would you organize it differently?

@bentsherman
Author

I managed to move everything to the top-level workflow, so it should be executable now (though there are likely some bugs, will test tomorrow).

I ended up using topics for everything, using the various publish directories to guide the topic names. Hope this gives you a more concrete sense of how topics are useful.

The topics don't really reduce the amount of code; they just split it between the output DSL and the workflow topic: section. In a weird way, this provides some modularity, since workflows can define an ontology of topics which can in turn be used by the output DSL for publishing.
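
To make that concrete, the pattern looks roughly like this (the workflow and process names here are just placeholders):

    // in a (sub)workflow: declare which channels go to which topic
    workflow MY_SUBWORKFLOW {
        main:
        // ...
        topic:
        SOME_PROCESS.out.log >> 'align-star-log'
    }

    // in the output definition: publish each topic to a directory
    output {
        directory params.outdir

        "${params.aligner}" {
            'log' {
                from 'align-star-log'
            }
        }
    }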

@pinin4fjords
Member

As Evan mentioned on Slack, this does seem very verbose:

    QUANTIFY_STAR_SALMON.out.results                        >> 'align'
    QUANTIFY_STAR_SALMON.out.tpm_gene                       >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_gene                    >> 'align'
    QUANTIFY_STAR_SALMON.out.lengths_gene                   >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_gene_length_scaled      >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_gene_scaled             >> 'align'
    QUANTIFY_STAR_SALMON.out.tpm_transcript                 >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_transcript              >> 'align'
    QUANTIFY_STAR_SALMON.out.lengths_transcript             >> 'align'

But I understand why, since if even one of the outputs from a process needs to go to a different topic then you can't use the multi-channel object QUANTIFY_STAR_SALMON.out.

Rather than doing this from the calling workflow, could e.g. QUANTIFY_STAR_SALMON use a topic as part of its emit, to 'suggest' a classification for that channel?

emit:
    results = ch_pseudo_results, topic = 'tables'                             

Then, if we all used good standards (e.g. an ontology for topics for outputs), calling workflows could have very minimal logic for this, relying on what the components said about their outputs. The calling workflow would only need to decide what to do with the topics in its outputs.

@bentsherman
Author

We can definitely move some of these topic mappings into the modules and subworkflows, that was going to be my next step. I also suspect that nf-core will be able to converge on a shared ontology for these things.

I'd still rather keep the topic mapping separate from the emits, though, as we will need the topic: section either way and we're trying to minimize the number of ways to do the same thing.

@bentsherman
Author

I moved most of the topic mappings into their respective subworkflows. It gets tricky when a workflow is used multiple times under different names and with different publish behavior.

For example, QUANTIFY_PSEUDO_ALIGNMENT is used twice in RNASEQ, once as itself and once as the alias QUANTIFY_STAR_SALMON. One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".
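
For reference, the aliasing is just an include ... as (the path below is illustrative):

    include { QUANTIFY_PSEUDO_ALIGNMENT                         } from '../subworkflows/local/quantify_pseudo_alignment'
    include { QUANTIFY_PSEUDO_ALIGNMENT as QUANTIFY_STAR_SALMON } from '../subworkflows/local/quantify_pseudo_alignment'

Both aliases run the same code, so any topic mapping declared inside the subworkflow applies to both.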

I can't set a "sensible default" in the subworkflow because I can't override the default later; I can only specify additional topics. Or I could specify a default and not use it in the output definition for rnaseq, instead re-mapping each alias to different topics as I am currently doing.

However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic: section:

  topic:
  if (!params.skip_alignment && params.aligner == 'star_rsem') {
    DESEQ2_QC_RSEM.out.rdata                >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.pca_txt              >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.pdf                  >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.dists_txt            >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.size_factors         >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.log                  >> 'align-deseq2'
  }

Totally doable, but unfortunate if we have to resort to it.

@adamrtalbot noted in Slack that most Nextflow pipelines don't come close to this level of complexity, so I wouldn't be opposed to moving forward with what we have and letting the rnaseq maintainers sort out the details. Though we do need to address the last point about conditional topic mappings.

@pinin4fjords
Member

I'm loving the principle:

    "${params.aligner}" {
        'log' {
            from 'align-star-log'
        }

        from 'align-star-intermeds'

        'unmapped' {
            from 'align-star-unaligned'
        }
    }

The multi-import thing didn't occur to me. Could we use a variable sent in via the meta or some such to control the topic something gets sent to?

@adamrtalbot
Contributor

adamrtalbot commented Apr 10, 2024

For example, QUANTIFY_PSEUDO_ALIGNMENT is used twice in RNASEQ, once as itself and once as the alias QUANTIFY_STAR_SALMON. One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".

In this case I would manipulate the channel to what I wanted. If I had to use a topic I would use them at the last second. So again, as long as topics are optional I think everything can be handled reasonably well.

However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic: section:

Presumably, if a topic is empty it just doesn't publish anything? So you could add stuff from an empty channel and you would end up with an empty topic. In your example, it would make more sense to fix the rnaseq code so it doesn't rely on lots of if statements, which would end up looking like this:

  topic:
  deseq2_qc_rdata        >> 'align-deseq2'
  deseq2_qc_pca_txt      >> 'align-deseq2'
  deseq2_qc_pdf          >> 'align-deseq2'
  deseq2_qc_dists_txt    >> 'align-deseq2'
  deseq2_qc_size_factors >> 'align-deseq2'
  deseq2_qc_log          >> 'align-deseq2'

Even better, just tidy up the channels before making the topic:

  topic:
  deseq2_qc >> 'align-deseq2'
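
i.e. build that combined channel earlier in the workflow body, something like (a sketch using the channel names above):

    // merge the individual QC outputs into a single channel before the topic mapping
    deseq2_qc = deseq2_qc_rdata
        .mix(deseq2_qc_pca_txt)
        .mix(deseq2_qc_pdf)
        .mix(deseq2_qc_dists_txt)
        .mix(deseq2_qc_size_factors)
        .mix(deseq2_qc_log)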

I think my overall impression is topics are a nice sugar on top of existing channels, in which case most of the key logic should be in the channel manipulations. Topics are a way of turning a local channel into a global one and should do very little else.

One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".

That sounds like a bug 😆

@bentsherman
Author

Notes on the latest update:

  • Topics are no longer used. Nextflow simply maintains a global map of channels to "rules" under the hood.

  • The output DSL is no longer a potentially nested directory structure; it's just a flat list of rules. Each rule can specify publish options for channels that are sent to the rule.

  • In principle, the rule name can be anything. In practice, it is convenient to make it the default publish path. If you're happy with that, you don't need to configure anything else and Nextflow will use it as the publish path.

  • Processes and workflows can have a publish: section to define these mappings. A process can map emits to rules; a workflow can map channels to rules.

  • The output DSL is used only to (1) set the output directory, (2) set default publish options like mode, and (3) customize rules as needed.

  • In general, rules need to be customized only when the path should be different or additional options like enabled are needed. If you can align your output directory with the module/workflow defaults, your output definition can be quite short (see fetchngs).

  • If a process maps some emits to some rules and is then invoked by a workflow, the workflow can re-map the process outputs to different rules and overwrite the process defaults, and so on with workflows and subworkflows, etc.

Overall, everything is much more concise and more in line with what many people have suggested: simply annotate the workflows with the publish paths. The output definition is no longer a comprehensive view of all outputs, but there is a degree of modularity, and you can be verbose in the output definition if you want to.
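
Putting those pieces together, the pattern looks roughly like this (the module name, rule name, and params are just placeholders):

    // module side: a publish: section maps emits to named rules;
    // the rule name doubles as the default publish path
    process FOO {
        output:
        path '*.tsv', emit: tables

        publish:
        tables >> 'foo/tables'

        script:
        """
        touch results.tsv
        """
    }

    // pipeline side: the output block only customizes the rules that need it
    output {
        directory params.outdir
        mode      params.publish_dir_mode

        'foo/tables' {
            enabled false   // e.g. turn publishing off for this rule
        }
    }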

@adamrtalbot thanks for your comments, it makes me feel more confident about the prototype. I think all of the remaining TODOs can be addressed by refactoring some dataflow logic, which can be handled by the rnaseq devs.

@pinin4fjords
Member

Really liking the way this is going now, it's going to be very tidy.

Would it be feasible at some point to use some optional dynamism in the modules, to facilitate repeated usage?

    publish:
    ch_orig_bam         >> "star_salmon/intermeds/${meta.publish_suffix}/"

@bentsherman
Author

Would it be feasible at some point to use some optional dynamism in the modules, to facilitate repeated usage?

Maybe in a future iteration. But related to this, we are interested in building on the concept of the samplesheet as a way to hold metadata for file collections in general, and it might be a better practice than trying to encode metadata in the filenames.

For example, Paolo has proposed that we have a method in the output DSL to automatically publish an index file for a given "rule":

output {
  directory 'results'

  'star_salmon/intermeds/' {
    index 'index.csv'
  }
}

star_salmon/intermeds/index.csv would then look like:

sample_id,bam
sample001,results/star_salmon/intermeds/sample001.bam
sample002,results/star_salmon/intermeds/sample002.bam
sample003,results/star_salmon/intermeds/sample003.bam

Of course you could also do this manually like in fetchngs, and I would like to add a stdlib function like mergeCsv to make it easier, but the index method would be a convenient solution for the most common and simple cases. Either way, you can just query the index file instead of inspecting the file names.
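
For example, the manual version could be as simple as this with the existing collectFile operator (the channel shape here is illustrative):

    // build an index CSV from a channel of (meta, bam) pairs
    ch_bam
        .map { meta, bam -> "${meta.id},${bam}\n" }
        .collectFile(name: 'index.csv', seed: 'sample_id,bam\n', storeDir: "${params.outdir}/star_salmon/intermeds")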

@bentsherman changed the title from Workflow output DSL to Workflow publish definition · Apr 24, 2024
@bentsherman
Author

The redirect to null simplifies the top-level publish def somewhat. The remaining rules could also be moved into the workflow defs since they only rename paths. It just might be more verbose since you would have to remap each channel instead of the target name.

It seems like the best delineation for what goes in the top-level publish block vs the workflow publish sections is: the workflows define what is published (including conditional logic), while the top-level publish definition defines how things are published (mode, whether to overwrite, content type, tags, etc.). This is also good for modularity.

Note that some subworkflows are now using params, which is an anti-pattern. For this I recommend passing those params as workflow inputs to keep things modular.

@pinin4fjords
Member

Note that some subworkflows are now using params which is an anti-pattern. For this I recommend passing those params as workflow inputs to keep things modular.

We have been trying to eliminate that when we see it.

Comment on lines +168 to +171
output {
directory params.outdir
mode params.publish_dir_mode
}
Member

Can this be set in nextflow.config @bentsherman?

Author

Not currently, but it should be considered as a future improvement. If we don't need the target-specific config in the output block, it would be better IMO to make these config settings instead of pipeline code + params.

Member

I would prefer this as a scope in the config rather than in the main.nf.

I was also thinking of allowing different modes for files matching different patterns or properties, e.g. large files being symlinked rather than copied, but this is an extra.

Author

I agree that they should be config. It's much less flexible for the user if you have to define a param just to make it configurable.

For now you can customize things like the mode for specific targets, though that is not as granular as what you describe.
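
Something along these lines in nextflow.config would be the natural end state (purely hypothetical syntax, just to make the idea concrete; nothing like this is implemented yet):

    // nextflow.config (hypothetical)
    output {
        directory = params.outdir
        mode      = 'copy'
    }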

@bentsherman changed the title from Workflow publish definition to Workflow ouput definition · May 17, 2024
@bentsherman changed the title from Workflow ouput definition to Workflow output definition · May 17, 2024
@mahesh-panchal (Member) left a comment

I like how it simplifies things quite a bit.

My main issue with this is that the publish path is spread across so many files, which probably means we shouldn't be using this in nf-core in any of the reusable/modular parts, especially if we can't turn anything off without having params for everything.

Paths that need metadata like the sample name are not here, but I guess they're directly accessible in the process?

Comment on lines +27 to +28
publish:
pdf >> 'deseq2'
Member

It was surprising to still see the publish directive in the process block. Why this way, then, instead of:

 output:
    path "*.pdf"                , optional:true, emit: pdf, publish: 'deseq2'

i.e. making publish an extra option for the path output type.

Author

I guess I wanted the publish definition to be in one place so that it's easier to review. But also, I think the separate publish: section will work better with statically typed outputs.


@@ -14,6 +14,9 @@ process CAT_FASTQ {
tuple val(meta), path("*.merged.fastq.gz"), emit: reads
path "versions.yml" , emit: versions

publish:
reads >> 'cat/fastq'
Member

How would a pipeline developer turn this off? If I include this module from nf-core/modules in my own pipeline, but don't want to publish the output from this, what can I do?

Author

You can redirect the output to null:

workflow {
  CAT_FASTQ()
  publish:
  CAT_FASTQ.out >> null
}

And Paolo just added an enabled option so that you can control it from the output block:

output {
  'cat/fastq' {
    enabled false
  }
}

@@ -62,6 +62,16 @@ workflow ALIGN_STAR {
BAM_SORT_STATS_SAMTOOLS ( ch_orig_bam, fasta )
ch_versions = ch_versions.mix(BAM_SORT_STATS_SAMTOOLS.out.versions)

publish:
ch_orig_bam >> (params.save_align_intermeds || params.save_umi_intermeds ? 'star_salmon/' : null)
Member

This is fine for specific pipelines, but it makes module reusability across pipelines more cumbersome. If we need ternary operators in every nf-core module to control whether an output is published, this would make the pipeline schema potentially huge.

Author

I agree, I cut some corners here by not passing the params as workflow inputs. In the final implementation I would do that, so that you can choose whether to expose it as a param in your own pipeline.
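
e.g. something along these lines for ALIGN_STAR (a rough sketch, keeping the prototype publish syntax; the other inputs are elided):

    workflow ALIGN_STAR {
        take:
        reads
        fasta
        save_align_intermeds    // boolean passed in by the caller instead of read from params

        main:
        // ...

        publish:
        ch_orig_bam >> (save_align_intermeds ? 'star_salmon/' : null)
    }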

@@ -26,6 +26,10 @@ workflow FASTQ_ALIGN_HISAT2 {
BAM_SORT_STATS_SAMTOOLS ( HISAT2_ALIGN.out.bam, ch_fasta )
ch_versions = ch_versions.mix(BAM_SORT_STATS_SAMTOOLS.out.versions)

publish:
HISAT2_ALIGN.out.bam >> (params.save_align_intermeds ? 'hisat2/' : null)
Member

Is this going to cause problems/user confusion, having the possibility to publish in two places (workflow and module, e.g. an nf-core module supplies a path and then the workflow supplies another path)?

Author

The idea is that the workflow can overwrite the publish targets defined by processes/subworkflows. And if we can agree on a convention, you might not need to overwrite anything in the first place.

But I admit I'm not 100% sold on the publish: section for processes. It's more "modular", but if it ends up being overwritten most of the time then it's not very useful.
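
For example, reusing CAT_FASTQ from above (the target name in the override is illustrative):

    // module default
    publish:
    reads >> 'cat/fastq'

    // calling workflow: this mapping takes precedence over the module default
    publish:
    CAT_FASTQ.out.reads >> 'preprocessing/merged_fastq'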

@@ -782,6 +782,127 @@ workflow RNASEQ {
ch_multiqc_report = MULTIQC.out.report
}

publish:
QUANTIFY_STAR_SALMON.out.results >> 'star_salmon/'
Member

I think this is going to be annoying for users/developers, not knowing where a publish path is defined when trying to debug why their file is being written to another location.

Author

The main principle is that callers overwrite callees. But also, I think we will extend the inspect command to show the resolved publish targets for the entire pipeline, so that it is clearer.
