
Workflow output definition #1227

Draft · wants to merge 15 commits into base: dev

Conversation

bentsherman

This PR adds a workflow output definition based on nextflow-io/nextflow#4784. I'm still working through the pipeline, but once I'm done, I will have completely replaced publishDir using the output DSL.

See also nf-core/fetchngs#275 for ongoing discussion
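
Roughly, the idea is to replace the per-process publishDir directives with a single top-level output block along these lines (the exact syntax is still evolving in the linked Nextflow PR):

    output {
        directory params.outdir
        mode      params.publish_dir_mode
    }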


github-actions bot commented Feb 28, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 783ff86

+| ✅ 170 tests passed       |+
#| ❔   7 tests were ignored |#
!| ❗   7 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!

❔ Tests ignored:

  • files_exist - File is ignored: conf/modules.config
  • nextflow_config - Config default ignored: params.ribo_database_manifest
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore or pyproject.toml
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/rnaseq/rnaseq/.github/workflows/awstest.yml
  • multiqc_config - 'assets/multiqc_config.yml' not found

✅ Tests passed:

Run details

  • nf-core/tools version 2.13
  • Run at 2024-02-28 22:32:03

@bentsherman changed the title from Add output definition to Workflow output DSL · Feb 28, 2024
@pinin4fjords
Member

Liking this. If we really need the output block (rather than doing something with emit), this is a nice readable way of doing it.

@adamrtalbot
Contributor

This is beginning to look great. All the publishing logic is in one location, easy to review and understand where it's coming from. There are two downsides to this approach:

  1. You need to track back to the channel to find what's in there, which could be a little tricky.
  2. It's quite verbose (there's a lot of text in one place). But then I would prefer explicit and verbose to implicit and concise.

@maxulysse
Member


Agreeing with Adam: it's a bit too implicit, especially regarding what is a path and what is a topic.

@bentsherman
Author

In the Nextflow PR there are some docs which explain the feature in more detail. Unfortunately the deploy preview isn't working, so you'll have to look at the diff.

You need to track back to the channel to find what's in there, which could be a little tricky.

Indeed, this is the downside of selecting channels instead of processes: more flexible, but more layers of indirection. We should be able to alleviate this with IDE tooling, i.e. hover over a selected channel to see its definition.

If we really need the output block (rather than doing something with emit), this is a nice readable way of doing it

Thanks @pinin4fjords, I never responded to your idea about putting everything in the emit section, but basically I think that would be way too cumbersome; imagine trying to fit the rnaseq outputs into the emits 😅

The main question now is how to bring the outputs for PREPARE_GENOME and RNASEQ up to the top-level workflow. I was thinking of some kind of include statement, otherwise we would have to pass a LOT of channels up through emits and/or topics.

@bentsherman
Author

The current prototype simply maps the output channels to the publish directory structure, but we still need to get these outputs to the top level whereas currently they are nested under NFCORE_RNASEQ:...

Before I go off and add a gajillion channels to the emit section, I'd like to see if I can simplify things with topics.

@adamrtalbot @pinin4fjords @maxulysse @ewels Since you guys understand this pipeline better than me, I'm wondering, how would you group all of these outputs if you could group them any way you want? You are no longer restricted to process selectors or directory names, but you could use those if you wanted.

For example, I see the modules config for RNASEQ is grouped with these comments:

  • STAR Salmon alignment
  • General alignment
  • bigwig coverage
  • DESeq2 QC
  • Pseudo-alignment

Would those be good top-level groupings for outputs? Then you might have topics called align-star-salmon, align, bigwig, deseq2, etc. Or would you organize it differently?

@bentsherman
Author

I managed to move everything to the top-level workflow, so it should be executable now (though there are likely some bugs, will test tomorrow).

I ended up using topics for everything, using the various publish directories to guide the topic names. Hope this gives you a more concrete sense of how topics are useful.

The topics don't really reduce the amount of code; they just split it between the output DSL and the workflow topic: section. In a weird way, this provides some modularity, since workflows can define an ontology of topics which can in turn be used by the output DSL for publishing.
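
To make that concrete, the pattern looks roughly like this (the workflow and process names here are just placeholders):

    // in a (sub)workflow: declare which channels go to which topic
    workflow MY_SUBWORKFLOW {
        main:
        // ...
        topic:
        SOME_PROCESS.out.log >> 'align-star-log'
    }

    // in the output definition: publish each topic to a directory
    output {
        directory params.outdir

        "${params.aligner}" {
            'log' {
                from 'align-star-log'
            }
        }
    }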

@pinin4fjords
Member

As Evan mentioned on Slack, this does seem very verbose:

    QUANTIFY_STAR_SALMON.out.results                        >> 'align'
    QUANTIFY_STAR_SALMON.out.tpm_gene                       >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_gene                    >> 'align'
    QUANTIFY_STAR_SALMON.out.lengths_gene                   >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_gene_length_scaled      >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_gene_scaled             >> 'align'
    QUANTIFY_STAR_SALMON.out.tpm_transcript                 >> 'align'
    QUANTIFY_STAR_SALMON.out.counts_transcript              >> 'align'
    QUANTIFY_STAR_SALMON.out.lengths_transcript             >> 'align'

But I understand why, since if even one of the outputs from a process needs to go to a different topic then you can't use the multi-channel object QUANTIFY_STAR_SALMON.out.

Rather than doing this from the calling workflow, could e.g. QUANTIFY_STAR_SALMON use a topic as part of its emit, to 'suggest' a classification for that channel?

emit:
    results = ch_pseudo_results, topic = 'tables'                             

Then, if we all used good standards (e.g. an ontology for topics for outputs), calling workflows could have very minimal logic for this, relying on what the components said about their outputs. The calling workflow would only need to decide what to do with the topics in its outputs.

@bentsherman
Author

We can definitely move some of these topic mappings into the modules and subworkflows, that was going to be my next step. I also suspect that nf-core will be able to converge on a shared ontology for these things.

I'd still rather keep the topic mapping separate from the emits, though, as we will need the topic: section either way and we're trying to minimize the number of ways to do the same thing.

@bentsherman
Author

I moved most of the topic mappings into their respective subworkflows. It gets tricky when a workflow is used multiple times under different names and with different publish behavior.

For example, QUANTIFY_PSEUDO_ALIGNMENT is used twice in RNASEQ, once as itself and once as the alias QUANTIFY_STAR_SALMON. One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".
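
For reference, the aliasing is just an include ... as (the path below is illustrative):

    include { QUANTIFY_PSEUDO_ALIGNMENT                         } from '../subworkflows/local/quantify_pseudo_alignment'
    include { QUANTIFY_PSEUDO_ALIGNMENT as QUANTIFY_STAR_SALMON } from '../subworkflows/local/quantify_pseudo_alignment'

Both aliases run the same code, so any topic mapping declared inside the subworkflow applies to both.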

I can't set a "sensible default" in the subworkflow because I can't override the default later; I can only specify additional topics. Or I could specify a default and not use it in the output definition for rnaseq, instead re-mapping each alias to different topics as I am currently doing.

However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic: section:

  topic:
  if (!params.skip_alignment && params.aligner == 'star_rsem') {
    DESEQ2_QC_RSEM.out.rdata                >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.pca_txt              >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.pdf                  >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.dists_txt            >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.size_factors         >> 'align-deseq2'
    DESEQ2_QC_RSEM.out.log                  >> 'align-deseq2'
  }

Totally doable, but unfortunate if we have to resort to it.

@adamrtalbot noted in Slack that most Nextflow pipelines don't come close to this level of complexity, so I wouldn't be opposed to moving forward with what we have and letting the rnaseq maintainers sort out the details. Though we do need to address the last point about conditional topic mappings.

@pinin4fjords
Member

I'm loving the principle:

    "${params.aligner}" {
        'log' {
            from 'align-star-log'
        }

        from 'align-star-intermeds'

        'unmapped' {
            from 'align-star-unaligned'
        }
    }

The multi-import thing didn't occur to me. Could we use a variable sent in via the meta or some such to control the topic something gets sent to?

@adamrtalbot
Contributor

adamrtalbot commented Apr 10, 2024

For example, QUANTIFY_PSEUDO_ALIGNMENT is used twice in RNASEQ, once as itself and once as the alias QUANTIFY_STAR_SALMON. One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".

In this case I would manipulate the channel to what I wanted. If I had to use a topic I would use them at the last second. So again, as long as topics are optional I think everything can be handled reasonably well.

However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic: section:

Presumably, if a topic is empty it just doesn't publish anything? So you could add stuff from an empty channel and you would end up with an empty topic. In your example, it would make more sense to fix the rnaseq code so it doesn't rely on lots of if statements, which would end up looking like this:

  topic:
  deseq2_qc_rdata        >> 'align-deseq2'
  deseq2_qc_pca_txt      >> 'align-deseq2'
  deseq2_qc_pdf          >> 'align-deseq2'
  deseq2_qc_dists_txt    >> 'align-deseq2'
  deseq2_qc_size_factors >> 'align-deseq2'
  deseq2_qc_log          >> 'align-deseq2'

Even better, just tidy up the channels before making the topic:

  topic:
  deseq2_qc >> 'align-deseq2'
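
i.e. build that combined channel earlier in the workflow body, something like (a sketch using the channel names above):

    // merge the individual QC outputs into a single channel before the topic mapping
    deseq2_qc = deseq2_qc_rdata
        .mix(deseq2_qc_pca_txt)
        .mix(deseq2_qc_pdf)
        .mix(deseq2_qc_dists_txt)
        .mix(deseq2_qc_size_factors)
        .mix(deseq2_qc_log)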

I think my overall impression is topics are a nice sugar on top of existing channels, in which case most of the key logic should be in the channel manipulations. Topics are a way of turning a local channel into a global one and should do very little else.

One publishes to the folder "${params.aligner}" while the other publishes to "${params.pseudo_aligner}".

That sounds like a bug 😆

@bentsherman
Author

Notes on the latest update:

  • Topics are no longer used. Nextflow simply maintains a global map of channels to "rules" under the hood.

  • The output DSL is no longer a potentially nested directory structure; it's just a flat list of rules. Each rule can specify publish options for channels that are sent to the rule.

  • In principle, the rule name can be anything. In practice, it is convenient to make it the default publish path. If you're happy with that, you don't need to configure anything else and Nextflow will use it as the publish path.

  • Processes and workflows can have a publish: section to define these mappings. A process can map emits to rules; a workflow can map channels to rules.

  • The output DSL is used only to (1) set the output directory, (2) set default publish options like mode, and (3) customize rules as needed.

  • In general, rules need to be customized only when the path should be different or additional options like enabled are needed. If you can align your output directory with the module/workflow defaults, your output definition can be quite short (see fetchngs).

  • If a process maps some emits to some rules and is then invoked by a workflow, the workflow can re-map the process outputs to different rules and overwrite the process defaults, and so on with workflows and subworkflows, etc.

Overall, everything is much more concise and more in line with what many people have suggested: simply annotate the workflows with the publish paths. The output definition is no longer a comprehensive view of all outputs, but there is a degree of modularity, and you can be verbose in the output definition if you want to.
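
Putting those pieces together, the pattern looks roughly like this (the module name, rule name, and params are just placeholders):

    // module side: a publish: section maps emits to named rules;
    // the rule name doubles as the default publish path
    process FOO {
        output:
        path '*.tsv', emit: tables

        publish:
        tables >> 'foo/tables'

        script:
        """
        touch results.tsv
        """
    }

    // pipeline side: the output block only customizes the rules that need it
    output {
        directory params.outdir
        mode      params.publish_dir_mode

        'foo/tables' {
            enabled false   // e.g. turn publishing off for this rule
        }
    }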

@adamrtalbot thanks for your comments, it makes me feel more confident about the prototype. I think all of the remaining TODOs can be addressed by refactoring some dataflow logic, which can be handled by the rnaseq devs.

@pinin4fjords
Member

Really liking the way this is going now, it's going to be very tidy.

Would it be feasible at some point to use some optional dynamism in the modules, to facilitate repeated usage?

    publish:
    ch_orig_bam         >> "star_salmon/intermeds/${meta.publish_suffix}/"

@bentsherman
Author

Would it be feasible at some point to use some optional dynamism in the modules, to facilitate repeated usage?

Maybe in a future iteration. But related to this, we are interested in building on the concept of the samplesheet as a way to hold metadata for file collections in general, and it might be a better practice than trying to encode metadata in the filenames.

For example, Paolo has proposed that we have a method in the output DSL to automatically publish an index file for a given "rule":

output {
  directory 'results'

  'star_salmon/intermeds/' {
    index 'index.csv'
  }
}

star_salmon/intermeds/index.csv would then look like:

sample_id,bam
sample001,results/star_salmon/intermeds/sample001.bam
sample002,results/star_salmon/intermeds/sample002.bam
sample003,results/star_salmon/intermeds/sample003.bam

Of course you could also do this manually like in fetchngs, and I would like to add a stdlib function like mergeCsv to make it easier, but the index method would be a convenient solution for the most common and simple cases. Either way, you can just query the index file instead of inspecting the file names.
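
For example, the manual version could be as simple as this with the existing collectFile operator (the channel shape here is illustrative):

    // build an index CSV from a channel of (meta, bam) pairs
    ch_bam
        .map { meta, bam -> "${meta.id},${bam}\n" }
        .collectFile(name: 'index.csv', seed: 'sample_id,bam\n', storeDir: "${params.outdir}/star_salmon/intermeds")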

@bentsherman changed the title from Workflow output DSL to Workflow publish definition · Apr 24, 2024
@bentsherman
Author

The redirect to null simplifies the top-level publish def somewhat. The remaining rules could also be moved into the workflow defs since they only rename paths. It just might be more verbose since you would have to remap each channel instead of the target name.

It seems like the best delineation for what goes in the top-level publish block vs the workflow publish sections is: the workflows define what is published (including conditional logic), while the top-level publish definition defines how things are published (mode, whether to overwrite, content type, tags, etc.). This is also good for modularity.

Note that some subworkflows are now using params, which is an anti-pattern. For this I recommend passing those params as workflow inputs to keep things modular.

@pinin4fjords
Member

Note that some subworkflows are now using params which is an anti-pattern. For this I recommend passing those params as workflow inputs to keep things modular.

We have been trying to eliminate that when we see it.

Comment on lines +168 to +171
output {
directory params.outdir
mode params.publish_dir_mode
}
Member

Can this be set in nextflow.config @bentsherman?

Author

Not currently, but it should be considered as a future improvement. If we don't need the target-specific config in the output block, it would be better IMO to make these config settings instead of pipeline code + params.

Member

I would prefer this as a scope in the config rather than in the main.nf.

I was also thinking of allowing different modes for files matching different patterns or properties, e.g. large files being symlinked rather than copied, but this is an extra.

Author

I agree that they should be config. It's much less flexible for the user if you have to define a param just to make it configurable.

For now you can customize things like the mode for specific targets, though that is not as granular as what you describe.
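
Something along these lines in nextflow.config would be the natural end state (purely hypothetical syntax, just to make the idea concrete; nothing like this is implemented yet):

    // nextflow.config (hypothetical)
    output {
        directory = params.outdir
        mode      = 'copy'
    }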

@bentsherman changed the title from Workflow publish definition to Workflow ouput definition · May 17, 2024
@bentsherman changed the title from Workflow ouput definition to Workflow output definition · May 17, 2024
@mahesh-panchal (Member) left a comment

I like how it simplifies things quite a bit.

My main issue with this is that the publish path is spread across so many files, which probably means we shouldn't be using this in nf-core in any of the reusable/modular parts, especially if we can't turn anything off without having params for everything.

Paths that need metadata like the sample name are not here, but I guess they're directly accessible in the process?

Comment on lines +27 to +28
publish:
pdf >> 'deseq2'
Member

It was surprising to still see the publish directive in the process block. Why this way, then, instead of:

 output:
    path "*.pdf"                , optional:true, emit: pdf, publish: 'deseq2'

i.e. making publish an extra option for the path output type.

Author

I guess I wanted the publish definition to be in one place so that it's easier to review. But also, I think the separate publish: section will work better with statically typed outputs.


@@ -14,6 +14,9 @@ process CAT_FASTQ {
tuple val(meta), path("*.merged.fastq.gz"), emit: reads
path "versions.yml" , emit: versions

publish:
reads >> 'cat/fastq'
Member

How would a pipeline developer turn this off? If I include this module from nf-core/modules in my own pipeline, but don't want to publish the output from this, what can I do?

Author

You can redirect the output to null:

workflow {
  CAT_FASTQ()
  publish:
  CAT_FASTQ.out >> null
}

And Paolo just added an enabled option so that you can control it from the output block:

output {
  'cat/fastq' {
    enabled false
  }
}

@@ -62,6 +62,16 @@ workflow ALIGN_STAR {
BAM_SORT_STATS_SAMTOOLS ( ch_orig_bam, fasta )
ch_versions = ch_versions.mix(BAM_SORT_STATS_SAMTOOLS.out.versions)

publish:
ch_orig_bam >> (params.save_align_intermeds || params.save_umi_intermeds ? 'star_salmon/' : null)
Member

This is fine for specific pipelines, but it makes module reusability across pipelines more cumbersome. If we need ternary operators in every nf-core module to control whether an output is published, this would make the pipeline schema potentially huge.

Author

I agree, I cut some corners here by not passing the params as workflow inputs. In the final implementation I would do that, so that you can choose whether to expose it as a param in your own pipeline.
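
e.g. something along these lines for ALIGN_STAR (a rough sketch, keeping the prototype publish syntax; the other inputs are elided):

    workflow ALIGN_STAR {
        take:
        reads
        fasta
        save_align_intermeds    // boolean passed in by the caller instead of read from params

        main:
        // ...

        publish:
        ch_orig_bam >> (save_align_intermeds ? 'star_salmon/' : null)
    }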

@@ -26,6 +26,10 @@ workflow FASTQ_ALIGN_HISAT2 {
BAM_SORT_STATS_SAMTOOLS ( HISAT2_ALIGN.out.bam, ch_fasta )
ch_versions = ch_versions.mix(BAM_SORT_STATS_SAMTOOLS.out.versions)

publish:
HISAT2_ALIGN.out.bam >> (params.save_align_intermeds ? 'hisat2/' : null)
Member

Is this going to cause problems/user confusion, having the possibility to publish in two places (workflow and module, e.g. an nf-core module supplies a path and then the workflow supplies another path)?

Author

The idea is that the workflow can overwrite the publish targets defined by processes/subworkflows. And if we can agree on a convention, you might not need to overwrite anything in the first place.

But I admit I'm not 100% sold on the publish: section for processes. It's more "modular", but if it ends up being overwritten most of the time then it's not very useful.
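
For example, reusing CAT_FASTQ from above (the target name in the override is illustrative):

    // module default
    publish:
    reads >> 'cat/fastq'

    // calling workflow: this mapping takes precedence over the module default
    publish:
    CAT_FASTQ.out.reads >> 'preprocessing/merged_fastq'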

@@ -782,6 +782,127 @@ workflow RNASEQ {
ch_multiqc_report = MULTIQC.out.report
}

publish:
QUANTIFY_STAR_SALMON.out.results >> 'star_salmon/'
Member

I think this is going to be annoying for users/developers, not knowing where a publish path is defined when trying to debug why their file is being written to another location.

Author

The main principle is that callers overwrite callees. But also, I think we will extend the inspect command to show the resolved publish targets for the entire pipeline, so that it is clearer.
