docs: checkpoint documentation #1562

Merged (4 commits), Apr 25, 2022
83 changes: 60 additions & 23 deletions docs/snakefiles/rules.rst
Further, an output file marked as ``temp`` is deleted after all rules that use it as input are completed.
shell:
"somecommand {input} {output}"

.. _snakefiles-directory_output:

Directories as outputs
----------------------

This way, the DAG becomes conditional on some produced data.

It is also possible to use checkpoints for cases where the output files are unknown before execution.
Consider the following example where an arbitrary number of files is generated by a rule before being aggregated:

.. code-block:: python

    # a target rule to define the desired final output
    rule all:
        input:
            "aggregated.txt"


    # the checkpoint that shall trigger re-evaluation of the DAG;
    # a number of files is created in a defined directory
    checkpoint somestep:
        output:
            directory("my_directory/")
        shell:
            "mkdir my_directory/; "
            "for i in 1 2 3; do touch my_directory/$i.txt; done"


    # input function for rule aggregate, returning paths to all files
    # produced by the checkpoint 'somestep'
    def aggregate_input(wildcards):
        checkpoint_output = checkpoints.somestep.get(**wildcards).output[0]
        return expand(
            "my_directory/{i}.txt",
            i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i,
        )


    rule aggregate:
        input:
            aggregate_input
        output:
            "aggregated.txt"
        shell:
            "cat {input} > {output}"

Because the number of output files is unknown beforehand, the checkpoint only defines an output :ref:`directory <snakefiles-directory_output>`.
This time, instead of explicitly writing

.. code-block:: python

    checkpoints.somestep.get(sample=wildcards.sample).output[0]

we use the shorthand

.. code-block:: python

    checkpoints.somestep.get(**wildcards).output[0]

which automatically unpacks the wildcards as keyword arguments (this is standard Python argument unpacking).
If the checkpoint has not yet been executed, accessing ``checkpoints.somestep.get(**wildcards)`` ensures that Snakemake records the checkpoint as a direct dependency of the rule ``aggregate``.
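As a plain-Python illustration of that unpacking (this is not the Snakemake API: ``get`` below is a hypothetical stand-in, and the real ``wildcards`` object is a namedlist rather than a dict), the explicit and shorthand call styles are equivalent:

.. code-block:: python

    # Hypothetical stand-in for a checkpoint's get() method, for illustration only.
    def get(sample):
        return f"clustering/{sample}"

    # Snakemake passes rule wildcards as an object; a plain dict shows the idea.
    wildcards = {"sample": "a"}

    explicit = get(sample=wildcards["sample"])  # spell out each keyword
    unpacked = get(**wildcards)                 # let ** forward them all
    assert explicit == unpacked == "clustering/a"

With ``**``, the function receives every wildcard as a keyword argument, so the input function does not need to know which wildcards the rule uses.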
Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed.
Here, we retrieve the values of the wildcard ``i`` based on all files named ``{i}.txt`` in the output directory of the checkpoint.
Because the wildcard ``i`` is evaluated only after completion of the checkpoint, it is necessary to declare the checkpoint's output with ``directory``, instead of using the full wildcard pattern as output.
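To sketch what this retrieval of ``i`` amounts to, the following plain Python is a simplified, hypothetical stand-in for ``glob_wildcards("{i}.txt")`` (the real Snakemake function handles arbitrary patterns and nested directories) that collects the wildcard values from file names in a directory:

.. code-block:: python

    import os
    import re
    import tempfile

    # Simplified stand-in for glob_wildcards, for illustration only:
    # collect the values of ``i`` from files named <i>.txt.
    def values_of_i(directory):
        values = []
        for name in sorted(os.listdir(directory)):
            match = re.fullmatch(r"(.+)\.txt", name)
            if match:
                values.append(match.group(1))
        return values

    # Simulate the checkpoint's output directory after it has completed.
    with tempfile.TemporaryDirectory() as checkpoint_output:
        for i in (1, 2, 3):
            open(os.path.join(checkpoint_output, f"{i}.txt"), "w").close()
        print(values_of_i(checkpoint_output))  # ['1', '2', '3']

Since the directory contents only exist once the checkpoint has run, this lookup can only happen during the re-evaluation of the input function.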

A more practical example building on the previous one is a clustering process with an unknown number of clusters for different samples, where each cluster shall be saved into a separate file.
In this example, the clusters are processed by an intermediate rule before being aggregated:

.. code-block:: python

        shell:
            "cat {input} > {output}"

Here, a new directory will be created for each sample by the checkpoint.
After completion of the checkpoint, the ``aggregate_input`` function is re-evaluated as before.
This time, the values of the wildcard ``i`` are used to expand the pattern ``"post/{sample}/{i}.txt"``, such that the rule ``intermediate`` is executed for each of the determined clusters.
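As a rough illustration of this expansion step, a simplified, hypothetical stand-in for Snakemake's ``expand`` (the real function additionally supports format-specific features such as ``allow_missing``) could look like this:

.. code-block:: python

    from itertools import product

    # Minimal stand-in for expand(), for illustration only: fill a pattern
    # with every combination of the given wildcard values.
    def expand_like(pattern, **wildcards):
        lists = {k: v if isinstance(v, list) else [v] for k, v in wildcards.items()}
        keys = list(lists)
        return [pattern.format(**dict(zip(keys, combo)))
                for combo in product(*(lists[k] for k in keys))]

    print(expand_like("post/{sample}/{i}.txt", sample="a", i=["1", "2", "3"]))
    # ['post/a/1.txt', 'post/a/2.txt', 'post/a/3.txt']

Each determined value of ``i`` thus yields one concrete input file for the aggregation, and each of those files is in turn produced by one run of the rule ``intermediate``.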


.. _snakefiles-rule-inheritance:
Expand Down