From 4cbfb4786a729a0c899a0a3e0427c1c1f0796c15 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gr=C3=A9goire=20Denay?=
Date: Mon, 25 Apr 2022 12:13:16 +0200
Subject: [PATCH] docs: checkpoint documentation (#1562)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* first draft at modifying the documentation

* reworked, fix typo, link to directory article

* Update docs/snakefiles/rules.rst

* Update docs/snakefiles/rules.rst

Co-authored-by: Johannes Köster
---
 docs/snakefiles/rules.rst | 84 ++++++++++++++++++++++++++++-----------
 1 file changed, 61 insertions(+), 23 deletions(-)

diff --git a/docs/snakefiles/rules.rst b/docs/snakefiles/rules.rst
index 80efe2175..972d40318 100644
--- a/docs/snakefiles/rules.rst
+++ b/docs/snakefiles/rules.rst
@@ -999,6 +999,8 @@ Further, an output file marked as ``temp`` is deleted after all rules that use i
     shell:
         "somecommand {input} {output}"
 
+.. _snakefiles-directory_output:
+
 Directories as outputs
 ----------------------
 
@@ -1865,8 +1867,62 @@ Instead, the output file will be opened, and depending on its contents either ``
 This way, the DAG becomes conditional on some produced data.
 
 It is also possible to use checkpoints for cases where the output files are unknown before execution.
-A typical example is a clustering process with an unknown number of clusters, where each cluster shall be saved into a separate file.
-Consider the following example:
+Consider the following example, where an arbitrary number of files is generated by a rule before being aggregated:
+
+.. code-block:: python
+
+    # a target rule to define the desired final output
+    rule all:
+        input:
+            "aggregated.txt"
+
+
+    # the checkpoint that shall trigger re-evaluation of the DAG;
+    # a number of files is created in a defined directory
+    checkpoint somestep:
+        output:
+            directory("my_directory/")
+        shell:
+            "mkdir my_directory/;"
+            "for i in 1 2 3; do touch my_directory/$i.txt; done"
+
+
+    # input function for rule aggregate, returns paths to all files produced by the checkpoint 'somestep'
+    def aggregate_input(wildcards):
+        checkpoint_output = checkpoints.somestep.get(**wildcards).output[0]
+        return expand("my_directory/{i}.txt",
+                      i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)
+
+
+    rule aggregate:
+        input:
+            aggregate_input
+        output:
+            "aggregated.txt"
+        shell:
+            "cat {input} > {output}"
+
+Because the number of output files is unknown beforehand, the checkpoint only defines an output :ref:`directory <snakefiles-directory_output>`.
+This time, instead of explicitly writing
+
+.. code-block:: python
+
+    checkpoints.somestep.get(sample=wildcards.sample).output[0]
+
+we use the shorthand
+
+.. code-block:: python
+
+    checkpoints.somestep.get(**wildcards).output[0]
+
+which automatically unpacks the wildcards as keyword arguments (this is standard Python argument unpacking).
+If the checkpoint has not yet been executed, accessing ``checkpoints.somestep.get(**wildcards)`` ensures that Snakemake records the checkpoint as a direct dependency of the rule ``aggregate``.
+Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed.
+Here, we retrieve the values of the wildcard ``i`` based on all files named ``{i}.txt`` in the output directory of the checkpoint.
+Because the wildcard ``i`` is evaluated only after completion of the checkpoint, it is necessary to declare the checkpoint's output with ``directory``, instead of using the full wildcard pattern as output.
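The ``**`` unpacking used in the shorthand above is plain Python and works outside of Snakemake as well. The sketch below illustrates it with a hypothetical ``get`` function standing in for ``checkpoints.<name>.get``, and approximates the wildcard extraction performed by ``glob_wildcards`` with a regular expression (both names here are stand-ins, not Snakemake API):

```python
import re

# Hypothetical stand-in for `checkpoints.<name>.get`: accepts
# wildcard values as keyword arguments and returns them unchanged.
def get(**wildcard_values):
    return wildcard_values

# Snakemake's `wildcards` object behaves like a mapping; a plain
# dict is used here for illustration.
wildcards = {"sample": "A"}

# Writing the wildcard explicitly ...
explicit = get(sample=wildcards["sample"])
# ... is equivalent to unpacking the whole mapping with `**`:
unpacked = get(**wildcards)
assert explicit == unpacked == {"sample": "A"}

# glob_wildcards-style extraction, approximated with a regex: the
# pattern "{i}.txt" matches each filename and captures the value of i.
files = ["1.txt", "2.txt", "3.txt"]
i_values = [m.group("i") for f in files
            if (m := re.fullmatch(r"(?P<i>.+)\.txt", f))]
assert i_values == ["1", "2", "3"]
```

The key point is that ``get(**wildcards)`` and ``get(sample=wildcards["sample"])`` are the same call once the mapping is unpacked, which is why the shorthand works for any set of wildcards.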
+
+A more practical example, building on the previous one, is a clustering process with an unknown number of clusters for different samples, where each cluster shall be saved into a separate file.
+In this example, the clusters are processed by an intermediate rule before being aggregated:
 
 .. code-block:: python
 
@@ -1914,27 +1970,9 @@ Consider the following example:
     shell:
         "cat {input} > {output}"
 
-Here, our checkpoint simulates a clustering.
-We pretend that the number of clusters is unknown beforehand.
-Hence, the checkpoint only defines an output ``directory``.
-The rule ``aggregate`` again uses the ``checkpoints`` object to retrieve the output of the checkpoint.
-This time, instead of explicitly writing
-
-.. code-block:: python
-
-    checkpoints.clustering.get(sample=wildcards.sample).output[0]
-
-we use the shorthand
-
-.. code-block:: python
-
-    checkpoints.clustering.get(**wildcards).output[0]
-
-which automatically unpacks the wildcards as keyword arguments (this is standard python argument unpacking).
-If the checkpoint has not yet been executed, accessing ``checkpoints.clustering.get(**wildcards)`` ensure that Snakemake records the checkpoint as a direct dependency of the rule ``aggregate``.
-Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed.
-Here, we retrieve the values of the wildcard ``i`` based on all files named ``{i}.txt`` in the output directory of the checkpoint.
-These values are then used to expand the pattern ``"post/{sample}/{i}.txt"``, such that the rule ``intermediate`` is executed for each of the determined clusters.
+Here, a new directory will be created for each sample by the checkpoint.
+After completion of the checkpoint, the ``aggregate_input`` function is re-evaluated as before.
+This time, the values of the wildcard ``i`` are used to expand the pattern ``"post/{sample}/{i}.txt"``, such that the rule ``intermediate`` is executed for each of the determined clusters.
 
 .. _snakefiles-rule-inheritance: