From 4cbfb4786a729a0c899a0a3e0427c1c1f0796c15 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Gr=C3=A9goire=20Denay?=
Date: Mon, 25 Apr 2022 12:13:16 +0200
Subject: [PATCH] docs: checkpoint documentation (#1562)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* first draft at modifying the documentation

* reworked, fix typo, link to directory article

* Update docs/snakefiles/rules.rst

* Update docs/snakefiles/rules.rst

Co-authored-by: Johannes Köster
---
 docs/snakefiles/rules.rst | 84 ++++++++++++++++++++++++++++-----------
 1 file changed, 61 insertions(+), 23 deletions(-)

diff --git a/docs/snakefiles/rules.rst b/docs/snakefiles/rules.rst
index 80efe2175..972d40318 100644
--- a/docs/snakefiles/rules.rst
+++ b/docs/snakefiles/rules.rst
@@ -999,6 +999,8 @@ Further, an output file marked as ``temp`` is deleted after all rules that use i
     shell:
         "somecommand {input} {output}"
 
+.. _snakefiles-directory_output:
+
 Directories as outputs
 ----------------------
 
@@ -1865,8 +1867,62 @@ Instead, the output file will be opened, and depending on its contents either ``
 This way, the DAG becomes conditional on some produced data.
 
 It is also possible to use checkpoints for cases where the output files are unknown before execution.
-A typical example is a clustering process with an unknown number of clusters, where each cluster shall be saved into a separate file.
-Consider the following example:
+Consider the following example, where an arbitrary number of files is generated by a rule before being aggregated:
+
+.. code-block:: python
+
+    # a target rule to define the desired final output
+    rule all:
+        input:
+            "aggregated.txt"
+
+
+    # the checkpoint that shall trigger re-evaluation of the DAG;
+    # a number of files is created in a defined directory
+    checkpoint somestep:
+        output:
+            directory("my_directory/")
+        shell:
+            "mkdir my_directory/;"
+            "for i in 1 2 3; do touch my_directory/$i.txt; done"
+
+
+    # input function for rule aggregate, returns paths to all files produced by the checkpoint 'somestep'
+    def aggregate_input(wildcards):
+        checkpoint_output = checkpoints.somestep.get(**wildcards).output[0]
+        return expand("my_directory/{i}.txt",
+                      i=glob_wildcards(os.path.join(checkpoint_output, "{i}.txt")).i)
+
+
+    rule aggregate:
+        input:
+            aggregate_input
+        output:
+            "aggregated.txt"
+        shell:
+            "cat {input} > {output}"
+
+Because the number of output files is unknown beforehand, the checkpoint only defines an output :ref:`directory <snakefiles-directory_output>`.
+This time, instead of explicitly writing
+
+.. code-block:: python
+
+    checkpoints.somestep.get(sample=wildcards.sample).output[0]
+
+we use the shorthand
+
+.. code-block:: python
+
+    checkpoints.somestep.get(**wildcards).output[0]
+
+which automatically unpacks the wildcards as keyword arguments (this is standard Python argument unpacking).
+If the checkpoint has not yet been executed, accessing ``checkpoints.somestep.get(**wildcards)`` ensures that Snakemake records the checkpoint as a direct dependency of the rule ``aggregate``.
+Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed.
+Here, we retrieve the values of the wildcard ``i`` based on all files named ``{i}.txt`` in the output directory of the checkpoint.
+Because the wildcard ``i`` is evaluated only after completion of the checkpoint, it is necessary to declare the checkpoint's output with ``directory``, instead of using the full wildcard pattern as output.
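The ``**`` unpacking used in the shorthand above is plain Python and works outside of Snakemake as well. The sketch below illustrates it with a hypothetical ``get`` function standing in for ``checkpoints.<name>.get``, and approximates the wildcard extraction performed by ``glob_wildcards`` with a regular expression (both names here are stand-ins, not Snakemake API):

```python
import re

# Hypothetical stand-in for `checkpoints.<name>.get`: accepts
# wildcard values as keyword arguments and returns them unchanged.
def get(**wildcard_values):
    return wildcard_values

# Snakemake's `wildcards` object behaves like a mapping; a plain
# dict is used here for illustration.
wildcards = {"sample": "A"}

# Writing the wildcard explicitly ...
explicit = get(sample=wildcards["sample"])
# ... is equivalent to unpacking the whole mapping with `**`:
unpacked = get(**wildcards)
assert explicit == unpacked == {"sample": "A"}

# glob_wildcards-style extraction, approximated with a regex: the
# pattern "{i}.txt" matches each filename and captures the value of i.
files = ["1.txt", "2.txt", "3.txt"]
i_values = [m.group("i") for f in files
            if (m := re.fullmatch(r"(?P<i>.+)\.txt", f))]
assert i_values == ["1", "2", "3"]
```

The key point is that ``get(**wildcards)`` and ``get(sample=wildcards["sample"])`` are the same call once the mapping is unpacked, which is why the shorthand works for any set of wildcards.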
+
+A more practical example, building on the previous one, is a clustering process with an unknown number of clusters for different samples, where each cluster shall be saved into a separate file.
+In this example, the clusters are processed by an intermediate rule before being aggregated:
 
 .. code-block:: python
 
@@ -1914,27 +1970,9 @@ Consider the following example:
     shell:
         "cat {input} > {output}"
 
-Here, our checkpoint simulates a clustering.
-We pretend that the number of clusters is unknown beforehand.
-Hence, the checkpoint only defines an output ``directory``.
-The rule ``aggregate`` again uses the ``checkpoints`` object to retrieve the output of the checkpoint.
-This time, instead of explicitly writing
-
-.. code-block:: python
-
-    checkpoints.clustering.get(sample=wildcards.sample).output[0]
-
-we use the shorthand
-
-.. code-block:: python
-
-    checkpoints.clustering.get(**wildcards).output[0]
-
-which automatically unpacks the wildcards as keyword arguments (this is standard python argument unpacking).
-If the checkpoint has not yet been executed, accessing ``checkpoints.clustering.get(**wildcards)`` ensure that Snakemake records the checkpoint as a direct dependency of the rule ``aggregate``.
-Upon completion of the checkpoint, the input function is re-evaluated, and the code beyond its first line is executed.
-Here, we retrieve the values of the wildcard ``i`` based on all files named ``{i}.txt`` in the output directory of the checkpoint.
-These values are then used to expand the pattern ``"post/{sample}/{i}.txt"``, such that the rule ``intermediate`` is executed for each of the determined clusters.
+Here, a new directory will be created for each sample by the checkpoint.
+After completion of the checkpoint, the ``aggregate_input`` function is re-evaluated as before.
+This time, the values of the wildcard ``i`` are used to expand the pattern ``"post/{sample}/{i}.txt"``, such that the rule ``intermediate`` is executed for each of the determined clusters.
 
 .. _snakefiles-rule-inheritance: