diff --git a/docs/project_info/faq.rst b/docs/project_info/faq.rst index 14084a40b..65c901bde 100644 --- a/docs/project_info/faq.rst +++ b/docs/project_info/faq.rst @@ -48,7 +48,7 @@ For debugging such cases, Snakemake provides the command line flag ``--debug-dag In addition, it is advisable to check whether certain intermediate files would be created by targetting them individually via the command line. -Finally, it is possible to constrain the rules that are considered for DAG creating via ``--allowed-rules``. +Finally, it is possible to constrain the rules that are considered for DAG creation via ``--allowed-rules``. This way, you can easily check rule by rule whether it does what you expect. However, note that ``--allowed-rules`` is only meant for debugging. A workflow should always work fine without it. @@ -285,7 +285,7 @@ This will cause Snakemake to re-run all jobs of that rule and everything downstr How should Snakefiles be formatted? -------------------------------------- -To ensure readability and consistency, you can format Snakefiles with our tool `snakefmt `_. +To ensure readability and consistency, you can format Snakefiles with our tool `snakefmt `_. Python code gets formatted with `black `_ and Snakemake-specific blocks are formatted using similar principles (such as `PEP8 `_). @@ -484,6 +484,8 @@ Snakemake has a kind of "lazy" policy about added input files if their modificat Here, ``snakemake --list-input-changes`` returns the list of output files with changed input files, which is fed into ``-R`` to trigger a re-run. +It is worth mentioning that if the additional input files do not yet exist but can be generated as outputs of other rules, they can be marked as ``missing`` to generate the missing dependencies and re-run the rule (see :ref:`snakefiles-missing-input`). + How do I trigger re-runs for rules with updated code or parameters?
------------------------------------------------------------------- diff --git a/docs/snakefiles/rules.rst b/docs/snakefiles/rules.rst index 56b630bb3..69d2b71b0 100644 --- a/docs/snakefiles/rules.rst +++ b/docs/snakefiles/rules.rst @@ -23,7 +23,7 @@ The name is optional and can be left out, creating an anonymous rule. It can als To avoid evaluation and replacement, you have to mask the braces by doubling them, i.e. ``{{input}}``. -Inside the shell command, all local and global variables, especially input and output files can be accessed via their names in the `python format minilanguage `_. +Inside the shell command, all local and global variables, especially input and output files can be accessed via their names in the `python format minilanguage `_. Here, input and output (and in general any list or tuple) automatically evaluate to a space-separated list of files (i.e. ``path/to/inputfile path/to/other/inputfile``). From Snakemake 3.8.0 on, adding the special formatting instruction ``:q`` (e.g. ``"somecommand {input:q} {output:q}")``) will let Snakemake quote each of the list or tuple elements that contains whitespace. @@ -141,7 +141,7 @@ Input files can be Python lists, allowing to easily aggregate over parameters or .. code-block:: python rule aggregate: - input: + input: ["{dataset}/a.txt".format(dataset=dataset) for dataset in DATASETS] output: "aggregated.txt" @@ -158,7 +158,7 @@ The expand function .. code-block:: python rule aggregate: - input: + input: expand("{dataset}/a.txt", dataset=DATASETS) output: "aggregated.txt" @@ -172,7 +172,7 @@ The ``expand`` function also allows us to combine different variables, e.g. .. code-block:: python rule aggregate: - input: + input: expand("{dataset}/a.{ext}", dataset=DATASETS, ext=FORMATS) output: "aggregated.txt" @@ -225,7 +225,7 @@ The multiext function .. code-block:: python rule plot: - input: + input: ... 
output: multiext("some/plot", ".pdf", ".svg", ".png") @@ -284,11 +284,11 @@ Further, a rule can be given a number of threads to use, i.e. .. sidebar:: Note On a cluster node, Snakemake uses as many cores as available on that node. - Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node. + Hence, the number of threads used by a rule never exceeds the number of physically available cores on the node. Note: This behavior is not affected by ``--local-cores``, which only applies to jobs running on the main node. Snakemake can alter the number of cores available based on command line options. Therefore it is useful to propagate it via the built in variable ``threads`` rather than hardcoding it into the shell command. -In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. ``threads = min(threads, cores)`` with ``cores`` being the number of cores specified at the command line (option ``--cores``). +In particular, it should be noted that the specified threads have to be seen as a maximum. When Snakemake is executed with fewer cores, the number of threads will be adjusted, i.e. ``threads = min(threads, cores)`` with ``cores`` being the number of cores specified at the command line (option ``--cores``). Hardcoding a particular maximum number of threads like above is useful when a certain tool has a natural maximum beyond which parallelization won't help to further speed it up. This is often the case, and should be evaluated carefully for production workflows. @@ -401,12 +401,12 @@ There are three **standard resources**, for total memory, disk usage and the tem The ``tmpdir`` resource automatically leads to setting the TMPDIR variable for shell commands, scripts, wrappers and notebooks. 
When defining memory constraints, it is advised to use ``mem_mb``, because some execution modes make direct use of this information (e.g., when using :ref:`Kubernetes `). -Since it would be cumbersome to define such standard resources them for every rule, you can set default values at +Since it would be cumbersome to define such standard resources for every rule, you can set default values at the terminal or in a :ref:`profile `. This works via the command line flag ``--default-resources``, see ``snakemake --help`` for more information. If those resource definitions are mandatory for a certain execution mode, Snakemake will fail with a hint if they are missing. Any resource definitions inside a rule override what has been defined with ``--default-resources``. -If ``--default-resources`` are not specified, Snakemake uses ``'mem_mb=max(2*input.size_mb, 1000)'``, +If ``--default-resources`` are not specified, Snakemake uses ``'mem_mb=max(2*input.size_mb, 1000)'``, ``'disk_mb=max(2*input.size_mb, 1000)'``, and ``'tmpdir=system_tmpdir'``. The latter points to whatever is the default of the operating system or specified by any of the environment variables ``$TMPDIR``, ``$TEMP``, or ``$TMP`` as outlined `here `_. @@ -447,7 +447,7 @@ Note that this is currently implemented for the Google Life Sciences API. GPU Resources ~~~~~~~~~~~~~ -The Google Life Sciences API currently has support for +The Google Life Sciences API currently has support for `NVIDIA GPUs `_, meaning that you can request a number of NVIDIA GPUs explicitly by adding ``nvidia_gpu`` or ``gpu`` to your Snakefile resources for a step: @@ -961,7 +961,7 @@ When using other languages than Python in the notebook, one needs to additionall When using an IDE with built-in Jupyter support, an alternative to ``--edit-notebook`` is ``--draft-notebook``. Instead of firing up a notebook server, ``--draft-notebook`` just creates a skeleton notebook for editing within the IDE.
-In addition, it prints instructions for configuring the IDE's notebook environment to use the interpreter from the +In addition, it prints instructions for configuring the IDE's notebook environment to use the interpreter from the Conda environment defined in the corresponding rule. For example, running @@ -1001,6 +1001,61 @@ Further, an output file marked as ``temp`` is deleted after all rules that use i shell: "somecommand {input} {output}" +.. _snakefiles-missing-input: + +Handling new input files +------------------------ + +Consider a rule that aggregates many input files and whose output already exists from a previous run. If new, not yet existing input files are added to the rule, Snakemake will not automatically trigger the generation of these new files, nor the re-run of the rule. + +A first solution is to ask Snakemake to list input changes and force the affected targets as follows: + +.. code-block:: console + + $ snakemake -n -R `snakemake --list-input-changes` + +Another solution is to mark the input files as ``missing``, so that newly added files are generated and the rule is re-run to update its output. + +Let's consider the following Snakefile: + +.. code-block:: python + + NAMES = config.get("names", "john").split(",") + + rule all: + input: + lambda wildcards: [f"hello-{name}" for name in NAMES] + output: "all.txt" + run: + with open(output[0], "w") as fout: + fout.write("\n".join(input)) + + rule A: + output: touch("hello-{name}") + +A first Snakemake run with the command ``snakemake -j1`` produces the file ``hello-john`` and consequently the file ``all.txt``, which contains the line ``hello-john``. + +A second Snakemake run with the command ``snakemake -j1 --config names=john,doe`` generates neither the file ``hello-doe`` nor an updated ``all.txt``. Moreover, Snakemake warns that the input files have changed.
+ +By marking the input files of the ``all`` rule as ``missing``, the missing (new) input files will be generated via rule ``A``, and then rule ``all`` will be re-run. + + +.. code-block:: python + + NAMES = config.get("names", "john").split(",") + + rule all: + input: + missing(lambda wildcards: [f"hello-{name}" for name in NAMES]) + output: "all.txt" + run: + with open(output[0], "w") as fout: + fout.write("\n".join(input)) + + rule A: + output: touch("hello-{name}") + + Directories as outputs ---------------------- @@ -1422,7 +1477,7 @@ For example snakemake --set-scatter split=2 -would set the number of scatter items for the split process defined above to 2 instead of 8. +would set the number of scatter items for the split process defined above to 2 instead of 8. This allows to adapt parallelization according to the needs of the underlying computing platform and the analysis at hand. .. _snakefiles-grouping: @@ -1585,7 +1640,7 @@ Consider the following example: service("foo.socket") shell: # here we simulate some kind of server process that provides data via a socket - "ln -s /dev/random {output}; sleep 10000" + "ln -s /dev/random {output}; sleep 10000" rule consumer1: @@ -1627,7 +1682,7 @@ This works by combining the service job pattern from above with the :ref:`group- service("foo.{groupid}.socket") shell: # here we simulate some kind of server process that provides data via a socket - "ln -s /dev/random {output}; sleep 10000" + "ln -s /dev/random {output}; sleep 10000" def get_socket(wildcards, groupid): @@ -1657,7 +1712,7 @@ Parameter space exploration --------------------------- The basic Snakemake functionality already provides everything to handle parameter spaces in any way (sub-spacing for certain rules and even depending on wildcard values, the ability to read or generate spaces on the fly or from files via pandas, etc.).
-However, it usually would require some boilerplate code for translating a parameter space into wildcard patterns, and translate it back into concrete parameters for scripts and commands. +However, it usually would require some boilerplate code for translating a parameter space into wildcard patterns, and translate it back into concrete parameters for scripts and commands. From Snakemake 5.31 on (inspired by `JUDI `_), this is solved via the Paramspace helper, which can be used as follows: .. code-block:: python @@ -1672,14 +1727,14 @@ From Snakemake 5.31 on (inspired by `JUDI `_), th rule all: input: # Aggregate over entire parameter space (or a subset thereof if needed) - # of course, something like this can happen anywhere in the workflow (not + # of course, something like this can happen anywhere in the workflow (not # only at the end). expand("results/plots/{params}.pdf", params=paramspace.instance_patterns) rule simulate: output: - # format a wildcard pattern like "alpha~{alpha}/beta~{beta}/gamma~{gamma}" + # format a wildcard pattern like "alpha~{alpha}/beta~{beta}/gamma~{gamma}" # into a file path, with alpha, beta, gamma being the columns of the data frame f"results/simulations/{paramspace.wildcard_pattern}.tsv" params: @@ -1717,35 +1772,35 @@ This workflow will run as follows: [Fri Nov 27 20:57:27 2020] rule simulate: - output: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv - jobid: 4 - wildcards: alpha=2.0, beta=0.0, gamma=3.9 + output: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv + jobid: 4 + wildcards: alpha=2.0, beta=0.0, gamma=3.9 [Fri Nov 27 20:57:27 2020] rule simulate: - output: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv - jobid: 2 - wildcards: alpha=1.0, beta=0.1, gamma=0.99 + output: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv + jobid: 2 + wildcards: alpha=1.0, beta=0.1, gamma=0.99 [Fri Nov 27 20:57:27 2020] rule plot: - input: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv - output: 
results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf - jobid: 3 - wildcards: alpha=2.0, beta=0.0, gamma=3.9 + input: results/simulations/alpha~2.0/beta~0.0/gamma~3.9.tsv + output: results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf + jobid: 3 + wildcards: alpha=2.0, beta=0.0, gamma=3.9 [Fri Nov 27 20:57:27 2020] rule plot: - input: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv - output: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf - jobid: 1 - wildcards: alpha=1.0, beta=0.1, gamma=0.99 + input: results/simulations/alpha~1.0/beta~0.1/gamma~0.99.tsv + output: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf + jobid: 1 + wildcards: alpha=1.0, beta=0.1, gamma=0.99 [Fri Nov 27 20:57:27 2020] localrule all: - input: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf, results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf + input: results/plots/alpha~1.0/beta~0.1/gamma~0.99.pdf, results/plots/alpha~2.0/beta~0.0/gamma~3.9.pdf jobid: 0 @@ -2006,7 +2061,7 @@ Template rendering rules may only have a single input and output file. The template_engine instruction has to be specified at the end of the rule. The template itself has access to ``params``, ``wildcards``, and ``config``, -which are the same objects you can use for example in the ``shell`` or ``run`` directive, +which are the same objects you can use for example in the ``shell`` or ``run`` directive, and the same objects as can be accessed from ``script`` or ``notebook`` directives (but in the latter two cases they are stored behind the ``snakemake`` object which serves as a dedicated namespace to avoid name clashes). An example Jinja2 template could look like this: @@ -2044,4 +2099,4 @@ Analogously to the jinja2 case YTE has access to ``params``, ``wildcards``, and - b - ?config["threshold"] -Template rendering rules are always executed locally, without submission to cluster or cloud processes (since templating is usually not resource intensive). 
\ No newline at end of file +Template rendering rules are always executed locally, without submission to cluster or cloud processes (since templating is usually not resource intensive). diff --git a/snakemake/dag.py b/snakemake/dag.py index d768728bb..1ef3c00df 100755 --- a/snakemake/dag.py +++ b/snakemake/dag.py @@ -996,7 +996,12 @@ def update_needrun(job): output_mintime_ = output_mintime.get(job) if output_mintime_: updated_input = [ - f for f in job.input if f.exists and f.is_newer(output_mintime_) + f + for f in job.input + if ( + (f.exists and f.is_newer(output_mintime_)) + or (not f.exists and is_flagged(f, "missing")) + ) ] reason.updated_input.update(updated_input) if noinitreason and reason: diff --git a/snakemake/io.py b/snakemake/io.py index f6772cdb3..71da0c2a5 100755 --- a/snakemake/io.py +++ b/snakemake/io.py @@ -987,6 +987,13 @@ def ancient(value): return flag(value, "ancient") +def missing(value): + """ + A flag to mark input files that may not yet exist: missing input files will be generated first, and then the flagged rule is re-run. + """ + return flag(value, "missing") + + def directory(value): """ A flag to specify that output is a directory, rather than a file or named pipe.
diff --git a/snakemake/rules.py b/snakemake/rules.py index 45730db1c..6649510d1 100644 --- a/snakemake/rules.py +++ b/snakemake/rules.py @@ -20,6 +20,7 @@ _IOFile, protected, temp, + missing, dynamic, Namedlist, AnnotatedString, @@ -774,6 +775,7 @@ def _apply_wildcards( for name, item in olditems._allitems(): start = len(newitems) is_unpack = is_flagged(item, "unpack") + is_missing = is_flagged(item, "missing") _is_callable = is_callable(item) if _is_callable: @@ -831,6 +833,10 @@ if from_callable and apply_path_modifier and not incomplete: item_ = self.apply_path_modifier(item_, property=property) + # Forward the missing flag if necessary + if is_missing: + item_ = missing(item_) + concrete = concretize(item_, wildcards, _is_callable) newitems.append(concrete) if mapping is not None: diff --git a/snakemake/workflow.py b/snakemake/workflow.py index 12de6fd2d..933b68ead 100644 --- a/snakemake/workflow.py +++ b/snakemake/workflow.py @@ -41,6 +41,7 @@ temp, temporary, ancient, + missing, directory, expand, dynamic, diff --git a/tests/common.py b/tests/common.py index ef1532eaf..965ed1bbe 100644 --- a/tests/common.py +++ b/tests/common.py @@ -98,6 +98,7 @@ def run( snakefile="Snakefile", subpath=None, no_tmpdir=False, + tmpdir=None, check_md5=True, check_results=True, cores=3, @@ -134,9 +135,10 @@ ), "{} does not exist".format(results_dir) # If we need to further check results, we won't cleanup tmpdir - tmpdir = next(tempfile._get_candidate_names()) - tmpdir = os.path.join(tempfile.gettempdir(), "snakemake-%s" % tmpdir) - os.mkdir(tmpdir) + if not tmpdir: + tmpdir = next(tempfile._get_candidate_names()) + tmpdir = os.path.join(tempfile.gettempdir(), "snakemake-%s" % tmpdir) + os.mkdir(tmpdir) config = dict(config) diff --git a/tests/test_update_input/Snakefile b/tests/test_update_input/Snakefile new file mode 100644 index 000000000..7c3fe530e --- /dev/null +++ b/tests/test_update_input/Snakefile @@ -0,0 +1,34 @@ +rule all: + input:
"A1.txt", "A2.txt" + +rule A: + input: "A{index}.tmp" + output: "A{index}.txt" + shell: "cp {input} {output}" + + +rule A_TMP_1: + input: + "B-fred.txt" + + output: + temp("A1.tmp") + + run: + f = open(output[0], "w") + f.write(' '.join(input) + "\n") + +rule A_TMP_2: + input: + missing(lambda wildcards: [rules.B.output[0].format(name=name) + for name in config.get("names", "john").split(",")]) + output: + temp("A2.tmp") + + run: + f = open(output[0], "w") + f.write(' '.join(input) + "\n") + +rule B: + output: + touch("B-{name}.txt") diff --git a/tests/test_update_input/expected-results/A1.txt b/tests/test_update_input/expected-results/A1.txt new file mode 100644 index 000000000..934266422 --- /dev/null +++ b/tests/test_update_input/expected-results/A1.txt @@ -0,0 +1 @@ +B-fred.txt diff --git a/tests/test_update_input/expected-results/A2.txt b/tests/test_update_input/expected-results/A2.txt new file mode 100644 index 000000000..55da7d815 --- /dev/null +++ b/tests/test_update_input/expected-results/A2.txt @@ -0,0 +1 @@ +B-john.txt B-doe.txt diff --git a/tests/test_update_input/expected-results/B-doe.txt b/tests/test_update_input/expected-results/B-doe.txt new file mode 100644 index 000000000..e69de29bb diff --git a/tests/test_update_input/expected-results/B-fred.txt b/tests/test_update_input/expected-results/B-fred.txt new file mode 100644 index 000000000..e69de29bb diff --git a/tests/test_update_input/expected-results/B-john.txt b/tests/test_update_input/expected-results/B-john.txt new file mode 100644 index 000000000..e69de29bb diff --git a/tests/tests.py b/tests/tests.py index 329162eb2..538149cdd 100644 --- a/tests/tests.py +++ b/tests/tests.py @@ -1534,3 +1534,35 @@ def test_groupid_expand_cluster(): @skip_on_windows def test_service_jobs(): run(dpath("test_service_jobs"), check_md5=False) + + +def test_update_input(): + try: + # First run + tmpdir = run(dpath("test_update_input"), cleanup=False, check_results=False) + a1_txt = os.path.join(tmpdir, "A1.txt") + 
a2_txt = os.path.join(tmpdir, "A2.txt") + john_txt = os.path.join(tmpdir, "B-john.txt") + mtime_a1_txt = os.path.getmtime(a1_txt) + mtime_a2_txt = os.path.getmtime(a2_txt) + mtime_john_txt = os.path.getmtime(john_txt) + + # Prepare the update run with new values in the input function of rule A_TMP_2 + shutil.rmtree(os.path.join(tmpdir, "expected-results")) + shutil.rmtree(os.path.join(tmpdir, ".snakemake")) + run( + dpath("test_update_input"), + config={"names": "john,doe"}, + cores=1, + tmpdir=tmpdir, + cleanup=False, + ) + + # Check that A1.txt is left untouched. + assert os.path.getmtime(a1_txt) == mtime_a1_txt + # Check that A2.txt has been regenerated + assert os.path.getmtime(a2_txt) > mtime_a2_txt + # Check that B-john.txt is left untouched + assert os.path.getmtime(john_txt) == mtime_john_txt + finally: + shutil.rmtree(tmpdir)
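For reviewers, the flag plumbing this diff relies on (``flag``/``is_flagged`` in ``snakemake/io.py`` and the ``update_needrun`` check in ``snakemake/dag.py``) can be illustrated in isolation. The following is a simplified, hypothetical model — the real flags live on ``_IOFile``/``AnnotatedString`` objects and the real check also compares modification times — shown only to clarify why a non-existent but ``missing``-flagged input counts as updated input and triggers a re-run:

```python
import os

class AnnotatedString(str):
    """A str subclass that carries a dict of flags (simplified model)."""
    def __new__(cls, value):
        obj = super().__new__(cls, value)
        # copy flags if the value is itself already annotated
        obj.flags = dict(getattr(value, "flags", {}))
        return obj

def flag(value, flag_type, flag_value=True):
    # flags attach to each element of a list/tuple individually
    if isinstance(value, (list, tuple)):
        return [flag(v, flag_type, flag_value) for v in value]
    value = AnnotatedString(value)
    value.flags[flag_type] = flag_value
    return value

def is_flagged(value, flag_type):
    return bool(getattr(value, "flags", {}).get(flag_type, False))

def missing(value):
    # analogous to the missing() helper added in this diff
    return flag(value, "missing")

def counts_as_updated(path):
    # simplified version of the updated_input condition in update_needrun:
    # an existing (newer) file, or a non-existent file flagged as missing
    return os.path.exists(path) or is_flagged(path, "missing")

inputs = [missing("no/such/hello-doe"), "no/such/hello-john"]
print([counts_as_updated(f) for f in inputs])  # → [True, False]
```

Because the flag travels with the string value, forwarding it after wildcard substitution (the ``if is_missing: item_ = missing(item_)`` step in ``rules.py``) is what keeps concretized input paths flagged.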