fix: for nonpareil, use pigz and pbzip2 and auto infer of -X (#1776)

### Description

- Auto infer `-X` value
- use `pigz` and `pbzip2`

### QC

* [x] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tools purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).
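
As a rough illustration of the `-X` inference named in the description, the standalone sketch below mirrors the arithmetic in the `wrapper.py` change of this commit. The function name `infer_sample_size` and its parameters are hypothetical (the wrapper does this inline on `snakemake.params`), and the read count assumes four lines per FASTQ record and two lines per record otherwise, i.e. unwrapped FASTA.

```python
def infer_sample_size(total_n_lines: int, in_format: str, alg: str) -> int:
    """Hypothetical helper: derive a value for nonpareil's -X from a line count."""
    # Assumption: 4 lines per FASTQ record, 2 lines per record otherwise
    # (single-line FASTA); multi-line FASTA would be overcounted.
    total_n_reads = total_n_lines / 4 if in_format == "fastq" else total_n_lines / 2
    # Sample just under 10% of the reads, but never fewer than one.
    sample_n_reads = max(1, int(total_n_reads * 0.1) - 1)
    # Cap at 1000 for the alignment algorithm and 10000 otherwise,
    # the same caps the commit applies.
    cap = 1000 if alg == "alignment" else 10000
    return min(cap, sample_n_reads)


if __name__ == "__main__":
    examples = [
        (8, "fastq", "kmer"),
        (400, "fastq", "kmer"),
        (4_000_000, "fastq", "alignment"),
    ]
    for n_lines, fmt, alg in examples:
        print(n_lines, fmt, alg, "->", infer_sample_size(n_lines, fmt, alg))
```

For small inputs the `max(1, ...)` floor dominates, while for large inputs the per-algorithm cap does, so the inferred value only tracks the read count in between.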
snakemake · Oct 30, 2023 · 45860bf
1 parent 66940e3
Showing 4 changed files with 30 additions and 5 deletions.
diff --git a/bio/nonpareil/infer/environment.yaml b/bio/nonpareil/infer/environment.yaml
@@ -4,4 +4,6 @@ channels:
   - nodefaults
 dependencies:
   - nonpareil =3.4.1
+  - pigz
+  - pbzip2
   - snakemake-wrapper-utils =0.6.2
diff --git a/bio/nonpareil/infer/meta.yaml b/bio/nonpareil/infer/meta.yaml
@@ -12,6 +12,7 @@ output:
   - log: log of internal Nonpareil processing.
 params:
   - alg: nonpareil algorithm, either `kmer` or `alignment` (mandatory).
-  - extra: additional program arguments
+  - infer_X: automatically infer value of `-X` (couple of minutes slower to count number of reads)
+  - extra: additional program arguments (not `-X` if infer_X == True)
 notes: |
   * For a PDF version of the manual, see https://nonpareil.readthedocs.io/_/downloads/en/latest/pdf/
diff --git a/bio/nonpareil/infer/test/Snakefile b/bio/nonpareil/infer/test/Snakefile
@@ -10,7 +10,8 @@ rule nonpareil:
         "logs/{sample}.log",
     params:
         alg="kmer",
-        extra="-X 1 -k 3 -F",
+        infer_X=True,
+        extra="-k 3 -F",
     threads: 2
     resources:
         mem_mb=50,

diff --git a/bio/nonpareil/infer/wrapper.py b/bio/nonpareil/infer/wrapper.py
@@ -15,7 +15,11 @@
 uncomp = ""
 in_name, in_ext = path.splitext(snakemake.input[0])
 if in_ext in [".gz", ".bz2"]:
-    uncomp = "zcat" if in_ext == ".gz" else "bzcat"
+    uncomp = (
+        f"pigz --processes {snakemake.threads} --decompress --stdout"
+        if in_ext == ".gz"
+        else f"pbzip2 -p{snakemake.threads} --decompress --stdout"
+    )
     in_name, in_ext = path.splitext(in_name)
 
 # Infer output format
@@ -48,10 +52,27 @@
 
 
 with tempfile.NamedTemporaryFile() as tmp:
-    in_uncomp = snakemake.input[0]
     if uncomp:
         in_uncomp = tmp.name
-        shell("{uncomp} {snakemake.input[0]} > {in_uncomp}")
+        shell("{uncomp} {snakemake.input[0]} > {tmp.name}")
+    else:
+        in_uncomp = snakemake.input[0]
+
+    # Auto infer -X value
+    if snakemake.params.get("infer_X", True):
+        # Get total number of lines
+        total_n_lines = sum(1 for line in open(in_uncomp, "rb"))
+        # Get total number of reads (depends on format)
+        total_n_reads = total_n_lines / 4 if in_format == "fastq" else total_n_lines / 2
+        # Get total number of reads to sample
+        sample_n_reads = max(1, int(total_n_reads * 0.1) - 1)
+        # Get total number of reads to sample, depending on defaults
+        sample_n_reads = (
+            min(1000, sample_n_reads)
+            if snakemake.params.alg == "alignment"
+            else min(10000, sample_n_reads)
+        )
+        extra += f" -X {sample_n_reads}"
 
     shell(
         "nonpareil"