fix: for nonpareil, use pigz and pbzip2 and auto infer of -X (#1776)

### Description

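- Automatically infer the `-X` value from the number of reads in the input
- Use `pigz` and `pbzip2` to decompress `.gz` and `.bz2` input with multiple threads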

### QC

* [x] I confirm that:

For all wrappers added by this PR, 

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed
arbitrarily,
* either the wrapper can only use a single core, or the example rule
contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in
[snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell
what the rule is about or match the tool's purpose or name (e.g.,
`map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best
practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set
automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries
are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on
the wrapped tool,
* temporary files are either written to a unique hidden folder in the
working directory, or (better) stored where the Python function
`tempfile.gettempdir()` points to (see
[here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir);
this also means that using any Python `tempfile` default behavior
works),
* the `meta.yaml` contains a link to the documentation of the respective
tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with
[snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with
[black](https://black.readthedocs.io).
* Conda environments use a minimal number of channels, in the recommended
order. E.g., for bioconda, use conda-forge, bioconda, nodefaults
(conda-forge should have the highest priority, and the defaults channel is
usually not needed because most packages are in conda-forge nowadays).
fgvieira committed Oct 30, 2023
1 parent 66940e3 commit 45860bf
Showing 4 changed files with 30 additions and 5 deletions.
2 changes: 2 additions & 0 deletions bio/nonpareil/infer/environment.yaml
@@ -4,4 +4,6 @@ channels:
   - nodefaults
 dependencies:
   - nonpareil =3.4.1
+  - pigz
+  - pbzip2
   - snakemake-wrapper-utils =0.6.2
3 changes: 2 additions & 1 deletion bio/nonpareil/infer/meta.yaml
@@ -12,6 +12,7 @@ output:
   - log: log of internal Nonpareil processing.
 params:
   - alg: nonpareil algorithm, either `kmer` or `alignment` (mandatory).
-  - extra: additional program arguments
+  - infer_X: automatically infer value of `-X` (couple of minutes slower to count number of reads)
+  - extra: additional program arguments (not `-X` if infer_X == True)
 notes: |
   * For a PDF version of the manual, see https://nonpareil.readthedocs.io/_/downloads/en/latest/pdf/
3 changes: 2 additions & 1 deletion bio/nonpareil/infer/test/Snakefile
@@ -10,7 +10,8 @@ rule nonpareil:
         "logs/{sample}.log",
     params:
         alg="kmer",
-        extra="-X 1 -k 3 -F",
+        infer_X=True,
+        extra="-k 3 -F",
     threads: 2
     resources:
         mem_mb=50,
27 changes: 24 additions & 3 deletions bio/nonpareil/infer/wrapper.py
@@ -15,7 +15,11 @@
 uncomp = ""
 in_name, in_ext = path.splitext(snakemake.input[0])
 if in_ext in [".gz", ".bz2"]:
-    uncomp = "zcat" if in_ext == ".gz" else "bzcat"
+    uncomp = (
+        f"pigz --processes {snakemake.threads} --decompress --stdout"
+        if in_ext == ".gz"
+        else f"pbzip2 -p{snakemake.threads} --decompress --stdout"
+    )
     in_name, in_ext = path.splitext(in_name)

 # Infer output format
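
The extension-based choice of decompressor in the hunk above can be read in isolation as the following minimal sketch. The function name `decompress_command` and its `threads` parameter are hypothetical stand-ins for the wrapper's inline code and `snakemake.threads`; the command strings are copied from the diff.

```python
from os import path


def decompress_command(filename, threads=1):
    """Return a command prefix that writes the decompressed input to stdout.

    Hypothetical helper: the wrapper builds this string inline rather than
    in a function. Returns "" when the input is not .gz or .bz2.
    """
    _, ext = path.splitext(filename)
    if ext == ".gz":
        return f"pigz --processes {threads} --decompress --stdout"
    if ext == ".bz2":
        return f"pbzip2 -p{threads} --decompress --stdout"
    return ""


# Example: a gzipped FASTQ decompressed with 4 threads
print(decompress_command("reads/a.fastq.gz", threads=4))
```

Returning an empty string for uncompressed input mirrors the wrapper's `uncomp = ""` default, so the caller can use a plain truthiness check to decide whether a temporary file is needed.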
@@ -48,10 +52,27 @@


 with tempfile.NamedTemporaryFile() as tmp:
-    in_uncomp = snakemake.input[0]
     if uncomp:
         in_uncomp = tmp.name
-        shell("{uncomp} {snakemake.input[0]} > {in_uncomp}")
+        shell("{uncomp} {snakemake.input[0]} > {tmp.name}")
+    else:
+        in_uncomp = snakemake.input[0]

+    # Auto infer -X value
+    if snakemake.params.get("infer_X", True):
+        # Get total number of lines
+        total_n_lines = sum(1 for line in open(in_uncomp, "rb"))
+        # Get total number of reads (depends on format)
+        total_n_reads = total_n_lines / 4 if in_format == "fastq" else total_n_lines / 2
+        # Get total number of reads to sample
+        sample_n_reads = max(1, int(total_n_reads * 0.1) - 1)
+        # Get total number of reads to sample, depending on defaults
+        sample_n_reads = (
+            min(1000, sample_n_reads)
+            if snakemake.params.alg == "alignment"
+            else min(10000, sample_n_reads)
+        )
+        extra += f" -X {sample_n_reads}"
+
     shell(
         "nonpareil"
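
To make the `-X` arithmetic easier to follow, here is a minimal standalone sketch of the same calculation. The function name `infer_max_query_reads` is hypothetical; the 10% fraction, the `- 1`, and the caps of 1000 and 10000 are taken directly from the diff. As in the wrapper, counting two lines per FASTA record assumes sequences are not wrapped across multiple lines.

```python
def infer_max_query_reads(total_n_lines, in_format, alg):
    """Sketch of the wrapper's -X inference (hypothetical helper name)."""
    # 4 lines per FASTQ record; 2 per FASTA record (assumes unwrapped sequences)
    lines_per_read = 4 if in_format == "fastq" else 2
    total_n_reads = total_n_lines / lines_per_read
    # Take roughly 10% of the reads, minus one, but never fewer than 1
    sample_n_reads = max(1, int(total_n_reads * 0.1) - 1)
    # Cap the value: 1000 for "alignment", 10000 otherwise (e.g. "kmer")
    cap = 1000 if alg == "alignment" else 10000
    return min(cap, sample_n_reads)


# A tiny 2-read FASTQ still yields a usable value of 1
print(infer_max_query_reads(8, "fastq", "kmer"))
# A 250,000-read FASTQ is capped at 1000 for the alignment algorithm
print(infer_max_query_reads(1_000_000, "fastq", "alignment"))
```

Because the result is appended to `extra` as `-X <n>`, a user-supplied `-X` in `extra` would appear twice on the command line when `infer_X` is left enabled, which is why `meta.yaml` asks for `-X` to be omitted in that case.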