fix: Bwa mem threads #1743

Merged
merged 53 commits into from Aug 25, 2023

Conversation

tdayris
Contributor

@tdayris tdayris commented Aug 16, 2023

Description

Currently, master/bio/bwa-mem2/mem uses more threads than the user requests. For x threads requested, bwa-mem uses x, and then either samtools (view or sort) uses another x, or Picard uses 1. So for x threads requested, between x+1 and 2x threads are actually used.

Where I work, this causes bwa-mem jobs that use the official wrappers to be killed by the cluster manager for breaking fair-use rules. Coworkers asked for a fix.

In this PR, I introduce a function that splits the total number of threads between bwa-mem on the one hand and samtools/Picard on the other.

Finally, the changes also allow uncompressed SAM output.

The tests have been modified because this wrapper now requires at least two threads when samtools or Picard is used. This requirement does not apply when SAM output is requested.
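As an illustration of the thread split described above, here is a minimal sketch. The helper name `split_threads`, its `postprocess` flag, and the even-split policy are assumptions made for this example; the function actually merged in the wrapper may divide threads differently.

```python
from __future__ import annotations


def split_threads(total: int, postprocess: bool = True) -> tuple[int, int]:
    """Split a thread budget between the aligner and the post-processing tool.

    Hypothetical helper: the even split is an assumption for illustration.
    Returns (aligner_threads, postprocess_threads), which always sum to `total`.
    """
    if total < 1:
        raise ValueError("at least one thread is required")
    if not postprocess:
        # Plain SAM output: nothing is piped into samtools/Picard,
        # so the aligner can use the whole budget.
        return total, 0
    if total < 2:
        raise ValueError(
            "at least two threads are required when piping into samtools/Picard"
        )
    # Give the aligner the larger share when the budget is odd.
    post = total // 2
    return total - post, post
```

Whatever the policy, the point is that the two values never add up to more than the number of threads requested in the rule.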

QC

  • I confirm that:

For all wrappers added by this PR,

  • there is a test case which covers any introduced changes,
  • input: and output: file paths in the resulting rule can be changed arbitrarily,
  • either the wrapper can only use a single core, or the example rule contains a threads: x statement with x being a reasonable default,
  • rule names in the test case are in snake_case and somehow tell what the rule is about or match the tools purpose or name (e.g., map_reads for a step that maps reads),
  • all environment.yaml specifications follow the respective best practices,
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:),
  • all fields of the example rules in the Snakefiles and their entries are explained via comments (input:/output:/params: etc.),
  • stderr and/or stdout are logged correctly (log:), depending on the wrapped tool,
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to (see here; this also means that using any Python tempfile default behavior works),
  • the meta.yaml contains a link to the documentation of the respective tool or command,
  • Snakefiles pass the linting (snakemake --lint),
  • Snakefiles are formatted with snakefmt,
  • Python wrapper scripts are formatted with black.
  • Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use conda-forge, bioconda, nodefaults (conda-forge should have highest priority, and defaults channels are usually not needed because most packages are in conda-forge nowadays).

tdayris and others added 30 commits September 21, 2020 09:16
* perf: update bio/bcftools/index/environment.yaml.

* perf: update bio/bcftools/index/environment.yaml.

* perf: update bio/bcftools/index/environment.yaml.
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
* Add autobump action

* fix paths

* dbg

* dbg branch

* add checkout

* dbg

* trigger rerun

* entity regex and add label

* dbg

* Update autobump.yml

* Update autobump.yml
Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
johanneskoester and others added 17 commits October 13, 2022 14:25
Automatic update of bio/deepvariant.

Co-authored-by: snakedeploy-bot[bot] <115615832+snakedeploy-bot[bot]@users.noreply.github.com>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
@fgvieira
Collaborator

fgvieira commented Aug 16, 2023

This issue has appeared before in #32 and at the time it was concluded it was fine, since bwa and samtools/picard never run at the same time. Is this not the case for bwa-mem2?

@tdayris
Contributor Author

tdayris commented Aug 21, 2023

I ran /usr/bin/time -v bwa-mem2 mem -t 4 ... | samtools sort -@ 3 ... on 10 real human WES datasets.

This should report a CPU percentage close to but below 800% if I am right, or close to but below 400% if you are right. My tests show 470% ± 15%.

Bwa and samtools do not run all their workers at the same time. At some point, however, usage goes above 4 threads and almost reaches 5.

My solution is not optimal, as it largely overestimates the number of threads used.
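For readers unfamiliar with the figure, GNU time computes its CPU percentage as (user time + system time) / elapsed time, so 100% corresponds to one fully busy core on average. The sketch below shows that arithmetic and a check against a thread allocation; the function names are hypothetical and are not part of the wrapper.

```python
def cpu_percent(user_s: float, sys_s: float, elapsed_s: float) -> float:
    """CPU percentage as GNU time reports it: (user + system) / elapsed."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return 100.0 * (user_s + sys_s) / elapsed_s


def exceeds_allocation(percent: float, threads: int) -> bool:
    """True if average CPU usage is above `threads` fully busy cores."""
    return percent > 100.0 * threads
```

With the numbers from this comment, `exceeds_allocation(470, 4)` is true: an average of 4.7 busy cores against an allocation of 4. Because this is an average over the whole run, short peaks can be higher than the reported value.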

@fgvieira
Collaborator

fgvieira commented Aug 21, 2023

Should return a percentage of CPU used close-to-but-below 800% if I were right, or close-to-but-below 400% if you were right. My tests show this number being 470% ± 15%.

If so, can't we just give samtools (or whichever is not the limiting step) one thread less?
So, when running the wrapper with 4 threads, it would run /usr/bin/time -v bwa-mem2 mem -t 4 ... | samtools sort -@ 2.
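A minimal sketch of this suggestion follows, assuming the aligner keeps the full thread count and `samtools sort` receives `threads - 2` for `-@`, which matches the `-t 4` / `-@ 2` example above. The helper name `thread_flags` is hypothetical, and the merged wrapper may compute the values differently.

```python
from __future__ import annotations


def thread_flags(threads: int) -> tuple[str, str]:
    """Return (bwa-mem2 flag, samtools sort flag) for a given thread budget.

    Hypothetical helper. Assumption: the aligner is the limiting step and
    keeps every thread, while samtools gets one thread less than before
    (threads - 2 instead of threads - 1), never below zero.
    """
    if threads < 1:
        raise ValueError("at least one thread is required")
    extra = max(0, threads - 2)
    return f"-t {threads}", f"-@ {extra}"
```

For a rule declared with `threads: 4`, this yields `-t 4` for bwa-mem2 and `-@ 2` for samtools sort.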

@tdayris
Contributor Author

tdayris commented Aug 23, 2023

New tests on the very same 15 human WES mappings show 380% ± 15% CPU usage. I no longer get kicked. Thank you for your advice.

@johanneskoester johanneskoester merged commit e35e312 into snakemake:master Aug 25, 2023
6 checks passed
johanneskoester pushed a commit that referenced this pull request Aug 25, 2023
🤖 I have created a release \*beep\* \*boop\*
---
##
[2.6.0](https://www.github.com/snakemake/snakemake-wrappers/compare/v2.5.0...v2.6.0)
(2023-08-25)


### Features

* add galah wrapper
([#1754](https://www.github.com/snakemake/snakemake-wrappers/issues/1754))
([083688a](https://www.github.com/snakemake/snakemake-wrappers/commit/083688a059439b1b886ac2db95fd53530d5bef11))
* Enhanced-volcano
([#1521](https://www.github.com/snakemake/snakemake-wrappers/issues/1521))
([0bd316d](https://www.github.com/snakemake/snakemake-wrappers/commit/0bd316db902ed47d36250b4464f5d8710b295a61))
* Immunedeconv
([#1741](https://www.github.com/snakemake/snakemake-wrappers/issues/1741))
([97b5bde](https://www.github.com/snakemake/snakemake-wrappers/commit/97b5bdec2bcff9b26de7e6889cba72521b845e99))


### Bug Fixes

* Bwa mem threads
([#1743](https://www.github.com/snakemake/snakemake-wrappers/issues/1743))
([e35e312](https://www.github.com/snakemake/snakemake-wrappers/commit/e35e31219af8e7bf7b2f7174ddd7ade93abf7cad))


### Performance Improvements

* autobump bio/hifiasm
([#1768](https://www.github.com/snakemake/snakemake-wrappers/issues/1768))
([5795e2c](https://www.github.com/snakemake/snakemake-wrappers/commit/5795e2c31d0d6742908223fb7ff86fb186dd09f5))
* autobump bio/sourmash/compute
([#1767](https://www.github.com/snakemake/snakemake-wrappers/issues/1767))
([412f289](https://www.github.com/snakemake/snakemake-wrappers/commit/412f2892dc44c7218656b23dc6d83cb15e15eae0))
* autobump bio/vg/prune
([#1769](https://www.github.com/snakemake/snakemake-wrappers/issues/1769))
([fe30289](https://www.github.com/snakemake/snakemake-wrappers/commit/fe302896b2550585c257ac0311ed9c5ee462a2dd))
* update datavzrd 2.23.8
([#1764](https://www.github.com/snakemake/snakemake-wrappers/issues/1764))
([2f76671](https://www.github.com/snakemake/snakemake-wrappers/commit/2f766717bbf35dfaf748c02757b4f5eef0ff96ba))
---


This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

5 participants