feat: encode fastq downloader #1798

Merged
merged 24 commits into from Oct 30, 2023

Conversation

@Maarten-vd-Sande Maarten-vd-Sande commented Sep 12, 2023

Description

Download fastq files directly from the ENCODE project.

Added python as a dependency because wrapper needs an environment: snakemake/snakemake#1718

QC

  • I confirm that:

For all wrappers added by this PR,

  • there is a test case which covers any introduced changes,
  • input: and output: file paths in the resulting rule can be changed arbitrarily,
  • either the wrapper can only use a single core, or the example rule contains a threads: x statement with x being a reasonable default,
  • rule names in the test case are in snake_case and somehow tell what the rule is about or match the tool's purpose or name (e.g., map_reads for a step that maps reads),
  • all environment.yaml specifications follow the respective best practices,
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:),
  • all fields of the example rules in the Snakefiles and their entries are explained via comments (input:/output:/params: etc.),
  • stderr and/or stdout are logged correctly (log:), depending on the wrapped tool,
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to (see here; this also means that using any Python tempfile default behavior works),
  • the meta.yaml contains a link to the documentation of the respective tool or command,
  • Snakefiles pass the linting (snakemake --lint),
  • Snakefiles are formatted with snakefmt,
  • Python wrapper scripts are formatted with black.
  • Conda environments use a minimal number of channels, in the recommended order. E.g. for bioconda, use conda-forge, bioconda, nodefaults (conda-forge should have the highest priority, and the defaults channels are usually not needed because most packages are in conda-forge nowadays).
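
As an illustration of the temporary-file item in the checklist above, here is a minimal sketch of relying on Python's `tempfile` defaults, which place files under the directory that `tempfile.gettempdir()` points to. The file name `partial_download.fastq.gz` is a hypothetical placeholder and is not taken from the wrapper in this PR.

```python
import os
import tempfile

# With no `dir` argument, TemporaryDirectory creates its folder under
# tempfile.gettempdir(), so relying on the default follows the guideline.
with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical scratch file name, used only for illustration.
    scratch_path = os.path.join(tmpdir, "partial_download.fastq.gz")
    with open(scratch_path, "wb") as handle:
        handle.write(b"")  # placeholder for intermediate data

    # Compare resolved paths so that symlinked temp locations still match.
    created_under_default = os.path.realpath(
        os.path.dirname(tmpdir)
    ) == os.path.realpath(tempfile.gettempdir())
    existed_inside_block = os.path.exists(scratch_path)

# The directory and everything in it are removed on leaving the `with` block.
cleaned_up = not os.path.exists(tmpdir)
```

Because the directory is cleaned up automatically, nothing is left behind in the working directory if the rule fails partway through.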

@Maarten-vd-Sande
Contributor Author

@dlaehnemann could you take a look and see whether this is what you had in mind for the seq2science review?

Contributor

@dlaehnemann dlaehnemann left a comment

Yes, this is exactly what I was thinking of. This is really great, many thanks!

I already have some small suggestions, but will have a more thorough look tomorrow.

Comment on lines 3 to 4
"{accession}_R1.fastq.gz",
"{accession}_R2.fastq.gz"
Contributor

I think you could work with named outputs here, to avoid relying on the order of specified outputs. Here, I would suggest:

Suggested change
- "{accession}_R1.fastq.gz",
- "{accession}_R2.fastq.gz"
+ r1="{accession}_R1.fastq.gz",
+ r2="{accession}_R2.fastq.gz"

Below, for the single-end case, I would then suggest something like se= or r0=. Then, you can check for the presence of r1 and r2, or se/r0 in snakemake.output.
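
To illustrate why named outputs are more robust than positional ones, here is a minimal sketch. `FakeOutput` is a hypothetical stand-in for the `snakemake.output` object that a wrapper receives at run time, and the `sample_` file names are placeholders; only the `r1`/`r2` names come from the suggestion above.

```python
class FakeOutput:
    """Hypothetical stand-in for snakemake.output (positional and named access)."""

    def __init__(self, **named):
        # Keyword arguments keep their declaration order.
        self._named = dict(named)

    def __getitem__(self, index):
        # Positional access depends on the order in which outputs were declared.
        return list(self._named.values())[index]

    def get(self, name, default=None):
        # Named access is independent of declaration order.
        return self._named.get(name, default)


# The same two files, declared in opposite order.
declared_r1_first = FakeOutput(r1="sample_R1.fastq.gz", r2="sample_R2.fastq.gz")
declared_r2_first = FakeOutput(r2="sample_R2.fastq.gz", r1="sample_R1.fastq.gz")

# Positional lookup silently changes meaning when the order changes...
positional_differs = declared_r1_first[0] != declared_r2_first[0]

# ...whereas lookup by name always returns the intended file.
named_agrees = declared_r1_first.get("r1") == declared_r2_first.get("r1")
```

With named outputs, a user can reorder or rename the file paths in the rule without the wrapper mixing up the two mates.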

Contributor Author

I'm not sure what to call the single-end named outputs. Based on Illumina's explanation, I'm thinking of just naming them r1 for single-end, and r1 + r2 for paired-end: https://knowledge.illumina.com/software/general/software-general-reference_material-list/000002211

For a single-read run, one Read 1 (R1) FASTQ file is created for each sample per flow cell lane. For a paired-end run, one R1 and one Read 2 (R2) FASTQ file is created for each sample for each lane. FASTQ files are compressed and created with the extension *.fastq.gz.

I'm not entirely convinced of this, but perhaps it is better than inventing our own format?

Contributor

You are right, from the general logic of Illumina, single-end reads simply sequence read1. se and r0 are just things I have seen in the wild and never questioned. But it sounds better and more consistent to go for your suggestion:

  • r1= only for single-end
  • r1= and r2= for paired-end

And you can still clearly introduce logic for handling both cases (and throwing an error if neither is the case). Thanks for being so thorough!
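
The logic for handling both cases could look roughly like the sketch below. It is not the code merged in this PR: `detect_layout` is a hypothetical helper, and the choice to reject any set of names other than exactly `r1` or `r1` plus `r2` is an assumption made for illustration.

```python
def detect_layout(output_names):
    """Return "single" or "paired" based on which named outputs are present.

    `output_names` is any collection of output names, for example the names
    of the wrapper's named outputs (how they are obtained is left out here).
    """
    names = set(output_names)
    if names == {"r1"}:
        return "single"
    if names == {"r1", "r2"}:
        return "paired"
    # Neither layout matches: fail loudly instead of guessing.
    raise ValueError(
        "Expected named outputs 'r1' (single-end) or 'r1' and 'r2' "
        f"(paired-end), but got: {sorted(names)}"
    )
```

Raising an error for anything else, such as `r2` on its own, surfaces a misconfigured rule before any download is attempted.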

bio/encode_fastq_downloader/environment.yaml (outdated, resolved)
bio/encode_fastq_downloader/wrapper.py (outdated, resolved)
Contributor

@dlaehnemann dlaehnemann left a comment

Just another mini-review of this whole environment.yaml conundrum...

bio/encode_fastq_downloader/environment.yaml (outdated, resolved)
bio/encode_fastq_downloader/environment.yaml (outdated, resolved)
Maarten-vd-Sande and others added 2 commits September 13, 2023 11:24
Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>
Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>
@Maarten-vd-Sande Maarten-vd-Sande marked this pull request as ready for review September 13, 2023 10:03
Maarten-vd-Sande and others added 2 commits September 16, 2023 10:27
This way, logs get put into a subdirectory and we avoid filling up the main `logs/` directory with too many log files.
@johanneskoester johanneskoester merged commit 1cc3e00 into snakemake:master Oct 30, 2023
6 checks passed
johanneskoester pushed a commit that referenced this pull request Oct 30, 2023
🤖 I have created a release \*beep\* \*boop\*
---
## [2.9.0](https://www.github.com/snakemake/snakemake-wrappers/compare/v2.8.0...v2.9.0) (2023-10-30)


### Features

* CNV Facets
([#1773](https://www.github.com/snakemake/snakemake-wrappers/issues/1773))
([74f5e4a](https://www.github.com/snakemake/snakemake-wrappers/commit/74f5e4a72ebb3abed014380314e63ca3db9f36f4))
* encode fastq downloader
([#1798](https://www.github.com/snakemake/snakemake-wrappers/issues/1798))
([1cc3e00](https://www.github.com/snakemake/snakemake-wrappers/commit/1cc3e00c6bbb3761d1ffd07b26acd18a1caa746d))
* for bwa, auto infer block size, extra tests, code cleanup and add docs
([#1774](https://www.github.com/snakemake/snakemake-wrappers/issues/1774))
([66940e3](https://www.github.com/snakemake/snakemake-wrappers/commit/66940e3c69e1a06a6e9b771d10e29b9eb03d9f24))
* Gseapy
([#1822](https://www.github.com/snakemake/snakemake-wrappers/issues/1822))
([2a50eb0](https://www.github.com/snakemake/snakemake-wrappers/commit/2a50eb0b3567843f0082496f84999d1a9a08e2ab))
* unaligned bam input support for minimap2 alignment
([#1863](https://www.github.com/snakemake/snakemake-wrappers/issues/1863))
([76280a5](https://www.github.com/snakemake/snakemake-wrappers/commit/76280a592677e81dc092c66351bc6eb7801da172))


### Bug Fixes

* for nonpareil, use pigz and pbzip2 and auto infer of -X
([#1776](https://www.github.com/snakemake/snakemake-wrappers/issues/1776))
([45860bf](https://www.github.com/snakemake/snakemake-wrappers/commit/45860bfc1a1509311182f7057f4b7a6210be0423))
* moving to utils
([#1770](https://www.github.com/snakemake/snakemake-wrappers/issues/1770))
([b5c0c01](https://www.github.com/snakemake/snakemake-wrappers/commit/b5c0c016b6a3c9c46672d5e5ee13bda934cbb970))


### Performance Improvements

* autopin bio/bwa/mem
([#1907](https://www.github.com/snakemake/snakemake-wrappers/issues/1907))
([99e9f60](https://www.github.com/snakemake/snakemake-wrappers/commit/99e9f604eba4e77c4b3f69cad0e25114c72ff1fd))
* autopin bio/multiqc
([#1906](https://www.github.com/snakemake/snakemake-wrappers/issues/1906))
([6c67666](https://www.github.com/snakemake/snakemake-wrappers/commit/6c676668b49210d8e99bec6948003421528ac5c4))
---


This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>