feat: encode fastq downloader #1798

Maarten-vd-Sande · 2023-09-12T11:38:34Z

Description

Download fastq files directly from the ENCODE project.

Added Python as a dependency because the wrapper needs an environment (see snakemake/snakemake#1718).
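For readers unfamiliar with ENCODE downloads, here is a minimal sketch of how a download URL for a single fastq file might be built. It is not the wrapper's actual code: the helper name `encode_fastq_url`, the accession pattern, and the URL layout are assumptions made for illustration, and nothing is fetched over the network.

```python
import re

# Assumption: ENCODE file accessions look like "ENCFF" + 3 digits + 3 letters,
# and files are served under /files/<accession>/@@download/.
ENCODE_BASE = "https://www.encodeproject.org"
ACCESSION_RE = re.compile(r"ENCFF[0-9]{3}[A-Z]{3}")


def encode_fastq_url(accession: str) -> str:
    """Build a download URL for one ENCODE fastq file (hypothetical helper)."""
    if not ACCESSION_RE.fullmatch(accession):
        raise ValueError(f"not an ENCODE file accession: {accession!r}")
    return f"{ENCODE_BASE}/files/{accession}/@@download/{accession}.fastq.gz"
```

Validating the accession before building the URL keeps a malformed wildcard value from turning into a confusing HTTP error later on.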

QC

I confirm that:

For all wrappers added by this PR,

there is a test case which covers any introduced changes,
input: and output: file paths in the resulting rule can be changed arbitrarily,
either the wrapper can only use a single core, or the example rule contains a threads: x statement with x being a reasonable default,
rule names in the test case are in snake_case and somehow tell what the rule is about or match the tool's purpose or name (e.g., map_reads for a step that maps reads),
all environment.yaml specifications follow the respective best practices,
wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:),
all fields of the example rules in the Snakefiles and their entries are explained via comments (input:/output:/params: etc.),
stderr and/or stdout are logged correctly (log:), depending on the wrapped tool,
temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to (see here; this also means that using any Python tempfile default behavior works),
the meta.yaml contains a link to the documentation of the respective tool or command,
Snakefiles pass the linting (snakemake --lint),
Snakefiles are formatted with snakefmt,
Python wrapper scripts are formatted with black.
Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use conda-forge, bioconda, nodefaults (conda-forge should have the highest priority, and the defaults channel is usually not needed because most packages are in conda-forge nowadays).
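The checklist item on temporary files can be illustrated with a short sketch. It assumes a hypothetical helper `write_scratch_file` and a made-up file name; the point is only that Python's `tempfile` defaults already place scratch data below `tempfile.gettempdir()` and clean it up afterwards.

```python
import os
import tempfile


def write_scratch_file(payload: bytes) -> tuple[str, bool]:
    """Write payload to a throwaway file; report where the scratch dir lived."""
    # With no dir= argument, TemporaryDirectory creates its folder
    # inside tempfile.gettempdir().
    with tempfile.TemporaryDirectory(prefix="wrapper_") as tmpdir:
        scratch = os.path.join(tmpdir, "partial.fastq.gz")  # hypothetical name
        with open(scratch, "wb") as handle:
            handle.write(payload)
        in_default_location = os.path.dirname(tmpdir) == tempfile.gettempdir()
    # Leaving the with-block removes the directory and everything in it.
    return tmpdir, in_default_location
```

Because the location is taken from `tempfile.gettempdir()`, users can redirect scratch data (for example via the `TMPDIR` environment variable) without touching the wrapper.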

Maarten-vd-Sande · 2023-09-12T11:40:08Z

@dlaehnemann could you take a look if this is what you had in mind for the seq2science review?

dlaehnemann

Yes, this is exactly what I was thinking of. This is really great, many thanks!

I already have some small suggestions, but will have a more thorough look tomorrow.

dlaehnemann · 2023-09-12T18:25:35Z

bio/encode_fastq_downloader/test/Snakefile

+        "{accession}_R1.fastq.gz",
+        "{accession}_R2.fastq.gz"


I think you could work with named outputs here, to avoid relying on the order of specified outputs. Here, I would suggest:

Suggested change

-        "{accession}_R1.fastq.gz",
-        "{accession}_R2.fastq.gz"
+        r1="{accession}_R1.fastq.gz",
+        r2="{accession}_R2.fastq.gz"

Below, for the single-end case, I would then suggest something like se= or r0=. Then, you can check for the presence of r1 and r2, or se/r0 in snakemake.output.

Maarten-vd-Sande · 2023-09-15

I'm not sure what to call the single-end named outputs. Based on Illumina's explanation, I'm thinking of just naming them r1 for single-end, and r1 + r2 for paired-end: https://knowledge.illumina.com/software/general/software-general-reference_material-list/000002211

For a single-read run, one Read 1 (R1) FASTQ file is created for each sample per flow cell lane. For a paired-end run, one R1 and one Read 2 (R2) FASTQ file is created for each sample for each lane. FASTQ files are compressed and created with the extension *.fastq.gz.

I'm not entirely convinced of this, but it is perhaps better than inventing our own format?

dlaehnemann · 2023-09-15

You are right: following Illumina's general logic, single-end runs simply sequence read 1. se and r0 are just things I have seen in the wild and never questioned. Your suggestion sounds better and more consistent:

r1= only for single-end

r1= and r2= for paired-end

And you can still clearly introduce logic for handling both cases (and throwing an error if neither is the case). Thanks for being so thorough!
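The agreed naming scheme could be checked along these lines. This is a sketch rather than the wrapper's actual code: `detect_layout` is a hypothetical helper, and inside a wrapper the names would presumably come from something like `snakemake.output.keys()`.

```python
def detect_layout(output_names) -> str:
    """Return "single" or "paired" based on the named outputs of a rule."""
    names = set(output_names)
    if names == {"r1"}:
        # r1= only: single-end
        return "single"
    if names == {"r1", "r2"}:
        # r1= and r2=: paired-end
        return "paired"
    # Neither case applies, so fail early with a descriptive message.
    raise ValueError(
        "expected named outputs 'r1' (single-end) or 'r1' and 'r2' "
        f"(paired-end), got: {sorted(names)}"
    )
```

Keying on output names rather than positions means the rule keeps working if someone reorders the `output:` entries.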

bio/encode_fastq_downloader/environment.yaml

bio/encode_fastq_downloader/wrapper.py

Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>

dlaehnemann

Just another mini-review of this whole environment.yaml conundrum...

bio/encode_fastq_downloader/environment.yaml

Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>

Maarten-vd-Sande added 2 commits September 12, 2023 13:37

encode fastq downloader

54fd7ef

update meta

7bf2991

Maarten-vd-Sande and others added 10 commits September 12, 2023 13:40

smaller test

3837a23

black

3112a4b

add some requirements for the test...

d89c0c2

newline

f65b66a

fix mistake

74ea56b

Merge branch 'encode_download' of github.com:Maarten-vd-Sande/snakema…

6c4b76e

…ke-wrappers into encode_download

fix mistakes

d8ae4d3

better error handling

d3336cc

black again :)

21c64aa

Merge branch 'snakemake:master' into encode_download

f9434e9

dlaehnemann reviewed Sep 12, 2023

Maarten-vd-Sande and others added 6 commits September 13, 2023 09:27

remove environment.yaml

8e40df1

Update bio/encode_fastq_downloader/wrapper.py

ba96806

Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>

se vs pe and empty env.yaml

676a070

Merge branch 'encode_download' of github.com:Maarten-vd-Sande/snakema…

c208cdf

…ke-wrappers into encode_download

Update environment.yaml

ec522fa

Update environment.yaml

baa781d

dlaehnemann reviewed Sep 13, 2023

bio/encode_fastq_downloader/environment.yaml (outdated, resolved)

bio/encode_fastq_downloader/environment.yaml (outdated, resolved)

Maarten-vd-Sande and others added 2 commits September 13, 2023 11:24

Update bio/encode_fastq_downloader/environment.yaml

9c3a375

Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>

Update bio/encode_fastq_downloader/environment.yaml

a2223fb

Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>

Maarten-vd-Sande marked this pull request as ready for review September 13, 2023 10:03

Maarten-vd-Sande and others added 2 commits September 16, 2023 10:27

Merge branch 'master' into encode_download

7e7808e

create logs/ subdirectory in example rules

1b74ead

This way, logs get put into a subdirectory and we avoid filling up the main `logs/` directory with too many log files.

johanneskoester approved these changes Oct 26, 2023

johanneskoester added 2 commits October 26, 2023 10:49

Merge branch 'master' into encode_download

de70c6b

Update test.py

775a2c0

johanneskoester merged commit 1cc3e00 into snakemake:master Oct 30, 2023
6 checks passed
