feat: encode fastq downloader #1798
@dlaehnemann could you take a look if this is what you had in mind for the seq2science review?
…ke-wrappers into encode_download
Yes, this is exactly what I was thinking of. This is really great, many thanks!
I already have some small suggestions, but will take a more thorough look tomorrow.
    "{accession}_R1.fastq.gz",
    "{accession}_R2.fastq.gz"
I think you could work with named outputs here, to avoid relying on the order of specified outputs. Here, I would suggest:
    - "{accession}_R1.fastq.gz",
    - "{accession}_R2.fastq.gz"
    + r1="{accession}_R1.fastq.gz",
    + r2="{accession}_R2.fastq.gz"
Below, for the single-end case, I would then suggest something like `se=` or `r0=`. Then, you can check for the presence of `r1` and `r2`, or `se`/`r0`, in `snakemake.output`.
I'm not sure what to call the single-end named outputs. Based on Illumina's explanation, I'm thinking of just naming them r1 for single-end, and r1 + r2 for paired-end: https://knowledge.illumina.com/software/general/software-general-reference_material-list/000002211
> For a single-read run, one Read 1 (R1) FASTQ file is created for each sample per flow cell lane. For a paired-end run, one R1 and one Read 2 (R2) FASTQ file is created for each sample for each lane. FASTQ files are compressed and created with the extension *.fastq.gz.
I'm not entirely convinced of this, but it is perhaps better than inventing our own format?
You are right, from the general logic of Illumina, single-end reads simply sequence `read1`. `se` and `r0` are just things I have seen in the wild and never questioned. But it sounds better and more consistent to go for your suggestion:
* `r1=` only for single-end
* `r1=` and `r2=` for paired-end

And you can still clearly introduce logic for handling both cases (and throwing an error if neither is the case). Thanks for being so thorough!
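As a rough illustration of the logic discussed above (not the code that was merged), a wrapper could decide between single-end and paired-end purely from the names of its outputs. The function name `fastq_layout` is hypothetical, and the sketch assumes that the output names are available as an iterable of strings, for example via `snakemake.output.keys()` inside a wrapper script.

```python
def fastq_layout(output_names):
    """Return 'single' or 'paired' depending on which named outputs exist.

    `output_names` is any iterable of output names. Inside a wrapper this
    is assumed to come from something like `snakemake.output.keys()`.
    """
    names = set(output_names)
    if names == {"r1"}:
        return "single"
    if names == {"r1", "r2"}:
        return "paired"
    # Neither case matched: fail loudly instead of guessing.
    raise ValueError(
        "Expected named outputs 'r1' (single-end) or 'r1' and 'r2' "
        f"(paired-end), got: {sorted(names)}"
    )


# Plain lists stand in for the names of snakemake.output here.
print(fastq_layout(["r1"]))
print(fastq_layout(["r1", "r2"]))
```

Comparing sets rather than positions keeps the check independent of the order in which the outputs are declared, which is the point of switching to named outputs.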
Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>
…ke-wrappers into encode_download
Just another mini-review of this whole `environment.yaml` conundrum...
This way, logs get put into a subdirectory and we avoid filling up the main `logs/` directory with too many log files.
🤖 I have created a release \*beep\* \*boop\*

---

## [2.9.0](https://www.github.com/snakemake/snakemake-wrappers/compare/v2.8.0...v2.9.0) (2023-10-30)

### Features

* CNV Facets ([#1773](https://www.github.com/snakemake/snakemake-wrappers/issues/1773)) ([74f5e4a](https://www.github.com/snakemake/snakemake-wrappers/commit/74f5e4a72ebb3abed014380314e63ca3db9f36f4))
* encode fastq downloader ([#1798](https://www.github.com/snakemake/snakemake-wrappers/issues/1798)) ([1cc3e00](https://www.github.com/snakemake/snakemake-wrappers/commit/1cc3e00c6bbb3761d1ffd07b26acd18a1caa746d))
* for bwa, auto infer block size, extra tests, code cleanup and add docs ([#1774](https://www.github.com/snakemake/snakemake-wrappers/issues/1774)) ([66940e3](https://www.github.com/snakemake/snakemake-wrappers/commit/66940e3c69e1a06a6e9b771d10e29b9eb03d9f24))
* Gseapy ([#1822](https://www.github.com/snakemake/snakemake-wrappers/issues/1822)) ([2a50eb0](https://www.github.com/snakemake/snakemake-wrappers/commit/2a50eb0b3567843f0082496f84999d1a9a08e2ab))
* unaligned bam input support for minimap2 alignment ([#1863](https://www.github.com/snakemake/snakemake-wrappers/issues/1863)) ([76280a5](https://www.github.com/snakemake/snakemake-wrappers/commit/76280a592677e81dc092c66351bc6eb7801da172))

### Bug Fixes

* for nonpareil, use pigz and pbzip2 and auto infer of -X ([#1776](https://www.github.com/snakemake/snakemake-wrappers/issues/1776)) ([45860bf](https://www.github.com/snakemake/snakemake-wrappers/commit/45860bfc1a1509311182f7057f4b7a6210be0423))
* moving to utils ([#1770](https://www.github.com/snakemake/snakemake-wrappers/issues/1770)) ([b5c0c01](https://www.github.com/snakemake/snakemake-wrappers/commit/b5c0c016b6a3c9c46672d5e5ee13bda934cbb970))

### Performance Improvements

* autopin bio/bwa/mem ([#1907](https://www.github.com/snakemake/snakemake-wrappers/issues/1907)) ([99e9f60](https://www.github.com/snakemake/snakemake-wrappers/commit/99e9f604eba4e77c4b3f69cad0e25114c72ff1fd))
* autopin bio/multiqc ([#1906](https://www.github.com/snakemake/snakemake-wrappers/issues/1906)) ([6c67666](https://www.github.com/snakemake/snakemake-wrappers/commit/6c676668b49210d8e99bec6948003421528ac5c4))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Description
Download fastq files directly from the ENCODE project.
Added `python` as a dependency because the wrapper needs an environment: snakemake/snakemake#1718
QC
For all wrappers added by this PR,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* … `threads: x` statement with `x` being a reasonable default,
* … `map_reads` for a step that maps reads),
* … `environment.yaml` specifications follow the respective best practices,
* … `input:` or `output:`),
* … `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* … `tempfile.gettempdir()` points to (see here; this also means that using any Python `tempfile` default behavior works),
* … `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with snakefmt,
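
To illustrate the checklist item about temporary files: a minimal sketch, using only the Python standard library, of how the default `tempfile` behavior places scratch data under the directory that `tempfile.gettempdir()` points to. The file name `download.partial` and the FASTQ record are made up for illustration and are not taken from the wrapper.

```python
import tempfile
from pathlib import Path

# Without a `dir=` argument, TemporaryDirectory() creates its folder
# directly under tempfile.gettempdir().
with tempfile.TemporaryDirectory() as tmpdir:
    scratch = Path(tmpdir) / "download.partial"  # hypothetical file name
    scratch.write_bytes(b"@read1\nACGT\n+\nIIII\n")
    parent_of_tmpdir = Path(tmpdir).parent
    scratch_existed = scratch.exists()

# The directory and its contents are removed when the block exits.
cleaned_up = not Path(tmpdir).exists()
```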