Commit
feat: encode fastq downloader (#1798)
### Description

Download fastq files directly from the ENCODE project.

Added python as a dependency because the wrapper **needs** an environment: snakemake/snakemake#1718

### QC

* [x] I confirm that:

For all wrappers added by this PR,

* there is a test case which covers any introduced changes,
* `input:` and `output:` file paths in the resulting rule can be changed arbitrarily,
* either the wrapper can only use a single core, or the example rule contains a `threads: x` statement with `x` being a reasonable default,
* rule names in the test case are in [snake_case](https://en.wikipedia.org/wiki/Snake_case) and somehow tell what the rule is about or match the tool's purpose or name (e.g., `map_reads` for a step that maps reads),
* all `environment.yaml` specifications follow [the respective best practices](https://stackoverflow.com/a/64594513/2352071),
* wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in `input:` or `output:`),
* all fields of the example rules in the `Snakefile`s and their entries are explained via comments (`input:`/`output:`/`params:` etc.),
* `stderr` and/or `stdout` are logged correctly (`log:`), depending on the wrapped tool,
* temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function `tempfile.gettempdir()` points to (see [here](https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir); this also means that using any Python `tempfile` default behavior works),
* the `meta.yaml` contains a link to the documentation of the respective tool or command,
* `Snakefile`s pass the linting (`snakemake --lint`),
* `Snakefile`s are formatted with [snakefmt](https://github.com/snakemake/snakefmt),
* Python wrapper scripts are formatted with [black](https://black.readthedocs.io).
* Conda environments use a minimal amount of channels, in recommended ordering. E.g. for bioconda, use (conda-forge, bioconda, nodefaults, as conda-forge should have highest priority and defaults channels are usually not needed because most packages are in conda-forge nowadays).

---------

Co-authored-by: David Laehnemann <david.laehnemann@hhu.de>
Co-authored-by: Johannes Köster <johannes.koester@uni-due.de>
1 parent 2a50eb0, commit 1cc3e00
Showing 5 changed files with 188 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
channels: | ||
  - conda-forge | ||
  - bioconda | ||
  - nodefaults | ||
# There are strictly no dependencies for this wrapper, but the current handling | ||
# of wrappers by snakemake requires an environment.yaml with dependencies. | ||
# Once the underlying issue is fixed, this environment.yaml can be removed. See: | ||
# https://github.com/snakemake/snakemake/issues/1718 | ||
dependencies: | ||
  - python >=3.7 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
name: encode_fastq_downloader | ||
authors: | ||
  - Maarten van der Sande | ||
description: | | ||
  Download fastq files directly from the ENCODE project: https://www.encodeproject.org/ | ||
output: | | ||
  A single fastq.gz file for single-ended data and two for paired-ended data. | ||
notes: | | ||
  * You can use either an ENCODE assay accession (ENCSR) or an ENCODE file accession (ENCFF). The ENCFF identifier needs to refer to a fastq file. | ||
  * When specifying a file accession for paired-end data, BOTH files are always downloaded. The downloaded R1 file is always the R1 file on ENCODE, and vice versa, regardless of whether you specify the R1 or R2 file accession. | ||
  * When multiple sequencing runs belong to a single assay accession, they are all downloaded and concatenated. |
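
The R1/R2 guarantee described in the notes can be illustrated with a small offline sketch. It assumes two already-fetched ENCODE file records exposing the `href` and `paired_end` fields that the wrapper reads; the function name `order_mates` and the sample records are hypothetical and exist only for illustration.

```python
ENCODE_BASE = "https://www.encodeproject.org"


def order_mates(record, mate_record):
    """Return (r1_url, r2_url) regardless of which mate was requested."""
    url = ENCODE_BASE + record["href"]
    mate_url = ENCODE_BASE + mate_record["href"]
    # If the requested file is read 2, its mate is read 1: swap them.
    if record["paired_end"] == "2":
        url, mate_url = mate_url, url
    return url, mate_url


# Hypothetical records; real ENCODE responses carry many more fields.
read1 = {"href": "/files/ENCFF000AAA/@@download/ENCFF000AAA.fastq.gz", "paired_end": "1"}
read2 = {"href": "/files/ENCFF000BBB/@@download/ENCFF000BBB.fastq.gz", "paired_end": "2"}

# Both call orders yield the same (R1, R2) pair.
print(order_mates(read1, read2) == order_mates(read2, read1))
```

Keeping the swap in a pure function like this would make the ordering rule testable without network access; the wrapper itself performs the swap inline.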
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
rule encode_fastq_download_PE: | ||
    output: | ||
        r1="{accession}_R1.fastq.gz", | ||
        r2="{accession}_R2.fastq.gz", | ||
    wildcard_constraints: | ||
        accession="ENC(SR|FF).+", | ||
    log: | ||
        "logs/download_fastq_encode/PE_{accession}.log", | ||
    wrapper: | ||
        "master/bio/encode_fastq_downloader" | ||
|
||
|
||
rule encode_fastq_download_SE: | ||
    output: | ||
        r1="{accession}.fastq.gz", | ||
    wildcard_constraints: | ||
        accession="ENC(SR|FF)((?!_R{1,2}).)+", | ||
    log: | ||
        "logs/download_fastq_encode/SE_{accession}.log", | ||
    wrapper: | ||
        "master/bio/encode_fastq_downloader" |
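
The two `wildcard_constraints` above keep the single-end rule from claiming `_R1`/`_R2` files. The sketch below checks this with plain `re`, assuming the constraint is embedded in a pattern followed by the literal file suffix; this only approximates how Snakemake builds its output regex, and the helper `match_accession` and the accessions are hypothetical. Note that `R{1,2}` is a quantifier (one or two `R` characters), so the lookahead rejects any `_R` rather than specifically `_R1` or `_R2`.

```python
import re

PE_CONSTRAINT = r"ENC(SR|FF).+"
SE_CONSTRAINT = r"ENC(SR|FF)((?!_R{1,2}).)+"


def match_accession(constraint, suffix, filename):
    """Return the accession captured from filename, or None if it does not match."""
    pattern = rf"(?P<accession>{constraint}){re.escape(suffix)}"
    found = re.fullmatch(pattern, filename)
    return found.group("accession") if found else None


# Hypothetical file names used only to exercise the patterns.
print(match_accession(PE_CONSTRAINT, "_R1.fastq.gz", "ENCFF000AAA_R1.fastq.gz"))
print(match_accession(SE_CONSTRAINT, ".fastq.gz", "ENCFF000AAA.fastq.gz"))
print(match_accession(SE_CONSTRAINT, ".fastq.gz", "ENCFF000AAA_R1.fastq.gz"))
```

The first two calls capture the accession `ENCFF000AAA`; the third returns `None`, so a request for an `_R1` file can only be satisfied by the paired-end rule.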
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
import os | ||
import json | ||
import urllib.error | ||
import urllib.request | ||
|
||
from snakemake.shell import shell | ||
|
||
|
||
def exception_to_log(check, msg): | ||
    log = snakemake.log_fmt_shell(stdout=True, stderr=True) | ||
    if not check: | ||
        shell(f"""echo "{msg}" {log} """) | ||
        # exit without stack trace | ||
        os._exit(1) | ||
|
||
|
||
def download_encff(accession, layout, dest): | ||
    exception_to_log( | ||
        check=accession.startswith("ENCFF"), | ||
        msg=f"""Can't download accession "{accession}" directly as it isn't a file. This shouldn't happen.""", | ||
    ) | ||
    url = f"https://www.encodeproject.org/files/{accession}/?format=json" | ||
    try: | ||
        response = urllib.request.urlopen(urllib.request.Request(url)).read() | ||
    except urllib.error.URLError: | ||
        # URLError covers connection failures and HTTPError, which is its subclass | ||
        exception_to_log( | ||
            check=False, | ||
            msg=f"""Having trouble connecting to ENCODE or the accession "{accession}" doesn't exist.""", | ||
        ) | ||
    response = json.loads(response.decode("utf-8")) | ||
|
||
    exception_to_log( | ||
        check=response["file_format"] == "fastq", | ||
        msg=f"""Can't download accession "{accession}" directly as it doesn't refer to a fastq file. It is a "{response["file_format"]}" file.""", | ||
    ) | ||
|
||
    exception_to_log( | ||
        check=layout in ["single", "paired"], | ||
        msg=f"""The layout of the sample is not single ended or paired, but it is "{layout}".""", | ||
    ) | ||
|
||
    if layout == "single": | ||
        url = "https://www.encodeproject.org" + response["href"] | ||
        shell(f"wget -O - -o /dev/null {url} >> {dest.r1}") | ||
    if layout == "paired": | ||
        # lookup the mate | ||
        mate_accession = response["paired_with"].split("/")[2] | ||
        mate_url = f"https://www.encodeproject.org/files/{mate_accession}/?format=json" | ||
        mate_response = json.loads( | ||
            urllib.request.urlopen(urllib.request.Request(mate_url)) | ||
            .read() | ||
            .decode("utf-8") | ||
        ) | ||
|
||
        # get the urls to download them | ||
        url = "https://www.encodeproject.org" + response["href"] | ||
        mate_url = "https://www.encodeproject.org" + mate_response["href"] | ||
|
||
        # if the mate is actually R1, swap them so that R1 always corresponds | ||
        if response["paired_end"] == "2": | ||
            url, mate_url = mate_url, url | ||
|
||
        shell(f"wget -O - -o /dev/null {url} >> {dest.r1}") | ||
        shell(f"wget -O - -o /dev/null {mate_url} >> {dest.r2}") | ||
|
||
|
||
def download_encsr(accession, layout, dest): | ||
    url = f"https://www.encodeproject.org/search/?type=File&dataset=/experiments/{accession}/&file_format=fastq&format=json&frame=object" | ||
    try: | ||
        response = urllib.request.urlopen(urllib.request.Request(url)).read() | ||
    except urllib.error.URLError: | ||
        # URLError covers connection failures and HTTPError, which is its subclass | ||
        exception_to_log( | ||
            check=False, | ||
            msg=f"""Having trouble connecting to ENCODE or the accession "{accession}" doesn't exist.""", | ||
        ) | ||
    response = json.loads(response.decode("utf-8")) | ||
|
||
    # check if all run types are the same | ||
    exception_to_log( | ||
        check=len(set([file["run_type"] for file in response["@graph"]])) == 1, | ||
        msg=f"""Not all the runs of "{accession}" are of the same type: {set([file["run_type"] for file in response["@graph"]])}. It is ambiguous how to proceed.""", | ||
    ) | ||
    inferred_layout = response["@graph"][0]["run_type"] | ||
|
||
    if layout == "single": | ||
        exception_to_log( | ||
            check=inferred_layout == "single-ended", | ||
            msg=f"""The sample was automatically inferred to be single-ended, but it is: "{inferred_layout}".""", | ||
        ) | ||
        for encff_accession in response["@graph"]: | ||
            encff_accession = encff_accession["accession"] | ||
            download_encff(encff_accession, layout, dest) | ||
    elif layout == "paired": | ||
        exception_to_log( | ||
            check=inferred_layout == "paired-ended", | ||
            msg=f"""The sample was automatically inferred to be paired-ended, but it is: "{inferred_layout}".""", | ||
        ) | ||
|
||
        # get all the R1s | ||
        runs = [x["accession"] for x in response["@graph"] if x["paired_end"] == "1"] | ||
|
||
        for run in runs: | ||
            download_encff(run, layout, dest) | ||
    else: | ||
        assert False | ||
|
||
|
||
# determine the layout (single-ended vs paired-ended) | ||
exception_to_log( | ||
    check=len(snakemake.output) in [1, 2], | ||
    msg=f"""The number of specified outputs of this rule should be 1 or 2, but it is "{len(snakemake.output)}".""", | ||
) | ||
if len(snakemake.output) == 1: | ||
    layout = "single" | ||
    exception_to_log( | ||
        check=hasattr(snakemake.output, "r1"), | ||
        msg="""Single-ended data needs to specify its output with r1.""", | ||
    ) | ||
else: | ||
    layout = "paired" | ||
    exception_to_log( | ||
        check=hasattr(snakemake.output, "r1") and hasattr(snakemake.output, "r2"), | ||
        msg="""Paired-ended data needs to specify its output with r1 and r2.""", | ||
    ) | ||
|
||
exception_to_log( | ||
    check=snakemake.wildcards.accession.startswith(("ENCFF", "ENCSR")), | ||
    msg=f"""The sample accession ({snakemake.wildcards.accession}) should start with ENCFF or ENCSR.""", | ||
) | ||
if snakemake.wildcards.accession.startswith("ENCFF"): | ||
    download_encff( | ||
        accession=snakemake.wildcards.accession, layout=layout, dest=snakemake.output | ||
    ) | ||
else: | ||
    download_encsr( | ||
        accession=snakemake.wildcards.accession, layout=layout, dest=snakemake.output | ||
    ) |
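
The run-selection logic in `download_encsr` can be exercised without network access. The following is a minimal sketch, assuming the `@graph` list of an ENCODE search response has already been parsed; it raises `ValueError` where the wrapper logs a message and exits, and the function name `select_runs` and the sample entries are hypothetical.

```python
def select_runs(graph, layout):
    """Return the file accessions to download for one experiment."""
    run_types = {entry["run_type"] for entry in graph}
    if len(run_types) != 1:
        raise ValueError(f"Mixed run types: {sorted(run_types)}")
    expected = {"single": "single-ended", "paired": "paired-ended"}[layout]
    (run_type,) = run_types
    if run_type != expected:
        raise ValueError(f"Expected {expected} data, found {run_type}")
    if layout == "single":
        return [entry["accession"] for entry in graph]
    # For paired data keep only read 1; each mate is looked up per file.
    return [entry["accession"] for entry in graph if entry["paired_end"] == "1"]


# Hypothetical search result with two paired-end sequencing runs.
graph = [
    {"accession": "ENCFF000AAA", "run_type": "paired-ended", "paired_end": "1"},
    {"accession": "ENCFF000BBB", "run_type": "paired-ended", "paired_end": "2"},
    {"accession": "ENCFF000CCC", "run_type": "paired-ended", "paired_end": "1"},
    {"accession": "ENCFF000DDD", "run_type": "paired-ended", "paired_end": "2"},
]
print(select_runs(graph, "paired"))
```

With this input the call returns the two read-1 accessions, `ENCFF000AAA` and `ENCFF000CCC`; in the wrapper, each of them is then passed to `download_encff`, which appends the read to the same output files and thereby concatenates the runs.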