Add ability to download more than 2 FastQ files via FTP and Aspera #260

Open
drpatelh opened this issue Jan 30, 2024 · 3 comments
Labels
enhancement Improvement for existing functionality

Comments

@drpatelh
Member

drpatelh commented Jan 30, 2024

Description of feature

As raised in #259 (comment) and #259 (comment) we need to revisit why we restricted downloading a max of 2 FastQ files via FTP and Aspera.

I vaguely remember this restriction was added because there may have been discrepancies where some runs had 3 FastQ files but only 2 md5sum files, which broke the pipeline. We need to find some examples of database ids that have 3 FastQ files and take a proper look to see if we can accommodate them in the pipeline.
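
A check along these lines might help when hunting for affected ids. This is only a minimal sketch, not the pipeline's code: it assumes the ENA run metadata has already been parsed into a dict and that the `fastq_ftp` and `fastq_md5` fields hold semicolon-separated values. The function names and the example row are hypothetical.

```python
def split_field(value):
    """Split a semicolon-separated ENA field into its non-empty entries."""
    return [item for item in value.split(";") if item]


def fastq_md5_mismatch(row):
    """Return (n_fastq, n_md5) if the two counts differ, otherwise None."""
    n_fastq = len(split_field(row.get("fastq_ftp", "")))
    n_md5 = len(split_field(row.get("fastq_md5", "")))
    return (n_fastq, n_md5) if n_fastq != n_md5 else None


# Made-up row for illustration only: 3 FastQ URLs but just 2 checksums.
example = {
    "run_accession": "RUN_X",
    "fastq_ftp": "host/RUN_X.fastq.gz;host/RUN_X_1.fastq.gz;host/RUN_X_2.fastq.gz",
    "fastq_md5": "aaa;bbb",
}
print(fastq_md5_mismatch(example))
```

Rows where the function returns a tuple are the ones worth inspecting by hand.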

If you do have more than 2 FastQ files, e.g. single-cell data like in issue #144, then you should be able to retrieve these by using the --force_sratools_download parameter.

@drpatelh drpatelh added the enhancement Improvement for existing functionality label Jan 30, 2024
@drpatelh
Member Author

Using the example id in #144, if we run the pipeline with default options, the ENA API only returns 1 FastQ file to download:

run_accession	experiment_accession	sample_accession	secondary_sample_accession	study_accession	secondary_study_accession	submission_accession	run_alias	experiment_alias	sample_alias	study_alias	library_layout	library_selection	library_source	library_strategy	library_name	instrument_model	instrument_platform	base_count	read_count	tax_id	scientific_name	sample_title	experiment_title	study_title	sample_description	fastq_md5	fastq_bytes	fastq_ftp	fastq_galaxy	fastq_aspera
SRR9320616	SRX6088086	SAMN12086751	SRS4989433	PRJNA549480	SRP201778	SRA900583	GSM3895942_r1	GSM3895942	GSM3895942	GSE132901	PAIRED	cDNA	TRANSCRIPTOMIC	RNA-Seq		Illumina HiSeq 2500	ILLUMINA	11857688850	120996825	10090	Mus musculus	Old 3 Kidney	Illumina HiSeq 2500 sequencing: GSM3895942: Old 3 Kidney Mus musculus RNA-Seq	A murine aging cell atlas reveals cell identity and tissue-specific trajectories of aging	Old 3 Kidney	98c939bbae1a1fcf9624905516485b67	7763114613	ftp.sra.ebi.ac.uk/vol1/fastq/SRR932/006/SRR9320616/SRR9320616.fastq.gz	ftp.sra.ebi.ac.uk/vol1/fastq/SRR932/006/SRR9320616/SRR9320616.fastq.gz	fasp.sra.ebi.ac.uk:/vol1/fastq/SRR932/006/SRR9320616/SRR9320616.fastq.gz

However, this sample has 2 additional FastQ files that are flagged as technical and can only be obtained by running sra-tools.

fasterq-dump --threads 6 --split-files --include-technical SRR9320616 --outfile SRR9320616.fastq --progress

SRR9320616_1.fastq
SRR9320616_2.fastq
SRR9320616_3.fastq

This indicates a discrepancy between the read data exposed via the ENA API and what can actually be fetched with sra-tools, where the latter seems to be the source of truth. As a result, I recommend running this pipeline with --force_sratools_download whenever you expect more than 2 FastQ files per sample. I will update the docs accordingly.
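
To make the comparison concrete, here is a rough sketch that counts the FastQ files listed in an ENA metadata row and compares that with the files fasterq-dump wrote to a directory. It is an illustration only, not part of the pipeline: the helper names are hypothetical, it assumes `fastq_ftp` is semicolon-separated, and it assumes split output files are named `<run>_<n>.fastq` as in the listing above. The demo creates empty placeholder files instead of calling fasterq-dump.

```python
import re
import tempfile
from pathlib import Path


def ena_fastq_count(row):
    """Count the FastQ URLs in the (assumed semicolon-separated) fastq_ftp field."""
    return len([url for url in row.get("fastq_ftp", "").split(";") if url])


def sratools_fastq_count(run, directory):
    """Count files named <run>.fastq or <run>_<n>.fastq (optionally .gz)."""
    pattern = re.compile(rf"^{re.escape(run)}(_\d+)?\.fastq(\.gz)?$")
    return sum(1 for path in Path(directory).iterdir() if pattern.match(path.name))


def demo():
    # Values copied from the ENA row for SRR9320616 shown above.
    row = {
        "run_accession": "SRR9320616",
        "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR932/006/SRR9320616/SRR9320616.fastq.gz",
    }
    with tempfile.TemporaryDirectory() as tmp:
        # Empty placeholders standing in for the fasterq-dump output files.
        for n in (1, 2, 3):
            (Path(tmp) / f"SRR9320616_{n}.fastq").touch()
        n_ena = ena_fastq_count(row)
        n_sra = sratools_fastq_count(row["run_accession"], tmp)
    # True when sra-tools yields more files than ENA lists.
    return n_ena, n_sra, n_sra > n_ena


print(demo())
```

The last element of the returned tuple flags runs where the ENA listing would miss files that sra-tools can produce.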

@adamrtalbot
Contributor

So it seems like the ENA API is wrong and we should be avoiding it. We could flip the logic to be --force_ena_download?

@drpatelh
Member Author

> So it seems like the ENA API is wrong and we should be avoiding it. We could flip the logic to be --force_ena_download?

Well, in most cases, it's actually fine. The problem with flipping this is that you now start battering storage with .sra files on top of the FastQ files you already need to download.
