Indexing the ENA, DRA, and other databases? #2

Open
GeoMicroSoares opened this issue Feb 8, 2023 · 2 comments

@GeoMicroSoares

Hey there - I was just wondering whether you have plans to index other databases of metagenomic reads, as I feel our lab's meta-analysis efforts have definitely been biased towards the SRA. We've already had really cool results using this tool, so again, congratulations on developing it!!

@luizirber
Member

luizirber commented Feb 8, 2023

ENA/DRA/SRA are mirrored, so they are already available (pending their own sync, and me downloading and indexing the datasets 😂 )

Examples:
DRA: https://wort.sourmash.bio/view/sra/DRR013902/
ERA: https://wort.sourmash.bio/view/sra/ERR2286070/
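A minimal sketch of how the two example links above could be built from a run accession. The URL pattern is inferred only from those two examples, and `describe_run` is a hypothetical helper, not part of wort or sourmash. The prefix table relies on the standard INSDC run-accession prefixes (SRR, ERR, DRR).

```python
import re

# Run-accession prefixes of the three mirrored INSDC read archives.
ARCHIVE_BY_PREFIX = {
    "SRR": "SRA (NCBI)",
    "ERR": "ENA (EMBL-EBI)",
    "DRR": "DRA (DDBJ)",
}

# Assumption: pattern inferred from the two example links, not from wort docs.
WORT_SRA_VIEW = "https://wort.sourmash.bio/view/sra/{accession}/"


def describe_run(accession):
    """Return (source archive, wort view URL) for a run accession."""
    accession = accession.strip().upper()
    match = re.fullmatch(r"([SED]RR)\d+", accession)
    if match is None:
        raise ValueError(f"not a run accession: {accession!r}")
    archive = ARCHIVE_BY_PREFIX[match.group(1)]
    return archive, WORT_SRA_VIEW.format(accession=accession)


if __name__ == "__main__":
    for acc in ("DRR013902", "ERR2286070"):
        print(*describe_run(acc))
```

All three prefixes map to the same `/view/sra/` path here, which matches the point that the archives are mirrored.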

I'll start another mastiff index update this week; I just finished processing all the metagenomes added since the last index update in August.


As for other data sources, I have some datasets from IMG (from an old collaboration) and NCBI assemblies (GenBank and RefSeq), but they are not indexed with mastiff. I update the NCBI assemblies weekly in wort, but for JGI I need to figure out how to download the data.

My 'barrier' for adding new data sources is: how easy is it to grab metadata for what was updated recently, and how do I download data from it? I figured that out for SRA and NCBI assemblies, but would love suggestions/pointers for how to do it in other databases too =]

Bonus points if they don't need extra software for downloading the data (I ended up installing sra-toolkit for SRA downloads, but NCBI assemblies is just a curl piped into sourmash: https://github.com/sourmash-bio/wort/blob/19ff5e32d8f8f1087ca8186877c71f2d0f13657b/wort/blueprints/compute/tasks.py)

Extra bonus points if they can stream the data, instead of forcing a local copy before I can sketch it =]
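A rough sketch of what "a curl piped into sourmash" could look like when driven from Python, so that nothing is written to disk before sketching. This is not the code in the linked tasks.py. The function names, the sketch parameters (`k=21,k=31,k=51,scaled=1000,abund`) and the use of `-` to make sourmash read from stdin are assumptions to check against the installed sourmash version.

```python
import shlex
import subprocess


def build_stream_commands(url, name, output):
    """Return the (download, sketch) argument lists for a streaming pipeline."""
    # curl writes the downloaded bytes to stdout; -L follows redirects.
    download = ["curl", "-sL", url]
    # Assumption: "-" tells sourmash to read sequences from stdin.
    sketch = [
        "sourmash", "sketch", "dna",
        "-p", "k=21,k=31,k=51,scaled=1000,abund",  # illustrative parameters
        "--name", name,
        "-o", output,
        "-",
    ]
    return download, sketch


def as_shell(url, name, output):
    """Render the same pipeline as a shell one-liner, for logging."""
    download, sketch = build_stream_commands(url, name, output)
    return shlex.join(download) + " | " + shlex.join(sketch)


def stream_sketch(url, name, output):
    """Run the pipeline; return 0 only if both processes succeeded."""
    download, sketch = build_stream_commands(url, name, output)
    with subprocess.Popen(download, stdout=subprocess.PIPE) as curl:
        # sourmash consumes curl's stdout directly, with no local copy.
        result = subprocess.run(sketch, stdin=curl.stdout)
    # Leaving the "with" block waits for curl, so its returncode is set.
    return result.returncode or curl.returncode
```

Building the argument lists separately from running them keeps the pipeline easy to inspect, e.g. `print(as_shell(url, "demo", "demo.sig"))`, before pointing it at a real data source.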

@GeoMicroSoares
Author

Hi again - thanks for the answer and for the heads-up on the database update! I'll be on the lookout for that and will rerun our analyses to see if there are any relevant new findings with the update. Cheers!
