Changed GTDB metadata naming and format #19

Open
apduncan opened this issue Feb 27, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@apduncan

I was attempting to map from NCBI to GTDB taxonomy, but when building the translation, multitax was unable to download the GTDB metadata:

from multitax import GtdbTx, NcbiTx

ncbi = NcbiTx()
gtdb = GtdbTx()

ncbi.build_translation(gtdb)

Exception: One or more files could not be downloaded: https://data.gtdb.ecogenomic.org/releases/latest/ar53_metadata.tar.gz, https://data.gtdb.ecogenomic.org/releases/latest/bac120_metadata.tar.gz

For r214.1, the metadata is no longer a tarball; it appears to be a gzipped TSV: bac120_metadata.tsv.gz and ar53_metadata.tsv.gz. It looks like this would need different handling in build_translation, as well as in the code that extracts tar members.
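
To illustrate the kind of handling a gzipped TSV would need, here is a minimal sketch using only the Python standard library. It is not multitax code: the helper name `iter_metadata_rows` is hypothetical, the column names (`accession`, `gtdb_taxonomy`, `ncbi_taxid`) are assumed to match the GTDB metadata header, and the sample row is made up.

```python
import csv
import gzip
import io


def iter_metadata_rows(gz_bytes, columns=("accession", "gtdb_taxonomy", "ncbi_taxid")):
    """Yield the selected columns from a gzipped TSV held in memory.

    Hypothetical helper; the default column names are an assumption
    about the GTDB metadata header.
    """
    # gzip.open accepts an existing file object and can decode to text.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps stray quote characters in fields from confusing the parser.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield tuple(row[name] for name in columns)


# Tiny in-memory stand-in for bac120_metadata.tsv.gz (made-up values).
sample = gzip.compress(
    b"accession\tgtdb_taxonomy\tncbi_taxid\n"
    b"RS_GCF_000000001.1\td__Bacteria;p__Example\t12345\n"
)
rows = list(iter_metadata_rows(sample))
```

With a plain gzip file there are no tar members to extract, so the whole file is read as one table.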

I'd be happy to put together a pull request to fix, if you're interested.

@pirovc added the bug label Feb 27, 2024
@pirovc
Owner

pirovc commented Feb 27, 2024

Thanks for reporting. Indeed, they changed it a while ago. A PR would be great! You would have to update the URLs and the parsing procedure, and the download_files function should be generalized to handle the gzip-only files. Some days ago I fixed this exact bug in another tool; you can use it as an example.
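
A rough sketch of what handling both archive types might look like, assuming the type can be decided from the file name in the URL. The function `read_text_members` is a hypothetical name, not the existing download_files implementation, and it holds everything in memory, which may not suit the full-size metadata files.

```python
import gzip
import io
import tarfile


def read_text_members(name, data):
    """Return {member_name: text} for a downloaded .tar.gz or plain .gz file.

    Hypothetical helper: `name` is the file name taken from the URL,
    `data` is the raw downloaded bytes.
    """
    if name.endswith(".tar.gz"):
        members = {}
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
            for info in tar.getmembers():
                if info.isfile():
                    members[info.name] = tar.extractfile(info).read().decode("utf-8")
        return members
    if name.endswith(".gz"):
        # A gzip-only file has a single payload; name it after the file minus ".gz".
        return {name[: -len(".gz")]: gzip.decompress(data).decode("utf-8")}
    raise ValueError(f"Unsupported file type: {name}")
```

Checking `.tar.gz` before `.gz` matters, since every tarball name also ends in `.gz`.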

@apduncan
Author

Okay great, will take a look!
