Create human-readable taxonomy lookup table from precomputed database #266

Open
cvigilv opened this issue Apr 23, 2024 · 3 comments
cvigilv commented Apr 23, 2024

I'm currently using Foldseek to prepare some datasets, and I would like to check whether the taxonomic information in the Alphafold/Proteome database matches what I obtained from the AlphaFold FTP server.

Is there any way to convert the binary _taxonomy file into a tab-separated values (TSV) file?

cvigilv changed the title from "Create human-readable taxonomy database from database" to "Create human-readable taxonomy lookup table from precomputed database" on Apr 23, 2024
@milot-mirdita
Member

The easiest workaround for this is probably to slightly abuse addtaxonomy:

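# Download Swiss-Prot as an MMseqs2 database named sprot
mmseqs databases UniProtKB/Swiss-Prot sprot tmp
# Run addtaxonomy on the header database (sprot_h) and force a single merged output file
MMSEQS_FORCE_MERGE=1 mmseqs addtaxonomy sprot sprot_h out
# Strip the NUL bytes so the result is plain text (tr reads stdin, hence the "<")
tr -d '\000' < out > sprot_headers_with_taxonomy.tsv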

Adding a module that exports the nodes/names taxonomy dmp files would also be possible, but that would need to come from an external contribution, as I don't currently have time to implement it.

@milot-mirdita
Member

This works the same way for Foldseek: just use a Foldseek database and the foldseek binary instead of mmseqs.
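
For reference, a minimal sketch of what the Foldseek variant might look like. It is untested and simply swaps the binary and database into the same three steps. The function name export_taxonomy_tsv, the afdb prefix and the output file names are made up for illustration. Alphafold/Proteome is the database mentioned in the question.

```shell
# Sketch only: wraps the three workaround steps in a helper function.
# export_taxonomy_tsv and all file names are hypothetical, not part of Foldseek.
export_taxonomy_tsv() {
    db_name=$1   # name passed to "foldseek databases", e.g. Alphafold/Proteome
    prefix=$2    # local database prefix to create, e.g. afdb
    tmp_dir=$3   # scratch directory for the download

    # Download the precomputed database.
    foldseek databases "$db_name" "$prefix" "$tmp_dir" &&
    # Run addtaxonomy on the header database, forcing one merged output file.
    MMSEQS_FORCE_MERGE=1 foldseek addtaxonomy "$prefix" "${prefix}_h" "${prefix}_tax" &&
    # Strip NUL bytes so the output is plain text.
    tr -d '\000' < "${prefix}_tax" > "${prefix}_headers_with_taxonomy.tsv"
}

# Example call (commented out because it downloads a large database):
# export_taxonomy_tsv Alphafold/Proteome afdb tmp
```

The resulting .tsv can then be compared against the table obtained from the AlphaFold FTP server with standard text tools.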

@cvigilv
Author

cvigilv commented May 7, 2024

Thanks! I will give it a test and come back if I encounter any problems.
