taxids created with create-taxdump skip numbers #59

Open
apcamargo opened this issue May 20, 2022 · 2 comments

Comments

@apcamargo
Contributor

apcamargo commented May 20, 2022

When you create a taxdump using create-taxdump (ICTV taxonomy, for example), the taxids "skip" some numbers. For example:

$ head ictv-taxdump/names.dmp
1	|	root	|		|	scientific name	|
287205	|	Hoswirudivirus MRV1	|		|	scientific name	|
287935	|	Shomudavirus limadaptatum	|		|	scientific name	|
1096518	|	Sclerotimonavirus betaclarireediae	|		|	scientific name	|
1138752	|	Potato virus H	|		|	scientific name	|
1536674	|	Rhopapillomavirus 1	|		|	scientific name	|
1845995	|	Monomorium pharaonis virus 1	|		|	scientific name	|
1890985	|	Aquamavirus A	|		|	scientific name	|
2079526	|	Hylipavirus	|		|	scientific name	|
2290567	|	Fattrevirus	|		|	scientific name	|

This is not a problem in itself, since the nodes are still connected. However, it causes a bug when you try to create an MMseqs2 taxonomy database from the custom taxonomy, because MMseqs2 apparently assumes that no numbers are skipped (unless they are listed in delnodes.dmp or merged.dmp, I guess).

I wrote a script that remaps the taxids so that no number is skipped, and that solved the issue.

$ head ictv-taxdump/names.dmp
1	|	root	|		|	scientific name	|
2	|	Hoswirudivirus MRV1	|		|	scientific name	|
3	|	Shomudavirus limadaptatum	|		|	scientific name	|
4	|	Sclerotimonavirus betaclarireediae	|		|	scientific name	|
5	|	Potato virus H	|		|	scientific name	|
6	|	Rhopapillomavirus 1	|		|	scientific name	|
7	|	Monomorium pharaonis virus 1	|		|	scientific name	|
8	|	Aquamavirus A	|		|	scientific name	|
9	|	Hylipavirus	|		|	scientific name	|
10	|	Fattrevirus	|		|	scientific name	|
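
A minimal sketch of this kind of remapping, for anyone who wants to see the idea in code. It is not the script mentioned in this thread; the function names (`split_dmp`, `join_dmp`, `build_mapping`, `renumber`) are hypothetical, and it assumes the usual taxdump layout: fields separated by tab-pipe-tab, the taxid in column 0 of both names.dmp and nodes.dmp, and the parent taxid in column 1 of nodes.dmp.

```python
def split_dmp(line):
    """Split one taxdump line into its fields (assumed separator: tab, pipe, tab)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the trailing tab + pipe terminator
    return line.split("\t|\t")


def join_dmp(fields):
    """Inverse of split_dmp: rebuild a taxdump line from its fields."""
    return "\t|\t".join(fields) + "\t|\n"


def build_mapping(node_lines):
    """Map every taxid in nodes.dmp to a consecutive number starting at 1.

    Sorting keeps the root (taxid 1, the smallest id) mapped to 1.
    """
    old_ids = sorted({int(split_dmp(l)[0]) for l in node_lines if l.strip()})
    return {old: new for new, old in enumerate(old_ids, start=1)}


def renumber(lines, mapping, id_columns):
    """Rewrite the taxid columns of each line using the mapping."""
    out = []
    for line in lines:
        if not line.strip():
            continue
        fields = split_dmp(line)
        for col in id_columns:
            fields[col] = str(mapping[int(fields[col])])
        out.append(join_dmp(fields))
    return out


# Tiny in-memory example (made-up nodes, names taken from the listing above).
nodes = [
    "1\t|\t1\t|\tno rank\t|\n",
    "287205\t|\t2079526\t|\tspecies\t|\n",
    "2079526\t|\t1\t|\tgenus\t|\n",
]
names = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "287205\t|\tHoswirudivirus MRV1\t|\t\t|\tscientific name\t|\n",
    "2079526\t|\tHylipavirus\t|\t\t|\tscientific name\t|\n",
]

mapping = build_mapping(nodes)
new_nodes = renumber(nodes, mapping, id_columns=(0, 1))  # taxid and parent taxid
new_names = renumber(names, mapping, id_columns=(0,))    # taxid only
print("".join(new_names), end="")
```

Any other file that refers to the old taxids (for example a sequence-to-taxid mapping, merged.dmp, or delnodes.dmp) would need the same mapping applied, so it is worth keeping the `mapping` dictionary around.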

This is not a TaxonKit bug in any way. But because MMseqs2 is pretty popular, I thought it best to report it here in case anyone else runs into the same issue.

@shenwei356
Owner

Yes, NCBI taxonomy uses consecutive numbers too. I guess they have a mapping table to maintain these relationships.

@apcamargo
Contributor Author

For reference, this is the script I used to make the taxids sequential: https://github.com/apcamargo/ictv-mmseqs2-protein-database/blob/master/fix_taxdump.py
