Add logic to generate taxonomy4blast.sqlite3 database #158

Open
dhoogest opened this issue Mar 4, 2024 · 4 comments

@dhoogest
Collaborator

dhoogest commented Mar 4, 2024

In the BLAST+ manual, there are notes about support for extending the -taxids and -negative_taxids CLI options to allow filtering by non-leaf tax nodes. Support for this functionality appears to require a file called taxonomy4blast.sqlite3 alongside the BLAST database binaries. It would be great if taxtastic could be extended to use its logic for defining taxonomic lineages to export a file of this shape, in order to facilitate builds of custom databases with bespoke taxonomies.
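To make the request concrete, here is a rough Python sketch of what such an export might look like. The table and column names (`TaxidInfo`, `taxid`, `parent`) and the index are assumptions about the layout BLAST+ expects, not something confirmed in this thread, so they should be checked against an NCBI-provided taxonomy4blast.sqlite3 before relying on them. The function names and the toy tax_ids in the usage note are hypothetical.

```python
import sqlite3


def write_taxid_table(con, parents):
    """Write a taxid -> parent table into an open SQLite connection.

    `parents` maps each integer tax_id to its parent's tax_id (the root
    points at itself). The schema below is an ASSUMPTION about what
    BLAST+ reads; verify it against a file distributed by NCBI.
    """
    with con:  # commit on success, roll back on error
        con.execute(
            "CREATE TABLE TaxidInfo (taxid INTEGER PRIMARY KEY, parent INTEGER)"
        )
        con.executemany(
            "INSERT INTO TaxidInfo (taxid, parent) VALUES (?, ?)",
            sorted(parents.items()),
        )
        # Index on parent so that child lookups do not scan the table.
        con.execute("CREATE INDEX TaxidInfo_parent ON TaxidInfo (parent, taxid)")


def descendants(con, taxid):
    """Return `taxid` and every node below it, sorted ascending.

    Uses a recursive CTE; UNION (not UNION ALL) drops duplicates, so a
    root that is its own parent does not cause an endless loop.
    """
    rows = con.execute(
        """
        WITH RECURSIVE sub(taxid) AS (
            SELECT ?
            UNION
            SELECT t.taxid FROM TaxidInfo t JOIN sub s ON t.parent = s.taxid
        )
        SELECT taxid FROM sub ORDER BY taxid
        """,
        (taxid,),
    ).fetchall()
    return [row[0] for row in rows]
```

For a real build the connection would be something like `sqlite3.connect("taxonomy4blast.sqlite3")`, with the file written next to the BLAST database files. With a toy tree such as `{1: 1, 10: 1, 11: 10, 12: 10, 20: 1}`, `descendants(con, 10)` collects the non-leaf node 10 together with its leaves 11 and 12, which is the expansion that filtering by a non-leaf tax node relies on.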

/cc @nhoffman @crosenth

@crosenth crosenth self-assigned this Mar 4, 2024
@crosenth
Member

For now, string-based bespoke taxonomy ids such as "UW###" will not work:

% makeblastdb -dbtype nucl -in seqs.fa -out blast -parse_seqids -taxid_map seqmap.txt

Building a new DB, current time: 03/13/2024 15:20:54
New DB name:   /blast
New DB title:  output/seqs.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Error: NCBI C++ Exception:
    T0 "/home/coremake/release_build/build/PrepareRelease_Linux64-Centos_JSID_01_880026_130.14.18.128_9008__PrepareRelease_Linux64-Centos_1697736677/c++/compilers/unix/../../src/corelib/ncbistr.cpp", line 862: Error: (CStringException::eConvert) ncbi::NStr::StringToInt8() - Cannot convert string 'UW123' to Int8 (m_Pos = 0)

But I will email NCBI to argue that tax_ids are not Int8s and see whether they will change their dtype rule.
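In the meantime, a small pre-flight check can catch these entries before makeblastdb aborts. This is only a sketch: it assumes the -taxid_map file holds one whitespace-separated `seqid taxid` pair per line, and the function name is hypothetical. It tests for plain ASCII digits, which approximates but may not exactly match BLAST's own string-to-integer conversion.

```python
def non_integer_taxids(lines):
    """Yield (line_number, seqid, taxid) for taxids that are not plain integers.

    `lines` is any iterable of strings, e.g. an open taxid_map file.
    Lines with fewer than two fields are skipped rather than reported.
    """
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line; out of scope for this check
        seqid, taxid = fields[0], fields[1]
        # ASCII digits only: rejects 'UW123', and also signs or underscores
        # that Python's int() would otherwise accept.
        if not (taxid.isascii() and taxid.isdigit()):
            yield number, seqid, taxid
```

Run over the seqmap.txt from the command above, e.g. `with open("seqmap.txt") as f: bad = list(non_integer_taxids(f))`, this would list every sequence mapped to an id such as UW123.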

@crosenth
Member

Hi,
 
Thanks for following up.
 
Unfortunately, this operation is based on NCBI taxonomy database and how species are ID'ed. It also involves our internal workflow, so we cannot entertain this request.
 
Your understanding over this will be appreciated.
 
Regards,
 
Tao Tao, PhD
NCBI User Services
https://go.usa.gov/x647S
------------------- Original Message -------------------
From: Chris Rosenthal <crosenth@uw.edu>;
Received: Thu Mar 14 2024 13:49:48 GMT-0400 (Eastern Daylight Time)
To: nlm-support@nlm.nih.gov <nlm-support@nlm.nih.gov>; NLM Support <nlm-support@nlm.nih.gov>; Triage Team <nlm-support@nlm.nih.gov>;
Subject: [EXTERNAL] Re: case #CAS-1281624-K2J1F7: makeblastdb requires tax_ids to be Int8 TRACKING:000412000016044

Hi Tao Tao,
 
Can I make a request for BLAST+ tools to support non-numeric, string taxonomy identifiers?  This would allow users to use BLAST tools with custom taxonomy identifiers like 'UW123'.  Please consider that taxonomy identifiers are used for identification purposes and are not meant to reflect numerical values, such as age, weight or quantity.
 
Thanks

@dhoogest
Collaborator Author

Snappy response, if a bummer. I think there are still use cases for this taxtastic functionality, for databases that contain only NCBI taxa (such as ya16sdb).

@crosenth
Member

I suspect they want tax_id to stay numeric for db performance reasons, rather than adding another unique indexing column to their internal db schema.
