Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

taxonkit equivalent of diamond blastx plus megan blast2lca #86

Open
avilella opened this issue Aug 21, 2023 · 3 comments
Open

taxonkit equivalent of diamond blastx plus megan blast2lca #86

avilella opened this issue Aug 21, 2023 · 3 comments

Comments

@avilella
Copy link

Hi,

I've followed the methods section in this paper:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1299-7

To generate diamond blastx hits from short-read WGS data, and I wanted an equivalent to megan/tools/blast2lca which it seems taxonkit could be that equivalent:

diamond blastx -d/path/to/NCBI_nr/nr -q sample_name.fasta -a sample_name -p 16

MEGAN (v5.10.6) (obtained as described above) was used for read-level taxonomic classification in non-interactive mode:

megan/tools/blast2lca --input sample_name.daa --format BlastTAB --topPercent 10 --gi2taxa megan/GI_Tax_mapping/gi_taxid-March2015X.bin --output sample_name.read_assignments.txt

I have a sample_name.daa, and I wonder if there is a "preferred way" to end up with a list of lca's and the percentage of reads for it.

At the moment, I am doing a table join of the protein hits from diamond view with prot.accession2taxid.gz, and then I guess I can run taxonkit lca on that list of taxids? thx in advance

diamond view --daa AB2_1_A03_002_S5.daa | head -n 100000 > tmp
csvtk join -t -f "2;1" tmp /bfx_nfs/ncbi/nr/prot.accession2taxid.gz > tmp.tax 
taxonkit lca tmp.tax # ?
@shenwei356
Copy link
Owner

Joining with prot.accession2taxid.gz is not a good idea, cause it's so big. Searching it with the 2nd column of tmp first could be a better way.

taxonkit lca can take input of taxids in each line (example).

@avilella
Copy link
Author

avilella commented Aug 21, 2023

I am running this on a machine with 376G of RAM, if there was a guarantee that the join wasn't going to go over it, given the usual size of prot.accession2taxid.gz, that would be fine by me. It seems to be relatively stable at 226G RAM usage on the table join above...

I couldn't make the diamond makedb below work for the nr database as a way of adding the taxids in the diamond blastx output, so I am stuck with finding an efficient way of getting tax ids from the output which look like the entries below:

diamond --version
diamond version 2.1.8

diamond makedb --in nr.gz -d diamondDB/ --taxonmap prot.accession2taxid.gz --taxonnodes taxdmp.zip
VH01324:45:AAC73CJHV:1:1101:20238:1000  KRX29278        79.3    29      6       0       92      6       1       29      1.28e-06        50.4
VH01324:45:AAC73CJHV:1:1101:20655:1000  SFW06984        87.8    49      6       0       149     3       13      61      2.08e-20        90.5
VH01324:45:AAC73CJHV:1:1101:20655:1000  EAA20390        83.7    49      8       0       150     4       41      89      1.13e-16        81.3
VH01324:45:AAC73CJHV:1:1101:25503:1000  EDL29934        85.7    42      6       0       151     26      21      62      1.46e-17        79.0
VH01324:45:AAC73CJHV:1:1101:25503:1000  KRY62511        77.8    45      10      0       151     17      5       49      3.49e-17        77.4
VH01324:45:AAC73CJHV:1:1101:25503:1000  EDL25189        88.4    43      3       1       151     29      34      76      7.52e-17        77.4
VH01324:45:AAC73CJHV:1:1101:25503:1000  XP_006517017    78.7    47      10      0       148     8       178     224     3.12e-16        79.7
VH01324:45:AAC73CJHV:1:1101:25503:1000  XP_006517014    78.7    47      10      0       148     8       286     332     1.04e-15        79.7
VH01324:45:AAC73CJHV:1:1101:25503:1000  EDL34859        83.7    43      7       0       151     23      64      106     3.22e-15        75.1
VH01324:45:AAC73CJHV:1:1101:25503:1000  EEQ62534        83.7    43      7       0       151     23      30      72      4.56e-15        72.8

any ideas would be welcomed.

@shenwei356
Copy link
Owner

# extract taxids of hits
zcat prot.accession2taxid.gz \
    | csvtk grep -Ht -f 1 -P <(cut -f 2 data.tsv) \
    | cut -f 1,3 \
    > hit2taxid.tsv

# join and collect taxids for each query, and then  compute LCA
csvtk join -Ht -f '2;1' <(cut -f 1,2 data.tsv) hit2taxid.tsv \
    | csvtk fold -Ht -f 1 -v 3 -s ';' \
    | taxonkit lca -i 2 -s ';' \
    > query2lca.tsv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants