Feature request : add taxon id for each blast hit #7

Open
cparsania opened this issue Nov 19, 2019 · 3 comments

@cparsania

The default BLAST tabular output (outfmt 7) does not include a taxon ID for each hit. Taxon IDs are very important for downstream phylogenetic analysis. An indirect approach is to run blastdbcmd with the %T format specifier once the results are obtained. This is very time-consuming, because you first have to fetch the taxon IDs and then map them back to the original BLAST results. Could metablastr provide a function that maps taxon IDs to the BLAST output?

@HajkD
Member

HajkD commented Nov 20, 2019

Hi @cparsania

Many thanks for contacting me and I very much appreciate your feedback.

Could you be more specific about where the taxon ID information is missing?
Is it when BLASTing against e.g. NCBI nr, or when using metablastr::blast_genomes()?
In any other scenario, the scientific name of the species is given when BLASTing against a genome.

I will then see what I can do.

Many thanks,
Hajk

@cparsania
Author

Yes, you are right. BLAST gives the subject's scientific name but not its taxon ID. The taxon ID is required, for example, if you want to assign a specific taxonomic rank (e.g. family, class, genus, kingdom, superkingdom) to a given species.

After I raised this issue, I found the R package taxize, which actually solves the problem. It has a function, taxize::genbank2uid(), that returns the NCBI taxonomy ID for a given GenBank ID.

Below is the wrapper function I wrote, which just reformats the output of taxize::genbank2uid() and returns it as a tbl:

#' Wrapper function around taxize::genbank2uid.
#'
#' Given a genBank accession alphanumeric string, or a gi numeric string \code{(x)}, it returns tibble of taxid, name and other columns.
#' @param x vector of genBank accession alphanumeric string, or a gi numeric string \code{(x)}.
#' @param ... other parameters to be passed to \code{taxize::genbank2uid}
#'
#' @return a tbl with colnames x, taxid, class, match, multiple_matches, pattern_match, uri, name
#' @export
#' @importFrom taxize genbank2uid
#' @importFrom tibble tibble
#' @importFrom dplyr bind_cols
#' @importFrom purrr map_df
#' @examples
#' \dontrun{
#' x <- c("XP_022900619.1", "XP_022900618.1", "XP_018333511.1", "XP_018573075.1")
#' genbank2uid_tbl(x = x)
#' }
genbank2uid_tbl <- function(x, ...) {

        # Look up the NCBI taxonomy ids for the given accessions
        uid_list <- taxize::genbank2uid(x, ...)
        # Combine the ids with the attributes attached to each result
        uid_tbl <- dplyr::bind_cols(
                tibble::tibble(x = x, taxid = unlist(uid_list)),
                purrr::map_df(uid_list, attributes)
        )
        return(uid_tbl)

}
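
The remaining step, mapping the taxon IDs back onto the BLAST hits, could be sketched roughly as below. This is only an illustration under stated assumptions: the `blast_hits` table and its column names are hypothetical (not metablastr's actual output), and `taxid_tbl` is a hand-made stand-in for the result of `genbank2uid_tbl()` with placeholder taxid values, so the example runs without querying NCBI.

```r
# Hypothetical BLAST hits; column names are illustrative only.
blast_hits <- tibble::tibble(
  query_id      = c("q1", "q1", "q2"),
  subject_id    = c("XP_022900619.1", "XP_018333511.1", "XP_022900619.1"),
  perc_identity = c(98.2, 75.4, 88.0)
)

# Stand-in for genbank2uid_tbl(unique(blast_hits$subject_id)).
# The taxid values are placeholders, not real NCBI taxonomy ids.
taxid_tbl <- tibble::tibble(
  x     = c("XP_022900619.1", "XP_018333511.1"),
  taxid = c("1001", "1002")
)

# Attach the taxid to every hit by matching subject_id against x.
hits_with_taxid <- dplyr::left_join(
  blast_hits,
  taxid_tbl,
  by = c("subject_id" = "x")
)
```

Looking up only the unique subject accessions and joining afterwards keeps the number of remote lookups down when the same subject appears in many hits.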


@HajkD
Member

HajkD commented Nov 26, 2019

Hi @cparsania

Excellent. I will have a look at how to best integrate this taxonomy information into the BLAST output.

Many thanks,
Hajk
