Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature request] Import Blast tabular with comment lines #88

Open
ConYel opened this issue Jun 5, 2021 · 0 comments
Open

[feature request] Import Blast tabular with comment lines #88

ConYel opened this issue Jun 5, 2021 · 0 comments

Comments

@ConYel
Copy link

ConYel commented Jun 5, 2021

Hello and thank you for this impressive package!
As I'm working with some genomic ranges lately I would like to include some results from BLAST.
At first, I thought that it would be quite easy to include the results as granges but until now I haven't found a function that imports a blast resulting file in tabular with comment lines format.

For the implementation, I used an example (with some changes) from here:
In my case I use docker to run BLAST, you can find all the corresponding code below:

Context

Import BLAST output as Genomic Ranges

Code

Run docker, download database and example, output the result to current directory

docker run --rm -ti -v $(pwd):/home/my_data  ncbi/blast

curl -O https://raw.githubusercontent.com/mhahsler/rBLAST/master/inst/examples/RNA_example.fasta

curl -O https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz

mkdir 16SMicrobialDB

tar -zxvf 16S_ribosomal_RNA.tar.gz  -C 16SMicrobialDB

blastn -db 16SMicrobialDB/16S_ribosomal_RNA -query RNA_example.fasta -out "/home/my_data/results_blast.txt" -outfmt "7 delim=@ qseqid qlen sseqid slen qstart qend sstart send qseq sseq evalue bitscore score length pident nident mismatch positive ppos frames qframe sframe sstrand qcovs qcovhsp qcovus" -ungapped -num_threads 2

exit

Import results, check lengths of Genomic Ranges that are correct and provide a pseudo-function for later development.

suppressPackageStartupMessages({
library(tidyverse)
library(plyranges)
library(vroom)}
)

## find how many fields exist for the column names
blast_file <- file.path("results_blast.txt")

fourth_line_fields <-  blast_file %>% 
  vroom_lines() %>% 
  .[4] %>% 
  str_remove("# Fields: ") %>% 
  str_split(", ") %>% 
  unlist %>% 
  str_replace_all(" |/","_") %>% 
  str_replace("%","percent") %>% 
  str_remove("\\.")
  
## create basic columns for the creation of a GR object
blast_sequences <- blast_file %>% 
  read_delim(delim = "@", # here it should be made clear that the delim could be something else
             comment = "#", 
             col_names = fourth_line_fields) %>% 
  select(ends_with(c("start","end","seq","length","strand","subject_id"))) %>% 
  mutate(strand = if_else(subject_strand == "minus","-","+") %>% as_factor(),
         start = if_else(strand == "-", s_end, s_start),
         end = if_else(strand == "+", s_start, s_end),
         seqnames = subject_id)

## check if lengths are correct
blast_sequences %>% 
  mutate(qGR_len = abs(q_end-q_start )+1,
         sGR_len = abs(end-start)+1,
         qseq_len= str_length(query_seq), 
         sseq_len = str_length(subject_seq )) %>% 
  mutate(isittrue = qGR_len == sGR_len & qseq_len == sseq_len & qGR_len == sseq_len) %>% 
  count(isittrue)

#transform from tibble to Granges and then to tibble, check if lengths are correct again
blast_sequences %>% 
  mutate(qGR_len = abs(q_end-q_start )+1,
         sGR_len = abs(end-start)+1,
         qseq_len= str_length(query_seq), 
         sseq_len = str_length(subject_seq )) %>% 
  select(seqnames, start, end, strand, qGR_len:sseq_len) %>% 
  as_granges() %>% 
  as_tibble() %>% 
  mutate(isittrue = qGR_len == sGR_len & qseq_len == sseq_len & qGR_len == sseq_len & qGR_len == width ) %>% 
  count(isittrue)

# Function?
read_blast_tab_format <- function(blast_file, delim = "@"){
  fourth_line_fields <- blast_file %>% 
  vroom_lines() %>% 
  .[4] %>% 
  str_remove("# Fields: ") %>% 
  str_split(", ") %>% 
  unlist %>% 
  str_replace_all(" |/","_") %>% 
  str_replace("%","percent") %>% 
  str_remove("\\.")
  
  blast_sequences <- blast_file %>% 
  read_delim(delim = delim, # here it should be made clear that the delim could be something else
             comment = "#", 
             col_names = fourth_line_fields) %>% 
  mutate(strand = if_else(subject_strand == "minus","-","+") %>% as_factor(),
         start = if_else(strand == "-", s_end, s_start),
         end = if_else(strand == "+", s_start, s_end),
         seqnames = subject_id) %>% 
    as_granges()
  
  return(blast_sequences)
}

Small reproducible example

See above :)

R session information

Remember to include your full R session information.

Session info -------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 4.0.3 (2020-10-10)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United Kingdom.1252 
 ctype    English_United Kingdom.1252 
 tz       Europe/Berlin               
 date     2021-06-05                  

- Packages -----------------------------------------------------------------------------------------------------------
 package              * version  date       lib source        
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.0.0)
 backports              1.2.0    2020-11-02 [1] CRAN (R 4.0.3)
 Biobase                2.48.0   2020-04-27 [1] Bioconductor  
 BiocGenerics         * 0.34.0   2020-04-27 [1] Bioconductor  
 BiocParallel           1.22.0   2020-04-27 [1] Bioconductor  
 Biostrings             2.56.0   2020-04-27 [1] Bioconductor  
 bit                    4.0.4    2020-08-04 [1] CRAN (R 4.0.3)
 bit64                  4.0.5    2020-08-30 [1] CRAN (R 4.0.3)
 bitops                 1.0-6    2013-08-17 [1] CRAN (R 4.0.0)
 broom                  0.7.4    2021-01-29 [1] CRAN (R 4.0.3)
 cellranger             1.1.0    2016-07-27 [1] CRAN (R 4.0.0)
 cli                    2.3.0    2021-01-31 [1] CRAN (R 4.0.3)
 clipr                  0.7.1    2020-10-08 [1] CRAN (R 4.0.3)
 colorspace             2.0-0    2020-11-11 [1] CRAN (R 4.0.3)
 crayon                 1.4.0    2021-01-30 [1] CRAN (R 4.0.3)
 DBI                    1.1.1    2021-01-15 [1] CRAN (R 4.0.3)
 dbplyr                 2.0.0    2020-11-03 [1] CRAN (R 4.0.3)
 DelayedArray           0.14.1   2020-07-15 [1] Bioconductor  
 dplyr                * 1.0.3    2021-01-15 [1] CRAN (R 4.0.3)
 ellipsis               0.3.1    2020-05-15 [1] CRAN (R 4.0.0)
 fansi                  0.4.2    2021-01-15 [1] CRAN (R 4.0.3)
 forcats              * 0.5.1    2021-01-27 [1] CRAN (R 4.0.3)
 fs                     1.5.0    2020-07-31 [1] CRAN (R 4.0.3)
 generics               0.1.0    2020-10-31 [1] CRAN (R 4.0.3)
 GenomeInfoDb         * 1.24.2   2020-06-15 [1] Bioconductor  
 GenomeInfoDbData       1.2.3    2020-06-24 [1] Bioconductor  
 GenomicAlignments      1.24.0   2020-04-27 [1] Bioconductor  
 GenomicRanges        * 1.40.0   2020-04-27 [1] Bioconductor  
 ggplot2              * 3.3.3    2020-12-30 [1] CRAN (R 4.0.3)
 glue                   1.4.2    2020-08-27 [1] CRAN (R 4.0.3)
 gtable                 0.3.0    2019-03-25 [1] CRAN (R 4.0.0)
 haven                  2.3.1    2020-06-01 [1] CRAN (R 4.0.0)
 hms                    1.0.0    2021-01-13 [1] CRAN (R 4.0.3)
 httr                   1.4.2    2020-07-20 [1] CRAN (R 4.0.3)
 IRanges              * 2.22.2   2020-05-21 [1] Bioconductor  
 jsonlite               1.7.2    2020-12-09 [1] CRAN (R 4.0.3)
 knitr                  1.31     2021-01-27 [1] CRAN (R 4.0.3)
 lattice                0.20-41  2020-04-02 [2] CRAN (R 4.0.3)
 lifecycle              0.2.0    2020-03-06 [1] CRAN (R 4.0.0)
 lubridate              1.7.9.2  2020-11-13 [1] CRAN (R 4.0.3)
 magrittr               2.0.1    2020-11-17 [1] CRAN (R 4.0.3)
 Matrix                 1.2-18   2019-11-27 [1] CRAN (R 4.0.3)
 matrixStats            0.58.0   2021-01-29 [1] CRAN (R 4.0.3)
 modelr                 0.1.8    2020-05-19 [1] CRAN (R 4.0.0)
 munsell                0.5.0    2018-06-12 [1] CRAN (R 4.0.0)
 pillar                 1.4.7    2020-11-20 [1] CRAN (R 4.0.3)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.0.0)
 plyranges            * 1.8.0    2020-04-27 [1] Bioconductor  
 purrr                * 0.3.4    2020-04-17 [1] CRAN (R 4.0.0)
 R6                     2.5.0    2020-10-28 [1] CRAN (R 4.0.3)
 Rcpp                   1.0.6    2021-01-15 [1] CRAN (R 4.0.3)
 RCurl                  1.98-1.2 2020-04-18 [1] CRAN (R 4.0.0)
 readr                * 1.4.0    2020-10-05 [1] CRAN (R 4.0.3)
 readxl                 1.3.1    2019-03-13 [1] CRAN (R 4.0.0)
 reprex                 1.0.0    2021-01-27 [1] CRAN (R 4.0.3)
 rlang                  0.4.10   2020-12-30 [1] CRAN (R 4.0.3)
 Rsamtools              2.4.0    2020-04-27 [1] Bioconductor  
 rstudioapi             0.13     2020-11-12 [1] CRAN (R 4.0.3)
 rtracklayer            1.48.0   2020-04-27 [1] Bioconductor  
 rvest                  0.3.6    2020-07-25 [1] CRAN (R 4.0.3)
 S4Vectors            * 0.26.1   2020-05-16 [1] Bioconductor  
 scales                 1.1.1    2020-05-11 [1] CRAN (R 4.0.0)
 sessioninfo            1.1.1    2018-11-05 [1] CRAN (R 4.0.0)
 stringi                1.5.3    2020-09-09 [1] CRAN (R 4.0.3)
 stringr              * 1.4.0    2019-02-10 [1] CRAN (R 4.0.0)
 SummarizedExperiment   1.18.2   2020-07-09 [1] Bioconductor  
 tibble               * 3.0.5    2021-01-15 [1] CRAN (R 4.0.3)
 tidyr                * 1.1.2    2020-08-27 [1] CRAN (R 4.0.3)
 tidyselect             1.1.0    2020-05-11 [1] CRAN (R 4.0.0)
 tidyverse            * 1.3.0    2019-11-21 [1] CRAN (R 4.0.3)
 utf8                   1.1.4    2018-05-24 [1] CRAN (R 4.0.0)
 vctrs                  0.3.6    2020-12-17 [1] CRAN (R 4.0.3)
 vroom                * 1.3.2    2020-09-30 [1] CRAN (R 4.0.3)
 withr                  2.4.1    2021-01-26 [1] CRAN (R 4.0.3)
 xfun                   0.20     2021-01-06 [1] CRAN (R 4.0.3)
 XML                    3.99-0.5 2020-07-23 [1] CRAN (R 4.0.3)
 xml2                   1.3.2    2020-04-23 [1] CRAN (R 4.0.0)
 XVector                0.28.0   2020-04-27 [1] Bioconductor  
 zlibbioc               1.34.0   2020-04-27 [1] Bioconductor 

Cheers! 🍻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant