Skip to content

ramadatta/AnnVis

Repository files navigation

title author date output
Prokka Annotation Visualization
Prakki Sai Rama Sridatta
5/4/2020
html_document
keep_md
true

Tutorial to visualize prokka output using gggenes package

Let us see how to visualize prokka data. Let's have fun.

I hope you already ran prokka on your genomes using following commands

% prokka genome.fasta --outdir prokka_out --prefix genome

If you have multiple genomes, you can loop prokka:

% for d in $(ls *.fasta);
  do 
  echo $d; prokka "$d" --outdir "$d"_prokka_out --prefix "$d" --centre X --compliant;
  done

####################----Updating this section on September 8th,2021:START----####################

Working with plasmids

I actually wanted to annotate the plasmids using prokka. But sometime specific gene plasmids are not readily avaible in prokka database. So, we need to turn on the --proteins switch and provide the gbk file of the proteins we are interested to annotate in our plasmids.

Specifically, in my case, I want to annotate the betalactam resistant genes. For this, I use the betalactam from resfinder database available here.

After this, I extract all the NCBI accessions from the fasta file and download the gbk files for each of the betalactam gene like this:

grep '>' beta-lactam.fsa | grep -Po "_[A-Z].*" | sed 's/^_//g' >beta-lactam_accessions.list

time for i in $(cat beta-lactam_accessions.list | sed 's/:.*//g'); # it can be "time for i in CM004561 AY236073 AY034847;" if you know which genes you are dealing with
do 
  echo $i; 
  time curl -s  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${i}&rettype=gb&retmode=txt" >$i.gbk ; 
done

Previously we chose the closest plasmid. So, we combine all the gbk into one file to pass to prokka

cat CP068017.gbk *.gbk >Closest_Plasmid_AND_Carbapenem_ResGenes.gbk

time for d in $(ls *.fa); 
do 
  echo $d; 
  prokka "$d" --outdir "$d"_ClosePlasmid_ResGenes_prokka_out --prefix "$d" --centre X --compliant --proteins Closest_Plasmid_AND_Carbapenem_ResGenes.gbk; 
done

We can now clearly see the difference between the prokka annotation using default setting and prokka annotation using additional gbk files provided;

# Default settings without additional GBK files 

$ grep 'OXA' C1142_4_len_67445_circ_plasmid.fa_prokka_out/C1142_4_len_67445_circ_plasmid.fa.gff
gnl|X|JBPEFOOM_1	Prodigal:2.6	CDS	10356	11231	.	+	0	ID=JBPEFOOM_00011;Parent=JBPEFOOM_00011_gene;eC_number=3.5.2.6;Name=bla_1;gene=bla_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P13661;locus_tag=JBPEFOOM_00011;product=**Beta-lactamase OXA-1**;protein_id=gnl|X|JBPEFOOM_00011
gnl|X|JBPEFOOM_1	Prodigal:2.6	CDS	15650	16447	.	-	0	ID=JBPEFOOM_00017;Parent=JBPEFOOM_00017_gene;eC_number=3.5.2.6;Name=bla_2;gene=bla_2;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P14489;locus_tag=JBPEFOOM_00017;product=**Beta-lactamase OXA-10**;protein_id=gnl|X|JBPEFOOM_00017

# Default settings with additional GBK files 

$ grep 'OXA' C1142_4_len_67445_circ_plasmid.fa_ClosePlasmid_ResGenes_prokka_out/C1142_4_len_67445_circ_plasmid.fa.gff
gnl|X|JBPEFOOM_1	Prodigal:2.6	CDS	10356	11231	.	+	0	ID=JBPEFOOM_00011;Parent=JBPEFOOM_00011_gene;inference=ab initio prediction:Prodigal:2.6;locus_tag=JBPEFOOM_00011;note=**OXA-48**;product=hypothetical protein;protein_id=gnl|X|JBPEFOOM_00011
gnl|X|JBPEFOOM_1	Prodigal:2.6	CDS	15650	16447	.	-	0	ID=JBPEFOOM_00017;Parent=JBPEFOOM_00017_gene;inference=ab initio prediction:Prodigal:2.6;locus_tag=JBPEFOOM_00017;note=**OXA-48**;product=hypothetical protein;protein_id=gnl|X|JBPEFOOM_00017

Now let's make a coordinate file suitable to draw using gggenes package

grep "" *_prokka_out/*.gff | 
fgrep 'CDS' |  
sed -r 's/(note=.*);product=hypothetical protein/\1/g' | 
sed 's/note=/product=/g' | sed 's/;product/	product/' | 
sed 's/ /_/g' | awk '{print $1,$4,$5,$3,$7,$NF}' | 
fgrep -v 'hypothetical_protein' | 
sed  -e 's/.*_prokka_out\///g' | 
sed 's/:/	/g' | 
sed 's/	+	/	1	/g' | # tabs
sed 's/	-	/	-1	/g' | # tabs
sed -e 's/product=//g' -e 's/;protein_id=.*$//g' | 
awk '{$2="";print}' | 
cat -n | 
sed 's/.gff//g' | 
sed -e 's/ \+/ /g' -e 's/^ //g' -e 's/\t/ /g' | 
sed 's/ /,/g' | awk -F ',' '$6 ~ /+/ {print $0",forward"} $6 ~ /-/ {print $0",reverse"}' | 
sed 's/,CDS//g' >coordinates_files.gff

 (echo "No","molecule","start","end","direction","gene","strand" && cat coordinates_files.gff) > filename1 && mv filename1 coordinates_files.gff

grep -A 3 -B 3 'OXA-48,reverse' coordinates_files.gff | sed 's/--//g' | sed '/^$/d' >OXA-48_reverse.gff

$ grep -A 3 -B 3 'OXA-48,reverse' coordinates_files.gff | sed 's/--//g' | sed '/^$/d' >OXA-48_reverse.gff
$ grep -A 3 -B 3 'OXA-48,forward' coordinates_files.gff | sed 's/--//g' | sed '/^$/d' >OXA-48_forward.gff

$  (echo "No","molecule","start","end","direction","gene","strand" && cat OXA-48_forward.gff) > filename1 && mv filename1 OXA-48_forward.gff
$  (echo "No","molecule","start","end","direction","gene","strand" && cat OXA-48_reverse.gff) > filename1 && mv filename1 OXA-48_reverse.gff

####################----Updating this section on September 8th,2021:END----####################

make gggenes suitable format from prokka output

% grep "" */*.gff | fgrep 'CDS' \|
  awk '{print $1,$4,$5,$3,$7,$NF}' \|
  sed 's/;.*//g' \|
  sed 's/.fasta_prokka_out\/genome.*_1//g' \|
  sed -i 's/ + / 1 /g' \|
  sed -i 's/ - / -1 /g' >coordinates_files.gff

EDIT1: grep "" *.gff | fgrep 'CDS' | sed 's/;product/  product/' | sed 's/ /_/g' | awk '{print $1,$4,$5,$3,$7,$NF}' | fgrep -v 'hypothetical_protein' | sed 's/.*.gff://g' | sed 's/ + / 1 /g' | sed 's/ - / -1 /g' >coordinates_files.gff



Lets jump into R from here

setwd("/home/datta/Desktop/CPE/ProkkaVis")
#getwd()
#files()

Let us load the coordinates_files.gff into a dataframe

coords <- read.table("coordinates_files.gff",header = FALSE, sep = " ")
head(coords)
##          V1   V2   V3  V4 V5      V6
## 1 genome_10    1  948 CDS  1 protein
## 2 genome_10 1356 1748 CDS  1 protein
## 3 genome_10 1730 2068 CDS  1    MbeC
## 4 genome_10 2049 3503 CDS  1    MbeA
## 5 genome_10 3570 3686 CDS -1 protein
## 6 genome_10 3785 4066 CDS -1 protein

Let us add the headers for the dataframe

#names(coords) <- c("genome","start","end","source","frame","cds")
names(coords) <- c("molecule","start","end","source","direction","gene")
head(coords)
##    molecule start  end source direction    gene
## 1 genome_10     1  948    CDS         1 protein
## 2 genome_10  1356 1748    CDS         1 protein
## 3 genome_10  1730 2068    CDS         1    MbeC
## 4 genome_10  2049 3503    CDS         1    MbeA
## 5 genome_10  3570 3686    CDS        -1 protein
## 6 genome_10  3785 4066    CDS        -1 protein
coords2 <- within(coords, rm(source))
coords2$strand[coords$direction=="1"] <- "forward"
coords2$strand[coords$direction=="-1"] <- "reverse"
head(coords2)
##    molecule start  end direction    gene  strand
## 1 genome_10     1  948         1 protein forward
## 2 genome_10  1356 1748         1 protein forward
## 3 genome_10  1730 2068         1    MbeC forward
## 4 genome_10  2049 3503         1    MbeA forward
## 5 genome_10  3570 3686        -1 protein reverse
## 6 genome_10  3785 4066        -1 protein reverse

Time to visualize. Lets load gggenes & ggplot2 package

library(gggenes)
library(ggplot2)
ggplot(subset(coords2, molecule == "genome_9"),
       aes(xmin = start, xmax = end, y = molecule, fill = gene, forward = direction)) +
  geom_gene_arrow()

Lets now draw multiple genomes in the plot

#head(coords)
p <- ggplot(coords2, aes(xmin = start, xmax = end, y = molecule, fill = gene)) 
p + geom_gene_arrow() + facet_wrap(~ molecule, scales = "free", ncol = 1) + scale_fill_brewer(palette = "Set3") +
  theme_classic(base_size = 3)

But this does not look so elegant. And without the classic theme, there is constantly error call in R because of the size fonts.

Error in grid.Call(C_convert, x, as.integer(whatfrom), as.integer(whatto)...

So, we save the plot in a svg file using ggsave command

q <- ggplot(coords2, aes(xmin = start, xmax = end, y = molecule, fill = gene)) +  geom_gene_arrow() + facet_wrap(~ molecule, scales = "free", ncol = 1) + scale_fill_brewer(palette = "Set3")

ggsave("pic1_gnome_warrows.svg", plot = q, device = svg, width = 250, height = 250, units = "mm", dpi=300)
#ggsave("pic1_gnome_warrows.png", plot = q, device = png, width = 2000, height = 2000, units = "mm", dpi=30)
#ggsave("pic1_gnome_warrows.jpeg", plot = q, device = jpeg, limitsize = FALSE)

The same above plots saved as svg file using ggsave generates a more elegant plot like this one below:

knitr::include_graphics("pic1_gnome_warrows.svg")

Beautifying the plot with theme_genes()

Because the resulting plot look a bit cluttered, a ‘ggplot2’ theme theme_genes() is provided with some sensible defaults.

q2 <-  ggplot(coords2, aes(xmin = start, xmax = end, y = molecule, fill = gene)) +  geom_gene_arrow() + facet_wrap(~ molecule, scales = "free", ncol = 1) + scale_fill_brewer(palette = "Set3") + theme_genes()

ggsave("pic2_gnome_warrows_lessclutter.svg", plot = q2, device = svg, width = 250, height = 250, units = "mm", dpi=300)

knitr::include_graphics("pic2_gnome_warrows_lessclutter.svg")

Aligning OXA-10 genes across facets using make_alignment_dummies()

With that done, sometimes we need to vertically align a gene across the genomes. This could be easily achieved by make_alignment_dummies(). make_alignment_dummies() generates a set of ‘dummy’ genes that if added to the plot with ggplot2::geom_blank() will extend the range of each facet to visually align the selected gene across facets. Lets see how that works!

dummies <- make_alignment_dummies(
  coords2,
  aes(xmin = start, xmax = end, y = molecule, id = gene),
  on = "OXA-10"
)

q3 <-  ggplot(coords2, aes(xmin = start, xmax = end, y = molecule, fill = gene)) +  geom_gene_arrow() + geom_blank(data = dummies) + facet_wrap(~ molecule, scales = "free", ncol = 1) + scale_fill_brewer(palette = "Set3") + theme_genes()

ggsave("pic3_gnomes_aligned_byGene.svg", plot = q2, device = svg, width = 250, height = 250, units = "mm", dpi=300)

knitr::include_graphics("pic3_gnomes_aligned_byGene.svg")

Labelling genes with geom_gene_label()

To label individual genes, let us provide a label aesthetic and use geom_gene_label().

q4 <- ggplot(coords2, aes(xmin = start, xmax = end, y =molecule, fill = gene, label = gene)) +
  geom_gene_arrow(arrowhead_height = unit(5, "mm"), arrowhead_width = unit(3, "mm")) +
  geom_gene_label(align = "left") +
  geom_blank(data = dummies) +
  facet_wrap(~ molecule, scales = "free", ncol = 1) +
  scale_fill_brewer(palette = "Set3") +
  theme_genes()

ggsave("pic4_labelarrows_wgeneNames.svg", plot = q4, device = svg, width = 250, height = 250, units = "mm", dpi=300)

knitr::include_graphics("pic4_labelarrows_wgeneNames.svg")

Directionality to the gene arrows

#head(coords2)
q5 <- ggplot(coords2, aes(xmin = start, xmax = end, y = molecule, fill = gene, label= gene, forward = direction)) +
  geom_gene_arrow(arrowhead_height = unit(5, "mm"), arrowhead_width = unit(3, "mm")) +
  geom_gene_label(align = "left") + facet_wrap(~ molecule, scales = "free", ncol = 1) +
  scale_fill_brewer(palette = "Set3") +
  theme_genes()

ggsave("pic5_labelarrows_wgeneNames_strand.svg", plot = q5, device = svg, width = 250, height = 250, units = "mm", dpi=300)

knitr::include_graphics("pic5_labelarrows_wgeneNames_strand.svg")

Problems I encountered

  1. loading the image in the html. So, I had to save the figures using ggsave in svg format. Specifically svg format because png, jpeg 's width & height even after increasing did not generate the required image format like svg image file

  2. After generating, I have loaded the svg images using knitr::include_graphics. This works well if the output is knitr html. But for knitr pdf, this did not work. So, I was left with 0creating the html output option

  3. To generate a pdf from html file, I used pandoc. This would generate a nice pdf in latex. But, the code and figure are cut off on the right side if excceeds the page width. So far, could not find fix for this.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published