Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emmaplot with the top activated and supressed from dotplot? #658

Open
Leandromvelez opened this issue Jan 14, 2024 · 4 comments
Open

Emmaplot with the top activated and supressed from dotplot? #658

Leandromvelez opened this issue Jan 14, 2024 · 4 comments

Comments

@Leandromvelez
Copy link

Hello dear Clusterprofiler team!
When segregating for activated and suppressed terms (dotplot with sign split), I want to make an emmaplot with the same top 20 terms from the dotplot. Is that possible? Thank you very much in advance

library(DOSE)
library(clusterProfiler)
library(enrichplot)
data(geneList)

edo2 <- gseDO(geneList)
clusterProfiler::dotplot(edo2, showCategory=10, split=".sign") + facet_grid(.~.sign)
edo <- pairwise_termsim(edo2)
emapplot(edo, showCategory=20) ## This will take the first terms, based on pvalues, and I want to plot the top 10 activated and top 10 suppressed from the dotplot.

@guidohooiveld
Copy link

guidohooiveld commented Jan 16, 2024

If you would like to show in the emapplot the same gene sets as in the dotplot, one option is that you:

  1. explicitly assign the output (=dotplot) to an object, 2) extract from this object the gene sets that are plotted, and 3) only show these in the emapplot.
    BTW: note that the function dotplot is part of the enrichplot package, and not clusterProfiler.
> library(DOSE)
> library(clusterProfiler)
> library(enrichplot)
> data(geneList)
> 
> edo2 <- gseDO(geneList, eps=0)
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
> 
> ## 1) assign dotplot to an object
> ##    note that NO graph is plotted yet!
> p <- enrichplot::dotplot(edo2, showCategory=10, split=".sign") + facet_grid(.~.sign)
> 
> ## plot the graph (dotplot)
> print(p)
> 
> ## 2) Extract the gene sets. These are the 20 sets that are in the dotplot (10 activated, 10 supressed).
> geneSets2plot <- p$data$Description
> 
> ## check
> geneSets2plot
 [1] tuberculosis                                   
 [2] primary immunodeficiency disease               
 [3] primary bacterial infectious disease           
 [4] bacterial infectious disease                   
 [5] pulmonary tuberculosis                         
 [6] combined immunodeficiency                      
 [7] human immunodeficiency virus infectious disease
 [8] embryonal cancer                               
 [9] malaria                                        
[10] embryoma                                       
[11] abdominal obesity-metabolic syndrome 1         
[12] abdominal obesity-metabolic syndrome           
[13] essential hypertension                         
[14] bone structure disease                         
[15] impotence                                      
[16] sexual dysfunction                             
[17] uterine fibroid                                
[18] polycystic ovary syndrome                      
[19] degeneration of macula and posterior pole      
[20] ovarian dysfunction                            
20 Levels: embryoma ... bone structure disease
> 
> ## 3) make a emapplot showing only the selected gene sets
> edo <- pairwise_termsim(edo2)
> p2 <- enrichplot::emapplot(edo, showCategory=geneSets2plot)
> 
> ## plot the graph (emapplot)
> print(p2)
> 
> 

dotplot:
dotplot

emapplot:
emapplot

@Leandromvelez
Copy link
Author

Leandromvelez commented Jan 16, 2024

Thank you very much for your response, and sorry for not explaining this issue in more detail:
The problem is the error "Error in pair_sim[as.character(enrichDf$Description), as.character(enrichDf$Description)] :
subscript out of bounds" and only occurred with some of the genelists I'm working on. For some gse results, the solution you posted, works, but not for some others. It seems there is a problem with the order of the terms, but I reordered the terms to match the object used by emapplot, but I'm still getting the error. I'm uploading one of the gse objects giving me the error (hope this links works, an environment with the gse object https://drive.google.com/file/d/172HmlPfgtueBLBqCcZUcOdSrxZecNUxB/view?usp=sharing )

, and the scripts I've been using:

library(clusterProfiler)
library(enrichplot)

g2 = clusterProfiler::dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign)
print (g2)
terms = as.character(g2$data$Description)
terms
p1 <- pairwise_termsim(gse)

matched_indices = p1@result$Description[p1@result$Description %in% terms]

checked the order of terms for terms and matched_indices

enrichplot::emapplot(p1 , showCategory= matched_indices)

Error in pair_sim[as.character(enrichDf$Description), as.character(enrichDf$Description)] :

subscript out of bounds

Thanks in advance!

@guidohooiveld
Copy link

guidohooiveld commented Jan 17, 2024

When posting, please format your code! Select that text, and use the <> button to format as code.

I could reproduce your problem with the provided gse object. Combined with a traceback(), it becomes clear it has to do with the similarity matrix.

> enrichplot::emapplot(p1 , showCategory= terms)
Error in pair_sim[as.character(enrichDf$Description), as.character(enrichDf$Description)] : 
  subscript out of bounds
>
> 
> traceback()
6: build_emap_graph(enrichDf = y, geneSets = geneSets, color = color, 
       cex_line = cex_line, min_edge = min_edge, pair_sim = x@termsim, 
       method = x@method)
5: get_igraph(x = x, nCategory = n, color = color, cex_line = cex_line, 
       min_edge = min_edge)
4: emapplot.enrichResult(x, showCategory = showCategory, ...)
3: .local(x, ...)
2: enrichplot::emapplot(p1, showCategory = terms)
1: enrichplot::emapplot(p1, showCategory = terms)

Note that the total number of significant gene sets in your object gse is 1338:

> dim(gse)
[1] 1338   11
>

.... whereas the dimensions of the similarity matrix is (only) 200 x 200:

> dim(p1@termsim)
[1] 200 200
>

This is because for the function pairwise_termsim() the default setting is showCategory = 200. See ?pairwise_termsim; showCategory - number of enriched terms to display, default value is 200.

Combine this with fact that within gse, by default, the gene sets are ranked on significance, but in the dotplot the gene sets are ranked on GeneRatio.

It thus turns out that some of gene sets that have a large GeneRatio, are not in the top 200 of significant gene sets. As a consequence of this, sub-setting of p1 goes wrong when a set of gene sets based on GeneRatio is provided.... because the required similarity matrix is missing for these gene sets!

This can be simply solved by calculating the similarity matrix for all gene sets, and not only the top 200.

Therefore (note that I also modified [the order of] your code slightly):

> library(clusterProfiler)
> library(enrichplot)
> load("gse.RData")
> 
> ## check
> gse
#
# Gene Set Enrichment Analysis
#
#...@organism    Homo sapiens 
#...@setType     BP 
#...@keytype     SYMBOL 
#...@geneList    Named num [1:33511] 2.57 2.48 2.19 2.19 2.12 ...
 - attr(*, "names")= chr [1:33511] "LHFPL4" "FBXO40" "OR1D3P" "DAAM2-AS1" ...
#...nPerm        
#...pvalues adjusted by 'BH' with cutoff <0.05 
#...1338 enriched terms found
'd
 $ ID             : chr  "GO:0006302" "GO:0006397" "GO:0018205" "GO:0007059" ...
 $ Description    : chr  "double-strand break repair" "mRNA processing" "peptidyl-lysine modification" "chromosome segregation" ...
 $ setSize        : int  303 484 372 433 468 310 434 364 326 242 ...
 $ enrichmentScore: num  -0.538 -0.475 -0.502 -0.485 -0.472 ...
 $ NES            : num  -2 -1.79 -1.88 -1.82 -1.78 ...
 $ pvalue         : num  2.01e-24 5.57e-23 3.34e-22 7.31e-22 8.73e-22 ...
 $ p.adjust       : num  1.28e-20 1.78e-19 7.10e-19 1.11e-18 1.11e-18 ...
 $ qvalue         : num  9.08e-21 1.26e-19 5.03e-19 7.90e-19 7.90e-19 ...
 $ rank           : num  8691 10625 9127 12397 9312 ...
 $ leading_edge   : chr  "tags=59%, list=26%, signal=44%" "tags=57%, list=32%, signal=40%" "tags=57%, list=27%, signal=42%" "tags=63%, list=37%, signal=40%" ...
 $ core_enrichment: chr  "SFR1/NUCKS1/MBTD1/OTUB1/ZBTB7A/YEATS4/CDCA5/RPA1/PARP2/DPF2/HELB/RPA3/PNKP/ATRIP/RIF1/AP5Z1/RAD50/WRAP53/CDC45/"| __truncated__ "SRSF8/RNPC3/SF3B5/SNRNP25/TSEN34/ZRANB2/SF3B6/APOBEC4/HNRNPR/BRDT/HNRNPK/TTF2/RBM14/HABP4/AKAP17A/KHDRBS1/PTBP1"| __truncated__ "ZBED1/SIRT5/YEATS4/FAM86B1/KAT6A/RELA/PPARGC1A/BCL11A/H1-3/KANSL1/CREBBP/PAGR1/RIF1/CBX4/ASH1L/SENP5/UBE2I/METT"| __truncated__ "SMC4/EML4/GOLGA8G/WAPL/SPAG5/ABRAXAS2/NCAPH/PPP2R1A/MACROH2A1/NTMT1/ACTR2/BCL7A/CDC42/UBE2C/PINX1/ARHGEF10/DPF3"| __truncated__ ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

> 
> ## apply 'pairwise_termsim' before making plots
> ## calculate the similarity matrix for all gene sets, and not the top 200!
> ## note that this takes some time
> gse <- pairwise_termsim(gse, showCategory = dim(gse)[1] )
> 
> 
> ## make dotplot
> g2 <- enrichplot::dotplot(gse, showCategory=10, split=".sign") + facet_grid(.~.sign)
> print(g2)
> 
> ## extract the gene sets that are in the dotplot
> terms = as.character(g2$data$Description)
> terms
 [1] "linoleic acid metabolic process"                                  
 [2] "cysteine metabolic process"                                       
 [3] "response to food"                                                 
 [4] "ovulation"                                                        
 [5] "acetyl-CoA biosynthetic process from pyruvate"                    
 [6] "miRNA-mediated gene silencing by mRNA destabilization"            
 [7] "sequestering of triglyceride"                                     
 [8] "positive regulation of fatty acid metabolic process"              
 [9] "lipoxygenase pathway"                                             
[10] "regulation of amino acid import across plasma membrane"           
[11] "double-strand break repair"                                       
[12] "mRNA processing"                                                  
[13] "peptidyl-lysine modification"                                     
[14] "chromosome segregation"                                           
[15] "histone modification"                                             
[16] "regulation of response to DNA damage stimulus"                    
[17] "proteasome-mediated ubiquitin-dependent protein catabolic process"
[18] "vesicle organization"                                             
[19] "DNA recombination"                                                
[20] "protein polyubiquitination"                                       
> 
> 
> ## Make an emapplot showing only the selected gene sets
> g3 <- enrichplot::emapplot(gse, showCategory = terms)
> print(g3)
> 
> 
> 
>

dotplot g2:

dotplot

emapplot g3:
emapplot

@Leandromvelez
Copy link
Author

That worked perfectly!
Thank you very much for detailed explanation!
And sorry for not posting the lines in the correct format for code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants