Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different p value estimates between bulk and single gene set testing #6

Open
teng-gao opened this issue Jan 9, 2022 · 3 comments
Open

Comments

@teng-gao
Copy link

teng-gao commented Jan 9, 2022

Thanks for creating this tool, I'm finding it very useful for my analysis. I am trying to first screen a number of gene sets using the bulk option and then visualize specific ones that are significant. However, I noticed that for the same gene set the p value being shown on the plot is different from the bulk testing. I understand that these are empirical p values so the exact values may not match. However, they are different by an order of magnitude (see below). Do you have insights into why this is the case?

gsea_out = bulk.gsea(
    values = fold_changes %>% 
        filter(gene %in% expressed_genes) %>%
        filter(!is.infinite(logFC)) %>% 
        {setNames(.$logFC, .$gene)},
    set.list = h_gene_sets,
    mc.cores = 10
)

Screen Shot 2022-01-09 at 9 14 39 AM

for (gs in c('INTERFERON_GAMMA_RESPONSE', 'INTERFERON_ALPHA_RESPONSE')) {

    gsea(
        fold_changes %>% 
            filter(gene %in% expressed_genes) %>% 
            filter(!is.infinite(logFC)) %>% 
            {setNames(.$logFC, .$gene)},
        h_gene_sets[[gs]],
        main = str_remove(gs, 'HALLMARK_')
    )
  
}

Screen Shot 2022-01-09 at 9 15 36 AM

@JEFworks
Copy link
Owner

JEFworks commented Jan 9, 2022

Dear Teng,

Thanks for sharing your output and the well organized issue report.

The differences you see may be attributable to the number of permutations being done to estimate the empirical p-value in the gsea vs. bulk.gsea functions. We can show how the number of permutations relates to the estimated the empirical p-value by explicitly setting the number of permutations in the following example:

library(liger)
# load gene set
data("org.Hs.GO2Symbol.list")  
# get universe
universe <- unique(unlist(org.Hs.GO2Symbol.list))
# get a gene set
gs <- org.Hs.GO2Symbol.list[[1]]
# fake dummy example where everything in gene set is perfectly enriched
vals <- rnorm(length(universe), 0, 10)
names(vals) <- universe
vals[gs] <- rnorm(length(gs), 100, 10)
head(vals)  # look at vals

gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=1000)
bulk.gsea(values=vals, set.list=list(gs), mc.cores=1, n.rand=1000)$p.val
gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=10000)
bulk.gsea(values=vals, set.list=list(gs), mc.cores=1, n.rand=10000)$p.val

Note that with 1000 permutations, the best we can estimate the true p-value is < 1/1000

> gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=1000)
[1] 0.001
> bulk.gsea(values=vals, set.list=list(gs), mc.cores=1, n.rand=1000)$p.val
[1] 0.000999001

With 10,000 permutations, we can get more precise to < 1/10,000.

> gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE, n.rand=10000)
[1] 1e-04
> bulk.gsea(values=vals, set.list=list(gs), mc.cores=1, n.rand=10000)$p.val
[1] 9.999e-05

Hope this helps!

Stay healthy and safe,
Jean

@teng-gao
Copy link
Author

teng-gao commented Jan 10, 2022

Hi Jean,

Thanks for the detailed response!

I actually tried to set n.rand=1e4 and used the same random seed for both bulk and single gene set testing, however the discrepancy seemed to remain. I'm attaching the gene sets and logFC values below in case that is helpful to reproduce the issue. Interestingly, it also seems that the p values from bulk testing are consistently lower.

geneset_and_logfc.zip

@JEFworks
Copy link
Owner

JEFworks commented Jan 14, 2022

Dear Teng,

Thanks for sharing the geneset and logFC values you were using; very helpful in diagnostics.

I've compared the p-values from bulk.gsea, iterative.bulk.gsea and gsea and have found that if, rank=TRUE, then all 3 give the same results as expected:

> results <- iterative.bulk.gsea(logFC, set.list=geneset, rank=TRUE)
initial: [1e+02 - 20] [1e+03 - 8] [1e+04 - 5] done
> sig <- rownames(results)[results$q.val < 0.05]
> results[sig,]
                               p.val       q.val    sscore         edge
ANDROGEN_RESPONSE         0.00379962 0.031663500 -1.104368  0.009336836
INTERFERON_ALPHA_RESPONSE 0.00009999 0.002499750 -1.448063 -0.173254577
INTERFERON_GAMMA_RESPONSE 0.00069993 0.006999300 -1.254465 -0.030810312
MTORC1_SIGNALING          0.00689931 0.043120688  1.034815 -0.280169781
MYC_TARGETS_V1            0.00009999 0.002499750  1.775444 -0.264572926
OXIDATIVE_PHOSPHORYLATION 0.00029997 0.004999500  1.348529 -0.280169781
PROTEIN_SECRETION         0.00049995 0.006249375 -1.314850  0.015286103
UNFOLDED_PROTEIN_RESPONSE 0.00569943 0.040710215  1.065328 -0.269198663
> results.test <- bulk.gsea(logFC, set.list=geneset[sig], n.rand=1e4, rank=TRUE)
> results.test
                               p.val       q.val    sscore         edge
ANDROGEN_RESPONSE         0.00379962 0.006400000 -1.104368  0.009336836
INTERFERON_ALPHA_RESPONSE 0.00009999 0.001100000 -1.448063 -0.173254577
INTERFERON_GAMMA_RESPONSE 0.00069993 0.001966667 -1.254465 -0.030810312
MTORC1_SIGNALING          0.00689931 0.013975000  1.034815 -0.280169781
MYC_TARGETS_V1            0.00009999 0.000100000  1.775444 -0.264572926
OXIDATIVE_PHOSPHORYLATION 0.00029997 0.001550000  1.348529 -0.280169781
PROTEIN_SECRETION         0.00049995 0.001450000 -1.314850  0.015286103
UNFOLDED_PROTEIN_RESPONSE 0.00569943 0.013975000  1.065328 -0.269198663
> results.ind <- do.call(rbind, lapply(sig, function(gs) {
+   gsea(values = logFC, geneset = geneset[[gs]], main=gs, n.rand=1e4, return.details=TRUE, rank=TRUE)
+ }))
> rownames(results.ind) <- sig
> results.ind[, c(1,2,4,3)]
                           p.val edge.score scaled.score   edge.value
ANDROGEN_RESPONSE         0.0038  -2185.562     1.104368  0.009336836
INTERFERON_ALPHA_RESPONSE 0.0001  -2797.468     1.448063 -0.173254577
INTERFERON_GAMMA_RESPONSE 0.0007  -1794.328     1.254465 -0.030810312
MTORC1_SIGNALING          0.0069   1363.912     1.034815 -0.280169781
MYC_TARGETS_V1            0.0001   2248.408     1.775444 -0.264572926
OXIDATIVE_PHOSPHORYLATION 0.0003   1673.174     1.348529 -0.280169781
PROTEIN_SECRETION         0.0005  -2448.887     1.314850  0.015286103
UNFOLDED_PROTEIN_RESPONSE 0.0057   1794.249     1.065328 -0.269198663

However, when rank=FALSE, and the actual inputted values are used, the gsea function does indeed give the incorrect p-value (due to a discrepancy in how the scaled score is calculated), resulting in difference you saw:

> results <- iterative.bulk.gsea(logFC, set.list=geneset, rank=FALSE)
initial: [1e+02 - 21] [1e+03 - 10] [1e+04 - 5] done
> sig <- rownames(results)[results$q.val < 0.05]
> results[sig,]
                               p.val      q.val    sscore       edge
ADIPOGENESIS              0.00019998 0.00333300  1.346984 -0.7797408
INTERFERON_ALPHA_RESPONSE 0.00139986 0.01166550 -1.189337 -0.4057007
INTERFERON_GAMMA_RESPONSE 0.00079992 0.00799920 -1.281145 -0.2995351
MTORC1_SIGNALING          0.00569943 0.04071021  1.069090 -0.6127451
MYC_TARGETS_V1            0.00009999 0.00249975  1.497201 -0.9431348
OXIDATIVE_PHOSPHORYLATION 0.00009999 0.00249975 -1.746883  0.5227934
PROTEIN_SECRETION         0.00059994 0.00749925 -1.285763  0.2966188
> results.test <- bulk.gsea(logFC, set.list=geneset[sig], n.rand=1e4, rank=FALSE)
> results.test
                               p.val    q.val    sscore       edge
ADIPOGENESIS              0.00019998 0.000750  1.346984 -0.7797408
INTERFERON_ALPHA_RESPONSE 0.00139986 0.002850 -1.189337 -0.4057007
INTERFERON_GAMMA_RESPONSE 0.00079992 0.001700 -1.281145 -0.2995351
MTORC1_SIGNALING          0.00569943 0.008875  1.069090 -0.6127451
MYC_TARGETS_V1            0.00009999 0.000300  1.497201 -0.9431348
OXIDATIVE_PHOSPHORYLATION 0.00009999 0.000000 -1.746883  0.5227934
PROTEIN_SECRETION         0.00059994 0.001700 -1.285763  0.2966188
> results.ind <- do.call(rbind, lapply(sig, function(gs) {
+   gsea(values = logFC, geneset = geneset[[gs]], main=gs, n.rand=1e4, return.details=TRUE, rank=FALSE)
+ }))
> rownames(results.ind) <- sig
> results.ind[, c(1,2,4,3)]
                           p.val edge.score scaled.score edge.value
ADIPOGENESIS              0.0010   2494.709     1.242194 -0.7797408
INTERFERON_ALPHA_RESPONSE 0.0034  -3127.154     1.119215 -0.4057007
INTERFERON_GAMMA_RESPONSE 0.0015  -2463.756     1.183261 -0.2995351
MTORC1_SIGNALING          0.0039   2099.706     1.105981 -0.6127451
MYC_TARGETS_V1            0.0001   2674.405     1.455259 -0.9431348
OXIDATIVE_PHOSPHORYLATION 0.0001  -2919.925     1.581537  0.5227934
PROTEIN_SECRETION         0.0010  -3142.772     1.195370  0.2966188

Thanks for the catch. For now, please ignore the visualized p-values from the gsea function when rank=FALSE.

Generally, since you are testing many gene sets, if you intend to display these results, I would actually strongly advise displaying the multiple-testing corrected q-values from bulk testing instead of raw p-values.

Thanks again.

Stay healthy and safe,
Jean

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants