Error: output the same p-value among different clusters #11

Open
Jackycccc opened this issue Nov 13, 2023 · 1 comment

@Jackycccc

Hello,
Thanks for this awesome tool! I'm running regioneR with two functions, commonRegions() and overlapPermTest(), but the p-value comes out the same across different overlapPermTest() runs.

First, DRR_region and DEG_region are two datasets, each containing 5 clusters defined by chr, start and end. I converted my gene regions from data.frame objects into a format that overlapPermTest() can recognize:

library(regioneR)
DRR_bed_list <- list(C1 = commonRegions(DRR_region$C1, DRR_region$C1),
                     C2 = commonRegions(DRR_region$C2, DRR_region$C2),
                     C3 = commonRegions(DRR_region$C3, DRR_region$C3),
                     C4 = commonRegions(DRR_region$C4, DRR_region$C4),
                     C5 = commonRegions(DRR_region$C5, DRR_region$C5))

DEG_bed_list <- list(C1 = commonRegions(DEG_region$C1, DEG_region$C1),
                     C2 = commonRegions(DEG_region$C2, DEG_region$C2),
                     C3 = commonRegions(DEG_region$C3, DEG_region$C3),
                     C4 = commonRegions(DEG_region$C4, DEG_region$C4),
                     C5 = commonRegions(DEG_region$C5, DEG_region$C5))

Then I used overlapPermTest() to test each of the 5 DRR_region clusters against each of the 5 DEG_region clusters:

genome <-  getGenomeAndMask("hg38")$genome
C11 = overlapPermTest(A=DRR_bed_list$C1,B=DEG_bed_list$C1, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C12 = overlapPermTest(A=DRR_bed_list$C1,B=DEG_bed_list$C2, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C13 = overlapPermTest(A=DRR_bed_list$C1,B=DEG_bed_list$C3, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C14 = overlapPermTest(A=DRR_bed_list$C1,B=DEG_bed_list$C4, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C15 = overlapPermTest(A=DRR_bed_list$C1,B=DEG_bed_list$C5, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)

C21 = overlapPermTest(A=DRR_bed_list$C2,B=DEG_bed_list$C1, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C22 = overlapPermTest(A=DRR_bed_list$C2,B=DEG_bed_list$C2, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C23 = overlapPermTest(A=DRR_bed_list$C2,B=DEG_bed_list$C3, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C24 = overlapPermTest(A=DRR_bed_list$C2,B=DEG_bed_list$C4, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C25 = overlapPermTest(A=DRR_bed_list$C2,B=DEG_bed_list$C5, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)

C31 = overlapPermTest(A=DRR_bed_list$C3,B=DEG_bed_list$C1, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C32 = overlapPermTest(A=DRR_bed_list$C3,B=DEG_bed_list$C2, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C33 = overlapPermTest(A=DRR_bed_list$C3,B=DEG_bed_list$C3, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C34 = overlapPermTest(A=DRR_bed_list$C3,B=DEG_bed_list$C4, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C35 = overlapPermTest(A=DRR_bed_list$C3,B=DEG_bed_list$C5, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)

C41 = overlapPermTest(A=DRR_bed_list$C4,B=DEG_bed_list$C1, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C42 = overlapPermTest(A=DRR_bed_list$C4,B=DEG_bed_list$C2, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C43 = overlapPermTest(A=DRR_bed_list$C4,B=DEG_bed_list$C3, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C44 = overlapPermTest(A=DRR_bed_list$C4,B=DEG_bed_list$C4, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C45 = overlapPermTest(A=DRR_bed_list$C4,B=DEG_bed_list$C5, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)

C51 = overlapPermTest(A=DRR_bed_list$C5,B=DEG_bed_list$C1, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C52 = overlapPermTest(A=DRR_bed_list$C5,B=DEG_bed_list$C2, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C53 = overlapPermTest(A=DRR_bed_list$C5,B=DEG_bed_list$C3, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C54 = overlapPermTest(A=DRR_bed_list$C5,B=DEG_bed_list$C4, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)
C55 = overlapPermTest(A=DRR_bed_list$C5,B=DEG_bed_list$C5, ntimes=1000, genome=genome, non.overlapping=FALSE, verbose=TRUE)

Finally, I extracted the p-value and z-score from each test:

C11_P <- C11$numOverlaps$pval
C11_Z <- C11$numOverlaps$zscore
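
The other pairs were extracted in the same way; this is roughly how I collected everything into one table (a sketch, using the C11-C55 test objects created above):

# Sketch: gather the p-value and z-score of all 25 tests into one table
tests   <- mget(paste0("C", rep(1:5, each = 5), 1:5))            # C11 ... C55
results <- data.frame(pair   = names(tests),
                      pvalue = sapply(tests, function(x) x$numOverlaps$pval),
                      zscore = sapply(tests, function(x) x$numOverlaps$zscore))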

But the p-values of C21-C55 are all identical while the z-scores differ.

Output (p-value and z-score):

pair p-value z-score
C11 0.014985015 2.3609
C12 0.033966034 1.8495
C13 0.362637363 0.3327
C14 0.101898102 -1.3322
C15 0.339660340 0.4698
C21 0.000999001 9.8182
C22 0.000999001 10.6100
C23 0.000999001 8.0227
C24 0.000999001 8.0888
C25 0.000999001 8.9661
C31 0.000999001 9.0502
C32 0.000999001 12.9745
C33 0.000999001 9.7930
C34 0.000999001 15.3452
C35 0.000999001 12.4329
C41 0.000999001 12.2120
C42 0.000999001 4.3777
C43 0.000999001 5.2212
C44 0.000999001 3.7542
C45 0.000999001 7.5104
C51 0.000999001 6.7820
C52 0.000999001 9.3780
C53 0.000999001 7.4704
C54 0.000999001 5.5012
C55 0.000999001 6.0818

How do I get the correct p-values?
I would appreciate any help you can offer; looking forward to your reply.

Thanks,
Jia Chen

@bernatgel
Owner

Hi @Jackycccc

What you are getting is correct. Let me explain: regioneR uses an empirical p-value computation. That is, instead of building a model and evaluating your data against it, it "simply" creates many random datasets and checks how often we see something similar to your original data. If we do not see it often, it means your data is "special" (we are seeing something we should not see by chance). The p-value therefore depends on the number of random datasets created (ntimes) and on how many times we see something similar to your original observation. The p-value you are repeatedly getting is the smallest p-value you can get with only 1000 permutations. If you increase ntimes to 5000, for example, you may get more distinct p-values. That being said, once the p-value is below a certain threshold it may not really matter exactly how small it is.
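
A quick back-of-the-envelope check (plain arithmetic, not regioneR code; it assumes the usual empirical formula p = (times as extreme + 1) / (ntimes + 1)):

(0 + 1) / (1000 + 1)   # 0.000999001 -> exactly the value repeated in your table
(0 + 1) / (5000 + 1)   # about 0.0002 -> the floor drops as ntimes grows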

In addition to this, it is possible (and expected) for the same p-value to come with different z-scores in different datasets. The two values are related but not directly linked.

Finally, if I understand correctly you are using commonRegions only to transform your data. You do not need to do that: overlapPermTest should be able to transform the data.frames into GRanges objects automatically, and if you want to make the transformation explicit you can use the toGRanges function:

DRR_bed_list <- list(C1 = toGRanges(DRR_region$C1),
                     C2 = toGRanges(DRR_region$C2),
                     C3 = toGRanges(DRR_region$C3),
                     C4 = toGRanges(DRR_region$C4),
                     C5 = toGRanges(DRR_region$C5))
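
Or, without spelling out every cluster, a minimal sketch (assuming DRR_region and DEG_region are plain lists of chr/start/end data.frames, as in your code):

DRR_bed_list <- lapply(DRR_region, toGRanges)  # convert every cluster at once
DEG_bed_list <- lapply(DEG_region, toGRanges)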

Hope this helps!

Bernat
