Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

promoter questions #3

Open
cb4github opened this issue Sep 1, 2017 · 6 comments
Open

promoter questions #3

cb4github opened this issue Sep 1, 2017 · 6 comments

Comments

@cb4github
Copy link

Dear Team,

First, thanks for all your efforts - in particular for encapsulating in goldmine and abbreviating what has been many a previous effort in mapping DM results to a RefSeq genome.

Please excuse if the following are not so much issues but questions about the code, which presumably then is working as designed.

Q1. Given a query range that overlaps a promoter of a subject gene for which there is no entry in the resulting goldmine gene table (such as row 1 in the example dmrQuery below), then is it true that there is no such entry in the resulting goldmine gene table since the given query range does not overlap with the subject gene body?

Q2. Given a query range that results in a value of promoter_per > 100 (such as row 2 in the example dmrQuery below), what is the meaning of context$promoter_per in this case?

Q3. What is the meaning of gene$Promoter?

Best,
Carl B.

Here's example code.

library(goldmine)
library(data.table)
dmrBedText = "track name = dmrTrack description= my DMR description
chr1 730452 730843 0.3575 -8.9775
chr1 31296092 31296616     0.2786      NA"
dmrBedRows <- read.table(text = dmrBedText, skip = 1)
setnames(dmrBedRows, c("chr", "start", "end", "meanPmDiff", "logPVal"))
dmrQuery <- makeGRanges(dmrBedRows)
# build local refseq DB
CACHE_DIR <- "gbcache"
GENOME <- "hg38"
REFSEQ <- "refseq"
refseqGenes <-
  getGenes(REFSEQ, genome = GENOME, cachedir = CACHE_DIR)
NUMBER_BP_UPSTREAM_OF_PROMOTER <- 2000
NUMBER_BP_DOWNSTREAM_OF_PROMOTER <- 200
dmrToRefseqOverlap <-
  goldmine(
    query = dmrQuery,
    genes = refseqGenes,
    promoter = c(
      NUMBER_BP_UPSTREAM_OF_PROMOTER,
      NUMBER_BP_DOWNSTREAM_OF_PROMOTER
    ),
    genome = GENOME,
    cachedir = CACHE_DIR
  )
dmrToRefseqOverlap$context
dmrToRefseqOverlap$genes

Here's the output.

Computing gene models
Generating context annotation - genes
Generating genes report
Generating exon/intron overlap diagrams
> dmrToRefseqOverlap$context
    chr    start      end width strand meanPmDiff logPVal qrow promoter_per end3_per exon_per intron_per
1: chr1   730452   730843   392      *     0.3575 -8.9775    1       100.00        0     0.00       0.00
2: chr1 31296092 31296616   525      *     0.2786      NA    2       103.62        0     1.14      98.86
   intergenic_per utr5_per utr3_per     call       call_genes overlapped_genes        nearest_genes
1:            100        0        0 promoter     LOC100133331     LOC100133331 LOC100133331, FAM87B
2:              0        0        0 promoter ZCCHC17, SNRNP40 SNRNP40, ZCCHC17              SNRNP40
   distance_to_nearest_gene                                                                               url
1:                      100     http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1%3A730452-730843
2:                        0 http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1%3A31296092-31296616
> dmrToRefseqOverlap$genes
   qrow  srow query.chr query.start query.end gene.symbol gene.id isoform.id isoform.chr isoform.start
1:    2 51109      chr1    31296092  31296616     SNRNP40 SNRNP40  NM_004814        chr1      31259568
   isoform.end isoform.strand overlap.bp query.overlap.per isoform.overlap.per noncoding Promoter
1:    31296797              -        525               100                1.41     FALSE     3.62
                                                                                                                                                       ExonIntron
1: E1 (1.14), I1 (98.86), E2 (0), I2 (0), E3 (0), I3 (0), E4 (0), I4 (0), E5 (0), I5 (0), E6 (0), I6 (0), E7 (0), I7 (0), E8 (0), I8 (0), E9 (0), I9 (0), E10 (0)
   3' End                                                                               url
1:      0 http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1%3A31296092-31296616
> 
@jeffbhasin
Copy link
Owner

Hello Carl,
Glad to hear you are finding Goldmine to be useful.

To answer your questions:

Q1. Given a query range that overlaps a promoter of a subject gene for which there is no entry in the resulting goldmine gene table (such as row 1 in the example dmrQuery below), then is it true that there is no such entry in the resulting goldmine gene table since the given query range does not overlap with the subject gene body?

Yes this is correct that there will not be a gene table entry for a region from the query regions if it does not overlap with any gene. All query regions do appear in the "context" table, but will only appear in the genes or features tables if the query region does indeed overlap with one of those subjects.

Q2. Given a query range that results in a value of promoter_per > 100 (such as row 2 in the example dmrQuery below), what is the meaning of context$promoter_per in this case?

This is a good question, as promoter_per by definition should have a max of 100. I ran your code example and looked at the data. I the correct value should be "100". I noticed that the row 2 region actually overlaps promoters of 2 genes (ZCCHC17 at 100% and SNRNP40 at 3.62%), and I suspect that somehow the values have been summed. This is something I'll need to look into further.

Q3. What is the meaning of gene$Promoter?

This is the percentage of the query region (given by chr/start/end of that row) that is overlapped by the promoter of the isoform given in "isoform.id" of that row.

I should point out that the genes table only shows a row for a query region if that query region directly overlaps with the gene body. Thus, overlaps restricted to promoters and gene 3' regions do not appear as rows in genes. This is why there is now a row for chr1:31296092-31296616 and ZCCHC17.

@jeffbhasin jeffbhasin added the bug label Sep 4, 2017
@jeffbhasin
Copy link
Owner

I believe I have found the source of this bug, and it was due to strand not being ignored in the calcPercentOverlap function. When a query region overlapped with two promoters that themselves also overlapped and those two promoters were on different strands the bug would be induced.

Note that Goldmine was written to ignore strand information for simplicity of implementation, and strand should have been ignored here as well. (Stranded analyses are however supported by pre-splitting the query, genes, and features and running goldmine() once for each strand.) In this instance, stranded information was being sent from the upstream getGeneModels() function. The fix (commit at da4370b) was to ignore this information, and then compute the percent overlaps. Overlapping promoters were already dealt with correctly by the code on the presumption that the function is fed an unstranded subject.gr argument.

I created a branch called issue-3. Please try it out and if this fixes the issue and induces no other issues I can pull it to master.

library(devtools)
install_github("jeffbhasin/goldmine",ref="issue-3")

Thanks very much for finding this and please let me know if you have any additional questions!

@cb4github
Copy link
Author

Hi Jeff,

Thanks for your reply, and I now have a couple of new questions.

Q4. Re SNRNP40, I'm not sure how to get the result 3.62% for overlap of the promoter. I keep getting ~0.9%.

library(intervals)
dmr2 <- Intervals(
  matrix(
    dmrToRefseqOverlap$genes[,c(query.start, query.end)],
    ncol = 2, byrow = FALSE
  ),
  closed = c( TRUE, TRUE ),
  type = "Z"
)
promoter2 <- Intervals(
  matrix(
    dmrToRefseqOverlap$genes[, isoform.end] + c(-200, 2000),
    ncol = 2, byrow = FALSE
  ),
  closed = c( TRUE, TRUE ),
  type = "Z"
)
> print(dmr2)
Object of class Intervals
1 interval over Z:
[31296092, 31296616]
> print(promoter2)
Object of class Intervals
1 interval over Z:
[31296597, 31298797]
> print(dmrIntersectPromoter2 <- interval_intersection(dmr2, promoter2))
Object of class Intervals
1 interval over Z:
[31296597, 31296616]
> print(sizeDmrIntersectPromoter2 <- size(dmrIntersectPromoter2))
[1] 20
> print(sizePromoter2 <- size(promoter2))
[1] 2201
> print(100*sizeDmrIntersectPromoter2/sizePromoter2)
[1] 0.9086779
> 

Q5. In my original example code above, I meant to use GENOME <- "hg19", so that when I rerun my example code with that modification, I get a new error as in the below...

...
> dmrToRefseqOverlap <-
+   goldmine(
+     query = dmrQuery,
+     genes = refseqGenes,
+     promoter = c(
+       NUMBER_BP_UPSTREAM_OF_PROMOTER,
+ .... [TRUNCATED] 
Computing gene models
Generating context annotation - genes
Generating genes report
Error in `$<-.data.frame`(`*tmp*`, "type", value = "E") : 
  replacement has 1 row, data has 0
> traceback()
9: stop(sprintf(ngettext(N, "replacement has %d row, data has %d", 
       "replacement has %d rows, data has %d"), N, nrows), domain = NA)
8: `$<-.data.frame`(`*tmp*`, "type", value = "E")
7: `$<-`(`*tmp*`, "type", value = "E")
6: reportGenes2(query.gr, genes.gr)
5: goldmine(query = dmrQuery, genes = refseqGenes, promoter = c(NUMBER_BP_UPSTREAM_OF_PROMOTER, 
       NUMBER_BP_DOWNSTREAM_OF_PROMOTER), genome = GENOME, cachedir = CACHE_DIR) at debugGoldmine.R#22
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("~/research/epigenetics/20161123/lebron/debugGoldmine.R", 
       echo = TRUE)

Please advise, thanks.

Best,
Carl B.

@jeffbhasin
Copy link
Owner

Hello Carl,

I investigated your question Q5 first, and found that this was due to the set of query ranges being a case where there were no intron/exon overlaps. Some code was running that only applies in case there are intron/exon overlaps, so I corrected the error by conditioning the processing over the intron/exon overlap table on there being actual overlap. Before I had really only run Goldmine on datasets with at least one exon/intron overlap where this case didn't arise.

You can see the commits to fix this issue here: 138a66f and c2bae07 and you can test it by pulling the latest commit from the "issue-3" branch as before.

Please let me know if any further issues. I will investigate your Q4 next.

Jeff

@jeffbhasin
Copy link
Owner

jeffbhasin commented Sep 6, 2017

Hello Carl,
I ran your code for Q4 and also did my own arithmetic. Goldmine computes length 19 bp for the overlap because it includes the transcription start bp as part of the upstream. Asides from that, I get the same values as you do. I found that the 3.62 is the percent of the DMR that is overlapped by promoter. The percentages are the fraction of overlap / width of query. I believe this is consistent with my response to Q3. At design time, I thought it was more logical to make these with respect to the DMR, so you see what % of the DMR is in each gene model portion, rather than the other way around.

The below code appended to the end of your own gives the 3.62 value, accounting for the off-by-in in our promoter conventions.

100*(sizeDmrIntersectPromoter2-1)/size(dmr2)

Jeff

Updated: I edited my comment because I think this is consistent with my answer to Q3

@cb4github
Copy link
Author

Hello Jeff,

I tried your updated version...

  source("http://bioconductor.org/biocLite.R")
  biocLite(c("GenomicRanges", "IRanges", "devtools"))
  # Then, install Goldmine from GitHub. Be sure to accept installation of any additional pre-requisite packages from CRAN.
  library(devtools)
  install_github("jeffbhasin/goldmine",ref = "issue-3")

... and it seems that another, perhaps old, problem as (re)arisen.

With the following code (tl;dr - please excuse my admittedly lengthy attempt at a simple, reproducible example)...

CACHE_DIR <- "gbcache"
GENOME <- "hg19"
REFSEQ <- "refseq"
refseqGenesDt <-  getGenes(REFSEQ, genome = GENOME, cachedir = CACHE_DIR)
dmr1 <- GRanges("chr1", IRanges(start = 754738, end = 755255))
dmr2 <- GRanges("chr1", IRanges(start = 783146, end = 783802))
testQuery <- dmr2
testSubject <- refseqGenesDt[gene.id == "LINC01128"][2:3]
testQuery
testSubject
testOverlap <-
  goldmine(
    query = testQuery,
    genes = testSubject,
    promoter = c(
      NUMBER_BP_UPSTREAM_FROM_TSS_IN_PROMOTER,
      NUMBER_BP_DOWNSTREAM_FROM_TSS_IN_PROMOTER
    ),
    genome = GENOME,
    cachedir = CACHE_DIR,
  )
testOverlap
testQuery2 <- c(dmr1, dmr2)
testSubject2 <- refseqGenesDt[gene.id == "FAM87B"]
testSubject2 <- rbind(testSubject2, testSubject)
testQuery2
testSubject2
testOverlap2 <-
  goldmine(
    query = testQuery2,
    genes = testSubject2,
    promoter = c(
      NUMBER_BP_UPSTREAM_FROM_TSS_IN_PROMOTER,
      NUMBER_BP_DOWNSTREAM_FROM_TSS_IN_PROMOTER
    ),
    genome = GENOME,
    cachedir = CACHE_DIR,
  )
testOverlap2

...I got the following output...

> CACHE_DIR <- "gbcache"

> GENOME <- "hg19"

> REFSEQ <- "refseq"

> refseqGenesDt <-  getGenes(REFSEQ, genome = GENOME, cachedir = CACHE_DIR)

> dmr1 <- GRanges("chr1", IRanges(start = 754738, end = 755255))

> dmr2 <- GRanges("chr1", IRanges(start = 783146, end = 783802))

> testQuery <- dmr2

> testSubject <- refseqGenesDt[gene.id == "LINC01128"][2:3]

> testQuery
GRanges object with 1 range and 0 metadata columns:
      seqnames           ranges strand
         <Rle>        <IRanges>  <Rle>
  [1]     chr1 [783146, 783802]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> testSubject
    chr  start    end strand      name   gene.id isoform.id cdsStart cdsEnd exonCount                                 exonStarts
1: chr1 762971 794826      + LINC01128 LINC01128  NR_047519   794826 794826         6 762970,764382,783033,787306,788050,788770,
2: chr1 762971 794826      + LINC01128 LINC01128  NR_047521   794826 794826         5        762970,764382,787306,788050,788770,
                                     exonEnds
1: 763155,764484,783186,787490,788146,794826,
2:        763155,764484,787490,788146,794826,

> testOverlap <-
+   goldmine(
+     query = testQuery,
+     genes = testSubject,
+     promoter = c(
+       NUMBER_BP_UPSTREAM_FROM_TSS_IN_PROMOTER .... [TRUNCATED] 
Computing gene models
Generating context annotation - genes
Generating genes report
Generating exon/intron overlap diagrams

> testOverlap
$context
    chr  start    end width strand qrow promoter_per end3_per exon_per intron_per intergenic_per utr5_per utr3_per call call_genes overlapped_genes
1: chr1 783146 783802   657      *    1            0        0     6.24        100              0        0        0 exon  LINC01128        LINC01128
   nearest_genes distance_to_nearest_gene                                                                           url
1:     LINC01128                        0 http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr1%3A783146-783802

$genes
   qrow srow query.chr query.start query.end gene.symbol   gene.id isoform.id isoform.chr isoform.start isoform.end isoform.strand overlap.bp
1:    1    1      chr1      783146    783802   LINC01128 LINC01128  NR_047519        chr1        762971      794826              +        657
2:    1    2      chr1      783146    783802   LINC01128 LINC01128  NR_047521        chr1        762971      794826              +        657
   query.overlap.per isoform.overlap.per noncoding Promoter
1:               100                2.06      TRUE        0
2:               100                2.06      TRUE        0
                                                                                      ExonIntron 3' End
1: E1 (0), I1 (0), E2 (0), I2 (0), E3 (6.24), I3 (93.91), E4 (0), I4 (0), E5 (0), I5 (0), E6 (0)      0
2:                      E1 (0), I1 (0), E2 (0), I2 (100), E3 (0), I3 (0), E4 (0), I4 (0), E5 (0)      0
                                                                             url
1: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr1%3A783146-783802
2: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr1%3A783146-783802

$features
list()


> testQuery2 <- c(dmr1, dmr2)

> testSubject2 <- refseqGenesDt[gene.id == "FAM87B"]

> testSubject2 <- rbind(testSubject2, testSubject)

> testQuery2
GRanges object with 2 ranges and 0 metadata columns:
      seqnames           ranges strand
         <Rle>        <IRanges>  <Rle>
  [1]     chr1 [754738, 755255]      *
  [2]     chr1 [783146, 783802]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> testSubject2
    chr  start    end strand      name   gene.id isoform.id cdsStart cdsEnd exonCount                                 exonStarts
1: chr1 752751 755214      +    FAM87B    FAM87B  NR_103536   755214 755214         2                             752750,754102,
2: chr1 762971 794826      + LINC01128 LINC01128  NR_047519   794826 794826         6 762970,764382,783033,787306,788050,788770,
3: chr1 762971 794826      + LINC01128 LINC01128  NR_047521   794826 794826         5        762970,764382,787306,788050,788770,
                                     exonEnds
1:                             753582,755214,
2: 763155,764484,783186,787490,788146,794826,
3:        763155,764484,787490,788146,794826,

> testOverlap2 <-
+   goldmine(
+     query = testQuery2,
+     genes = testSubject2,
+     promoter = c(
+       NUMBER_BP_UPSTREAM_FROM_TSS_IN_PROMO .... [TRUNCATED] 
Computing gene models
Generating context annotation - genes
Generating genes report
Generating exon/intron overlap diagrams

> testOverlap2
$context
    chr  start    end width strand qrow promoter_per end3_per exon_per intron_per intergenic_per utr5_per utr3_per   call call_genes
1: chr1 754738 755255   518      *    1            0      100    92.08          0           7.92        0        0 3' end     FAM87B
2: chr1 783146 783802   657      *    2            0        0     6.24        100           0.00        0        0   exon  LINC01128
   overlapped_genes nearest_genes distance_to_nearest_gene                                                                           url
1:           FAM87B        FAM87B                        0 http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr1%3A754738-755255
2:        LINC01128     LINC01128                        0 http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr1%3A783146-783802

$genes
   qrow srow query.chr query.start query.end gene.symbol   gene.id isoform.id isoform.chr isoform.start isoform.end isoform.strand overlap.bp
1:    1    1      chr1      754738    755255      FAM87B    FAM87B  NR_103536        chr1        752751      755214              +        477
2:    2    2      chr1      783146    783802   LINC01128 LINC01128  NR_047519        chr1        762971      794826              +        657
3:    2    3      chr1      783146    783802   LINC01128 LINC01128  NR_047521        chr1        762971      794826              +        477
   query.overlap.per isoform.overlap.per noncoding Promoter
1:             92.08               19.36      TRUE        0
2:            100.00                2.06      TRUE        0
3:             92.08                2.06      TRUE        0
                                                                                      ExonIntron 3' End
1:                                                                    E1 (0), I1 (0), E2 (92.08)    100
2: E1 (0), I1 (0), E2 (0), I2 (0), E3 (6.24), I3 (93.91), E4 (0), I4 (0), E5 (0), I5 (0), E6 (0)      0
3:                      E1 (0), I1 (0), E2 (0), I2 (100), E3 (0), I3 (0), E4 (0), I4 (0), E5 (0)      0
                                                                             url
1: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr1%3A754738-755255
2: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr1%3A783146-783802
3: http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr1%3A783146-783802

$features
list()


> testOverlap$genes[, c(1:14)]
   qrow srow query.chr query.start query.end gene.symbol   gene.id isoform.id isoform.chr isoform.start isoform.end isoform.strand overlap.bp
1:    1    1      chr1      783146    783802   LINC01128 LINC01128  NR_047519        chr1        762971      794826              +        657
2:    1    2      chr1      783146    783802   LINC01128 LINC01128  NR_047521        chr1        762971      794826              +        657
   query.overlap.per
1:               100
2:               100

> testOverlap2$genes[, c(1:14)]
   qrow srow query.chr query.start query.end gene.symbol   gene.id isoform.id isoform.chr isoform.start isoform.end isoform.strand overlap.bp
1:    1    1      chr1      754738    755255      FAM87B    FAM87B  NR_103536        chr1        752751      755214              +        477
2:    2    2      chr1      783146    783802   LINC01128 LINC01128  NR_047519        chr1        762971      794826              +        657
3:    2    3      chr1      783146    783802   LINC01128 LINC01128  NR_047521        chr1        762971      794826              +        477
   query.overlap.per
1:             92.08
2:            100.00
3:             92.08
Warning messages:
1: In data.table(qrow = qrow, srow = srow, query.chr = query.chr, query.start = query.start,  :
  Item 13 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
2: In data.table(qrow = qrow, srow = srow, query.chr = query.chr, query.start = query.start,  :
  Item 14 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
> 

...wherein it appears that the second of the two results from goldmine appears to have some corrupted values for testOverlap2$genes[, 13:14] as compared to that of testOverlap$genes[, 13:14].

In particular in forming testOverlap2, it appears that the internal results for the two columns 13 and 14 (both of length 2) have been recycled in order to fit the result columns (both of length 3). Looking at the code further in the updated version of the method goldmine:::calcPercentOverlap, it appears that the erstwhile code #subject.gr <- reduce(subject.gr) has been reinstated - perhaps intended to fix a prior error.

Hope this helps.

Best,
Carl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants