Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Could GenomicRanges::binnedAverage() be reused for other purposes? #77

Open
ssayols opened this issue Mar 23, 2023 · 4 comments
Open

Could GenomicRanges::binnedAverage() be reused for other purposes? #77

ssayols opened this issue Mar 23, 2023 · 4 comments

Comments

@ssayols
Copy link

ssayols commented Mar 23, 2023

Hi Herve,
this is more of a suggestion rather than a bug.
Would it make sense to make the binnedAverage() function more general, in a way that it could compute more than just the mean? If I understand the code, it's relatively straightforward to call any function in IRanges::view*().
Something like this (please notice the new parameter at the end of the header. Defaults to IRanges::viewMeans to mimic the behavior of binnedAveage()):

binnedView <- function(bins, numvar, varname, na.rm=FALSE, fun=IRanges::viewMeans) {
    if (!is(bins, "GRanges")) 
        stop("'x' must be a GRanges object")
    if (!is(numvar, "RleList")) 
        stop("'numvar' must be an RleList object")
    if (!identical(seqlevels(bins), names(numvar))) 
        stop("'seqlevels(bin)' and 'names(numvar)' must be identical")
    fun2 <- function(v, na.rm = FALSE) {
        if (!isTRUEorFALSE(na.rm)) 
            stop("'na.rm' must be TRUE or FALSE")
        result <- fun(v, na.rm = na.rm)
        w0 <- width(v)
        v1 <- trim(v)
        w1 <- width(v1)
        if (na.rm) {
            na_count <- sum(is.na(v1))
            w0 <- w0 - na_count
            w1 <- w1 - na_count
        }
        result <- result * w1/w0
        result[w0 != 0L & w1 == 0L] <- 0
        result
    }
    bins_per_chrom <- split(ranges(bins), seqnames(bins))
    result_list <- lapply(names(numvar), function(seqname) {
        v <- Views(numvar[[seqname]], bins_per_chrom[[seqname]])
        fun2(v, na.rm = na.rm)
    })
    new_mcol <- unsplit(result_list, as.factor(seqnames(bins)))
    mcols(bins)[[varname]] <- new_mcol
    bins
}

This would enable other ways of aggregating signal in bins (eg. by setting fun=IRanges::viewSums).

cheers,
Sergi

@cgroza
Copy link

cgroza commented Feb 2, 2024

I just tried to use this function to compute averages for 15 million bins.
It was taking hours to complete.
Meanwhile, my own solution with findOverlaps and dplyr did the same in 5 minutes.
I don't know why it's inefficient but I wouldn't rely on it too much.

@ssayols
Copy link
Author

ssayols commented Feb 2, 2024

I posted this function here without any warranty that works efficiently for all uses cases. Please keep in mind this is not part of GenomicRanges, nor am I a contributor of the package. Just a layman.

Nevertheless, GenomicRanges::binnedAverage() takes barely 2 seconds to compute the average signal for 15 million bins in my 6 years old laptop:

library(GenomicRanges)
bins <- unlist(tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg38::Hsapiens), ntile=15e6))
signal <- GRanges("chr1:1")   # a phony signal track
seqinfo(signal) <- seqinfo(bins)
system.time({
  avg <- binnedAverage(bins, coverage(signal), "average_coverage")
})
   user  system elapsed 
  2.217   0.251   2.471 

@ssayols
Copy link
Author

ssayols commented Feb 2, 2024

Btw I just tried the function above (binnedView()) to compute the average signal of 15M bins, and it also takes ~2 seconds to run:

system.time({
  avg <- binnedView(bins, coverage(signal), "average_coverage", fun=IRanges::viewMeans)
})

   user  system elapsed 
  2.213   0.124   2.338

Perhaps your data (or data structures) are innappropriate?

@cgroza
Copy link

cgroza commented Feb 2, 2024

I must be doing something wrong then. I am using an Rle object.
Do I have to sort the data beforehand to achieve this performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants