Specialize findOverlaps for GRangesFactor objects #28

LTLA · 2019-06-22T04:56:21Z

This should improve efficiency of the overlaps... which was the whole aim of this class in the first place.

hpages · 2019-06-23T08:00:58Z

Hmm, the findOverlaps(..., select="all") + selectHits() strategy is certainly going to slow things down a lot in some situations. This is because the findOverlaps#GenomicRanges#GenomicRanges method is highly optimized when select is "first", "last", or "arbitrary". In these cases the method collects at most 1 hit per query (and stores it directly in the integer vector to return) rather than collecting all the hits in a Hits object only to drop most of them later. In addition, the length of the integer vector is known in advance, so the vector can be pre-allocated. In the select="all" case, the final size of the Hits object is not known in advance, so the object cannot be pre-allocated and has to be grown via re-allocations and copies:

library(GenomicRanges)
query <- GRanges("chr1", IRanges(1, 1:9500))
subject <- GRanges("chr1", IRanges(1:9500, 9500))
system.time(q2s <- findOverlaps(query, subject, select="arbitrary"))
#    user  system elapsed 
#   0.029   0.000   0.031 
system.time(hits <- findOverlaps(query, subject, select="all"))
#    user  system elapsed 
#   2.948   0.407   3.355

The more hits per query (on average), the worse select="all" will perform relative to select="first", "last", or "arbitrary".
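For reference, a minimal sketch of the two strategies being compared, on a smaller input than the benchmark above. The variable names (direct, indirect) are only illustrative. It checks that both routes return the same integer vector, so the difference is purely in how much intermediate work is done:

```r
library(GenomicRanges)

# Same shape of data as the benchmark, but small enough to run instantly.
query <- GRanges("chr1", IRanges(1, 1:200))
subject <- GRanges("chr1", IRanges(1:200, 200))

# Direct route: at most one hit is recorded per query.
direct <- findOverlaps(query, subject, select="first")

# Indirect route: collect every hit in a Hits object, then keep one per query.
hits <- findOverlaps(query, subject, select="all")
indirect <- selectHits(hits, select="first")

# Both routes agree; only the intermediate Hits object differs.
stopifnot(identical(direct, indirect))

# The intermediate object holds many more hits than there are queries.
length(hits) > length(query)
```

The intermediate Hits object grows with the number of overlapping pairs, while the direct result always has one entry per query.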

The select="arbitrary" case is the workhorse behind overlapsAny(), %over%, and %within%.
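A small sketch of that relationship, assuming (as far as I can tell from the behaviour, not from reading the method source) that overlapsAny() amounts to checking which queries got a non-NA result under select="arbitrary". The ranges are made up for illustration:

```r
library(GenomicRanges)

# Three queries; only the first two overlap something in the subject.
query <- GRanges("chr1", IRanges(c(1, 50, 500), width=10))
subject <- GRanges("chr1", IRanges(c(5, 45), width=10))

# One integer per query: a subject index, or NA when there is no overlap.
q2s <- findOverlaps(query, subject, select="arbitrary")

# A query "overlaps anything" exactly when its entry is not NA.
stopifnot(identical(!is.na(q2s), overlapsAny(query, subject)))
stopifnot(identical(!is.na(q2s), query %over% subject))
```

So any slowdown in the select="arbitrary" path would also show up in these functions.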

LTLA · 2019-06-23T16:50:37Z

Well, I can't say it was easy, but select!="all" optimizations are done. Note that the lack of special behaviour for a GRF subject when select="arbitrary" is deliberate; I'd have to unique the indices anyway to ensure that the query doesn't select a range that isn't used.
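To illustrate the query-side idea without relying on the GRangesFactor class itself, here is a rough sketch that uses a plain GRanges of unique ranges plus an integer index as a stand-in. The names levels_gr and index are hypothetical, and this is not the code in the PR. It only shows why overlapping the unique ranges once and then expanding by the index gives the same answer when select!="all":

```r
library(GenomicRanges)

# Stand-in for a factor: a few unique ranges ("levels") and an index into them.
levels_gr <- GRanges("chr1", IRanges(c(1, 100, 1000), width=20))
index <- c(3L, 1L, 1L, 2L, 3L, 3L, 1L)

subject <- GRanges("chr1", IRanges(c(10, 110, 115), width=10))

# Overlap each unique range once, then expand the result by the index.
per_level <- findOverlaps(levels_gr, subject, select="first")
expanded <- per_level[index]

# Naive route: materialize every element and overlap them all.
naive <- findOverlaps(levels_gr[index], subject, select="first")

stopifnot(identical(expanded, naive))
```

The saving comes from running the overlap on the unique ranges only, which matters when the index is much longer than the set of levels.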

LTLA · 2019-09-04T04:43:25Z

Nudge.

LTLA added 6 commits June 21, 2019 21:34

Added specialized findOverlaps methods for GRangesFactors.

59a0302

Tested new GRF findOverlaps methods.

4b77041

Mentioned new methods in docs.

72efa6a

Allow efficient bypass for small Factors with many levels.

c2b519a

Accommodate other choices of select=.

a572e80

Allow GRFs to overlap GRLs.

ddccec8

hpages self-assigned this Jun 23, 2019

LTLA added 2 commits June 23, 2019 09:46

Avoid creating Hits for efficiency when select!='all'.

2a4b8db

Minor testfixes.

6a40481

Merge branch 'master' of https://github.com/Bioconductor/GenomicRanges

6dd1ec5
