Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

subtract documentation #83

Open
gevro opened this issue Dec 28, 2023 · 6 comments
Open

subtract documentation #83

gevro opened this issue Dec 28, 2023 · 6 comments

Comments

@gevro
Copy link

gevro commented Dec 28, 2023

Hi,
subtract says that metadata columns of x are propagated, but that doesn't seem to be happening:
"The names and metadata columns on x are propagated to the result."

> gr2 <- GRanges("a:1-10",score=10)
> gr1 <- GRanges("a:2-9")
> subtract(gr2,gr1)
GRangesList object of length 1:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a         1      *
  [2]        a        10      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

I think either the documentation is wrong, or subtract code is wrong.

Thanks

@hpages
Copy link
Contributor

hpages commented Jan 19, 2024

You don't see them but the metadata columns are here:

library(GenomicRanges)

gr1 <- GRanges(c("a:1-10", "a:8-20"), score=10:11)
gr2 <- GRanges("a:2-9")
result <- subtract(gr1, gr2)

mcols(result)
# DataFrame with 2 rows and 1 column
#      score
#   <integer>
# 1        10
# 2        11

As explained in the man page (?subtract) the result is a GRangesList object with the metadata columns of the first argument (gr1) on it. The gotcha here is that we do NOT display the top-level metadata columns of a GRangesList object (there's no easy way to do that), only its inner metadata columns:

grl <- GRangesList(
    GENE1=GRanges("chr1", IRanges(11:15, 20), tx_id=1:5),
    GENE2=GRanges("chr2", IRanges(4:1, 10, tx_id=6:9))
)
mcols(grl)$gene_alias <- c("foo", "bar")

grl
# GRangesList object of length 2:
# $GENE1
# GRanges object with 5 ranges and 1 metadata column:
#       seqnames    ranges strand |     tx_id
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chr1     11-20      * |         1
#   [2]     chr1     12-20      * |         2
#   [3]     chr1     13-20      * |         3
#   [4]     chr1     14-20      * |         4
#   [5]     chr1     15-20      * |         5
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths
# 
# $GENE2
# GRanges object with 4 ranges and 1 metadata column:
#       seqnames    ranges strand |     tx_id
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chr2      4-10      * |         6
#   [2]     chr2      3-10      * |         7
#   [3]     chr2      2-10      * |         8
#   [4]     chr2      1-10      * |         9
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

mcols(grl)
# DataFrame with 2 rows and 1 column
#        gene_alias
#       <character>
# GENE1         foo
# GENE2         bar

Does this help?

@gevro
Copy link
Author

gevro commented Jan 19, 2024

Makes sense. But then why is the output of subtract of two Geanges a GRangesList instead of GRanges?

@hpages
Copy link
Contributor

hpages commented Jan 19, 2024

But then why is the output of subtract of two Geanges a GRangesList instead of GRanges?

Because each range in the 1st GRanges object can produce several fragments after the subtraction so we use a GRangesList representation to group the fragments that are coming from the same original range together. Like here where the subtraction breaks the first range in x in 2 fragments:

x <- GRanges(c(A="chr1:1-50", B="chr1:40-110", C="chrX:1-500"))
y <- GRanges(c("chr1:21-25", "chr1:38-150"))
z <- subtract(x, y)
z
# GRangesList object of length 3:
# $A
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1      1-20      *
#   [2]     chr1     26-37      *
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths
# 
# $B
# GRanges object with 0 ranges and 0 metadata columns:
#    seqnames    ranges strand
#       <Rle> <IRanges>  <Rle>
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths
# 
# $C
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chrX     1-500      *
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

By doing this, the operation preserves positionality i.e. the i-th element in the output corresponds to the i-th element in the input (i.e. in the first GRanges object). Another way to describe this is to say that the output is parallel to the input, which is a nice property.

If you don't like or don't need the grouping then you can unlist() the result as shown in the examples from the man page:

unlist(z)
# GRanges object with 3 ranges and 0 metadata columns:
#     seqnames    ranges strand
#        <Rle> <IRanges>  <Rle>
#   A     chr1      1-20      *
#   A     chr1     26-37      *
#   C     chrX     1-500      *
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

But then you loose positionality and the top-level metadata (if the GRangesList object had any).

@gevro
Copy link
Author

gevro commented Jan 19, 2024

Understood. Thanks very much for explaining. The only issue is what if someone wanted a GRanges output while also preserving the metadata columns.

In a prior github issue you had proposed the below code, which does the above, but I was wondering whether the subtract function has a way to do this? Based on what you are saying, it seems not?

GRanges_subtract <- function(gr1, gr2, ignore.strand=FALSE){
    gr2 <- reduce(gr2, ignore.strand=ignore.strand)
    hits <- findOverlaps(gr1, gr2, ignore.strand=ignore.strand)
    ans <- psetdiff(gr1, extractList(gr2, as(hits, "IntegerList")))
    unlisted_ans <- unlist(ans, use.names=FALSE)
    mcols(unlisted_ans) <- extractROWS(mcols(gr1),Rle(seq_along(ans), lengths(ans)))
    unlist(setNames(relist(unlisted_ans, ans), names(gr1)))
}

@hpages
Copy link
Contributor

hpages commented Jan 19, 2024

In that case you need to manually propagate the top-level metadata columns of the GRangesList object to its unlisted form. This can be done with:

x <- GRanges(c(A="chr1:1-50", B="chr1:40-110", C="chrX:1-500"), score=11:13)
y <- GRanges(c("chr1:21-25", "chr1:38-150"))
z <- subtract(x, y)

z2 <- unlist(z)  
mcols(z2) <- mcols(z)[rep.int(seq_along(z), lengths(z)), , drop=FALSE]
z2
# GRanges object with 3 ranges and 1 metadata column:
#     seqnames    ranges strand |     score
#        <Rle> <IRanges>  <Rle> | <integer>
#   A     chr1      1-20      * |        11
#   A     chr1     26-37      * |        11
#   C     chrX     1-500      * |        13
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

This specialized version of unlist() could be factored in a dedicated helper function e.g.:

unlist_with_mcols <- function(x)
{
    stopifnot(is(x, "List"))
    ans <- unlist(x)
    mcols(ans) <- mcols(x)[rep.int(seq_along(x), lengths(x)), , drop=FALSE]
    ans
}

Using this, you'll get what you are after by doing just unlist_with_mcols(subtract(...)).

I should probably bake this functionality into subtract() itself by adding an unlist argument to it (would be FALSE by default to preserve the current behavior).

@gevro
Copy link
Author

gevro commented Jan 19, 2024

Yes, I think it would be helpful to have a parameter for subtract that allows to either unlist or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants