Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strand-aware expand (bedtools slop) #144

Open
ivirshup opened this issue Apr 18, 2023 · 13 comments
Open

Strand-aware expand (bedtools slop) #144

ivirshup opened this issue Apr 18, 2023 · 13 comments

Comments

@ivirshup
Copy link
Contributor

Towards getting TSS regions from genomic annotations: it would be nice if expand could be strand aware.

Right now, expand only lets you choose "left", "right", or "both". I would like to be able for this to take into account strand direction, similar to bedtools slop's or bedtools flank's -s argument allows.

Related to:

@gfudenberg
Copy link
Member

thanks for the suggestion!

any suggestions on the API?

I wonder if we're not overloading some of the core bioframe functions with arguments. E.g. the current implementation of closest is pretty heavy with 15 arguments. Would it be more user-friendly to have a separate function expand_by_strand() in extras?

Relatedly, any thoughts on the balance between keeping the functions lean & encourage users to compose functions to achieve their goals, vs. adding this sort of functionality? (which could, e.g., also be implemented as two calls to expand).

@ivirshup
Copy link
Contributor Author

any suggestions on the API?

Probably a boolean flag, similar to bedtools.

Alternatively, more options for direction?

I wonder if we're not overloading some of the core bioframe functions with arguments.

I see your point with closest potentially having too many arguments. I personally don't like arguments whose value changes the type of the result (though am more okay with the type of the argument changing the type of the result).

However, I don't think this is an issue here. It's one argument and it's quite clear how it interacts with the other arguments.

As an argument from prior art: bedtools does have a pretty nice interface. I don't think you can go that wrong copying it.

thoughts on the balance between keeping the functions lean & encourage users to compose functions

Ha, tough question. I think it always requires a lot of thought. It's especially hard in a language like python or R where the language itself is slow, so we actually need to wrap quite a lot of operations in a small number of calls. Some relevant points here:

  • How simple is it to accomplish without this functionality?

I don't think this one is that easy to accomplish. My initial thought here using just the pandas and bioframe APIs would be: split by strand, apply expand to each, concatenate the results, and likely need to re-sort.

(Oh, you may also need to make sure you don't expand any intervals over chromosome bounds, but that might be a separate issue.)

  • How much more efficient can you make it?

The logic for doing this in a strand aware way could look something like:

from typing import Literal

from numba import njit
import numpy as np, pandas as pd


def expand(df: pd.DataFrame, amount: int, strand: Literal["+", "-"]) -> pd.DataFrame:
    result = df.copy()
    left, right = result["start"].values, result["end"].values
    result["start"], result["end"] = _expand(left, right, amount, (df["strand"] == strand).values)
    return result


@njit
def _expand(left, right, amount, strand):
    N = len(left)
    for i in range(N):
        if strand[i]:
            right[i] += amount
        else:
            left[i] -= amount
    return left, right

This is going to be much more efficient that the pandas/ bioframe solution I thought up.

@nvictus
Copy link
Member

nvictus commented May 8, 2023

Discussing this afternoon, we see this as an opportunity to come up with a good semantics for strand-awareness, which was also recently added to closest.

Two proposals came up:

  1. It would be good to make upstream & left synonyms, as well as downstream and right.
  2. Replace the ignore_upstream and ignore_downstream boolean flags in closest with a controlled vocab argument of options: left/upstream, right/downstream, both, and something that designates a column of the dataframe, e.g. strand. Provide the name of the strand column in another argument.

This convention could apply to closest, expand and other functions.

P.S. Thank you for getting the ball rolling in #152!

@ivirshup
Copy link
Contributor Author

For (1): I also personally find left and right easier, but wonder how much that comes from natively speaking a language that reads left-to-right.

But also bedtools seems kinda inconsistent about this stuff, suggesting it's hard to decide. Could be worth asking the authors what they prefer these days?


I would like to pitch the other direction of (2): left/ right not as literal options, but as arguments (like bedtools slop). E.g:

def expand(
    df: pd.DataFrame,
    both: int | None = None,
    *,
    left: int | None = None,
    right: int | None = None,
    strand_aware: bool = False,
) -> pd.DataFrame:
    if both is not None and (left is not None or right is not None):
        raise ValueError("Cannot specify both and left/right")
    ...

This collapses it to one argument, instead of two and lets you do things like expand(df, left=5, right=10). For closest, this could look like:

def closest(
    df1, df2, *, left: bool = True, right: bool = True, ...
) -> pd.DataFrame:

P.S. Thank you for getting the ball rolling in #152!

Thanks to @emdann for coming up with the elegant solution!

@agalitsyna
Copy link
Member

agalitsyna commented May 15, 2023

  1. The solution above changes the names from "upstream/downstream" to "left/right", which is inconsistent across bioframe tools, e.g., in closest, we have this definition of upstream/downstream. The reasoning for separating downstream/upstream from left/right is that orientation is not the same as strandedness, and we probably want to keep that consistent across tools.

  2. Another confusion is that the values of "upstream/downstream/left/right" are more natural to be set as booleans and not integers with a single parameter if we want to set it to be universal across the tools. If these are integers, it needs to be clarified in what units they should be reported: the number of basepairs of the genome or the number of neighbors.

One of the solutions might be defining the "search space" for closest or "expansion space" for expected:

bioframe.closest(df_oriented, df, 
    overlaps: bool | True, # whether to expand the search space to overlapping regions
    upstream: bool | True, # -"- expand the search space to upstream
    downstream: bool | True, # -"- downstream
    direction_col=None)
bioframe.expand(df, 
    pad: int, # The amount by which the intervals are additively expanded.
    upstream: bool | True, # -"- expand to upstream
    downstream: bool | True, # -"- downstream
    direction_col=None
    )

@gfudenberg
Copy link
Member

gfudenberg commented May 15, 2023

another issue related to expand() came up in discussions today, namely:
what is the most intuitive behavior for expand() with the "scale" argument + "upstream" or "downstream"?

some ideas:

  • disallow "scale" without "both"
  • split into a separate function that is strand-unaware, e.g. scale()
  • scale from the midpoint of the interval, so if scale=2 and "upstream" then the interval is scaled to be twice as large with all the additional size of the interval is on the "upstream" side
  • keep current behavior (scale=2 and "upstream" scales the interval size by 1.5-fold, with additional size added on the left)

@ivirshup
Copy link
Contributor Author

ivirshup commented May 16, 2023

The solution above changes the names from "upstream/downstream" to "left/right",

upstream/ downstream is probably the better option. Just wish they were shorter names 😆

Another confusion is that the values of "upstream/downstream/left/right" are more natural to be set as booleans and not integers with a single parameter if we want to set it to be universal across the tools.

I don't think that it's super important that argument name is consistent with argument type across functions. Especially for a concept as deeply ingrained into the core data structure as upstream/ downstream, I think there will be many different kinds of values that could be appropriate for different functions.

what is the most intuitive behavior for expand() with the "scale" argument

Going a bit further on the idea of having upstream: int arguments – I had also thought it would be a natural extension if one could pass not only a single integer, but a vector of integers the same length as the input dataframe. So like:

# Expand upstream by half the gene length
bf.expand(genes, upstream=((genes["stop"] - genes["start"]) * 0.5).round())
Longer example: Cut out first exon
first_exon = (
    genes
    .join(exons, how="left", on="gene_id")
    .groupby("gene_id")
    .sort_values("start")
    .first()
) 
# Assuming first_exon is the same order as genes
bf.expand(
    genes,
    upstream=first_exon["start"] - first_exon["stop"],
)

This could cover the scale use case, and be much more flexible besides.

@nvictus
Copy link
Member

nvictus commented Apr 4, 2024

I hate to bikeshed more on this, but I think it's important to get this right, because semantics of "direction" (absolute/reference orientation or relative to a supplied orientation per record) and "orientation" (i.e. strandedness of a feature) can get extremely confusing for end users.

I think when speaking of the direction of an operation, e.g. the search direction of closest or the direction in which to expand an interval, there exist:

  • Absolute specifiers: e.g. left, right
  • Relative specifiers: I'm more partial to forward and reverse than upstream and downstream, but these could be synonyms. In absence of an declared orientation, "forward" can be taken to be "right" and "reverse" to be "left.

However, a nice alternative is to not use specifiers at all and accept a tuple (reverse, forward) to make a slightly modified version of @ivirshup's proposal for expand:

bioframe.expand(df, 5)  # symmetric
bioframe.expand(df, df["slop"])  # symmetric, vectorized

bioframe.expand(df, (6, 5))
bioframe.expand(df, (rev_array, fwd_array))

# single-sided
bioframe.expand(df, (rev, 0))
bioframe.expand(df, (0, fwd))

I suggest that the orientation that defines the direction of an operation not be called direction_col because it supplies orientation information. I propose something more intuitive like along=.

bioframe.expand(df, (6, 5), along="strand")

where the convention would be to map features with "strand" field value "+" to use (6 left, 5 right), features with "strand" field value "-" to use (6 right, 5 left), and features with any other strand value to be dropped.

The along= kwarg could also be vectorized to use a foreign array or column:

bioframe.expand(df, (rev_slops, fwd_slops), along=strands)

I would finally advocate to move the "scale" use case to a new function.

Thoughts?

Thanks to @manzt for the tuple suggestion!

@Phlya
Copy link
Member

Phlya commented Apr 4, 2024

features with "strand" field value "+" to use (3 left, 5 right), features with "strand" field value "-" to use (3 right, 5 left)

If by 5 and 3 you mean 5-prime and 3-prime ends, shouldn't it be the opposite? E.g. 5-prime on the positive strand is on the left.

@Phlya
Copy link
Member

Phlya commented Apr 4, 2024

Or is it just "random" numbers that confused me because they match the DNA ends? :)

@nvictus
Copy link
Member

nvictus commented Apr 4, 2024

Oh man, yeah, 5 and 3 were just random... Bad choice, low on caffeine :)
I'll change them.

@nvictus
Copy link
Member

nvictus commented Apr 4, 2024

Perhaps this has a nice mapping to closest where k could be a single value or tuple and paired with along=.

@nvictus
Copy link
Member

nvictus commented Apr 7, 2024

I've updated @emdann's draft to introduce a new function bioframe.shift to perhaps replace expand.

Like the proposal above, shift takes a single or a pair of shift arguments. As a pair, the first value refers to the "leading" bound and the second to the "trailing" bound of each interval (I'm avoiding the words "start" and "end" for obvious reasons). Unlike expand, shift interprets sign such that negative always means upstream and positive downstream.

That means that a single shift value will translate intervals by moving the leading and trailing bounds the same amount in the same direction, while pairs of opposing signs will shrink or expand intervals. The along parameter makes the operation strand-aware, which for "-" stranded features treats "end" as the leading bound, "start" as the trailing bound, and operates opposite the reference direction. Unstranded features are ignored, i.e. passed through unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants