Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient sequence alignment path data structure #1974

Open
qiyunzhu opened this issue Mar 17, 2024 Discussed in #1973 · 1 comment
Open

Efficient sequence alignment path data structure #1974

qiyunzhu opened this issue Mar 17, 2024 Discussed in #1973 · 1 comment
Assignees

Comments

@qiyunzhu
Copy link
Contributor

Discussed in #1973

@mortonjt @wasade Will appreciate your thoughts!

Originally posted by qiyunzhu March 17, 2024
Here I am describing the plan to implement a multiple and pairwise sequence alignment path data structure in scikit-bio. In summary:

  • Central idea: run-length encoding (RLE) and bit array.
  • It is detached from the original sequences.
  • It is highly memory efficient, compared with tabular alignments.
  • It permits fully vectorized operations, therefore time efficient.
  • It can be efficiently converted to/from various formats, such as CIGAR (supported by many alignment libraries), tabular (original format in scikit-bio), indices (Biotite), coordinates (Biopython), etc.

I guess there might be similiar implementations in the past, although I haven't come across them through limited literature search. Will appreciate it if anyone can inform me of the relevant literature.

A Jupyter notebook implementing a prototype of this data structure and relevant functionality is attached.
align_path.zip

Will appreciate any feedback from the team and community!


For example, we have a multiple sequence alignment (MSA) consisting of three aligned DNA sequences.

seq1: CGGTCGTAACGCGTA---CA
seq2: CAG--GTAAG-CATACCTCA
seq3: CGGTCGTCAC-TGTACACTA

Suggested solution

I suggest the following solution, which is essentially a combination of run-length encoding (RLE) and bit array. Through reasoning, comparison, and benchmarking, I concluded that it is the most suitable for the variety of target applications of scikit-bio.

We divide the alignment position-wise into segments based on gap status. Each segment represents a consecutive block that has a consistent gap status in all sequences. In this example, the alignment can be divided into seven segments.

seq1: CGG TC GTAAC G CGTA --- CA
seq2: CAG -- GTAAG - CATA CCT CA
seq3: CGG TC GTCAC - TGTA CAC TA

Calculate the lengths of segments:

lens: 3 2 5 1 4 3 2

And the gap status of segments (1 - gap, 0 - character):

gap1: 0 0 0 0 0 1 0
gap2: 0 1 0 1 0 0 0
gap3: 0 0 0 1 0 0 0

The gap status can be encoded into bits:

gaps: 0 2 0 6 0 1 0

Therefore, we only need two equal-length vectors: lengths, and gap status (bits), to represent an alignment path.

[3 2 5 1 4 3 2]
[0 2 0 6 0 1 0]

Data type

The length vector can be a simple integer array.

The gap status vector is tricky. One can only pack up to 64 bits into a uint64 integer, which limits the number of sequences in the alignment. Therefore, one needs to use np.packbits to convert the 0-1 array into a 2D uint8 array. When the gap status needs to be accessed, one needs to use np.unpackbits to extract 0-1 values from the 2D array. This method ensures efficient storage, but creates some overhead in packing/unpacking.

Pairwise alignment

Pairwise alignment (i.e., aligning two sequences) is probably the more important use case than aligning three or more sequences. A notable application is sequence similarity search, which is essential in working with high-throughput sequencing data. I noticed that the majority of alignment libraries only support pairwise alignment.

The suggested solution is the most efficient with pairwise alignments. It is consistent with CIGAR, the most commonly used representation of pairwise alignment. According to the SAM specification:

Int Code  Meaning
 0   M    (mis)match (no gap)
 1   I    insertion (gap in query)
 2   D    deletion (gap in reference)
...
  • Note: In pairwise alignment, seq1 and seq2 are usually referred to as query and reference (or target, subject).

In the aforementioned example, if we only consider the first two sequences:

seq1: CGGTCGTAACGCGTA---CA
seq2: CAG--GTAAG-CATACCTCA

The CIGAR string is:

3M2D5M1D4M3I2M

The suggested solution is:

[3 2 5 1 4 3 2]
[0 2 0 2 0 1 0]

Some libraries use a similar data structure. For example, in pysam, a pairwise alignment is stored as cigartuples, which a list of tuples of (length, status) (i.e., the transpose of the suggestion solution).

It is beneficial to have a separate PairAlignPath class, which is a subclass of AlignPath. This subclass has additional methods (such as from_cigar, to_cigar), and has further optimized parsers (since it doesn't need to pack/unpack bits).

This subclass will serve as the bridge with multiple pairwise alignment libraries, such as Parasail and pyWFA.

Applications

The suggested solution may facilitate some downstream applications.

Alignment score

Calculation of alignment scores is a basic operation. Many alignment libraries do not have the capability of doing this for a user-provided alignment. Therefore, having this functionality in scikit-bio will create a platform for evaluating and benchmarking alignment methods.

The suggested solution is suitable for alignment score calculation. Most importantly, gap penalties can be conveniently calculated based on the lengths of gap segments, without even knowing the original sequences. Various gap penalty schemes (linear, affine, dual affine, convex...) can be calculated based on this information.

Meanwhile, match/mismatch scores can be calculated by converting the alignment path into the indices of characters in the original sequences that are part of the aligned regions (i.e., ignoring unaligned segments). Although instraightforward (and less performant than directly operating on the tabular MSA), this can still be vectorized, therefore its performance is okay.

def align_score(path, seqs):
    gap_segs = path.states != 0
    n_gap_opens = gap_segs.sum()
    n_gap_extends = (gap_segs * path.lengths).sum()

    idx = path.to_indices(gap="del")  # see below
    seq1, seq2 = (x._bytes[i] for x, i in zip(seqs, idx))
    n_matches = (seq1 == seq2).sum()
    n_mismatches = seq1.size - n_matches
    ...

In my tests, this function outperforms a Python for loop on aligned sequences when the sequences are long.

Multiple alignment

Multiple sequence alignment is computationally difficult (an exact solution is exponential to sequence count). Therefore, modern algorithms rely on various heuristics to accelerate this process. The most commonly used method, progressive alignment, is a greedy algorithm that iteratively merges smaller alignments into larger ones. The smaller alignments will remain unchanged after they were constructed. That is, "once a gap, always a gap". (Although modern algorithms will refine gaps after progressive alignments.)

I envision (guess) that progressive alignment may benefit from the suggestion solution, as it encodes gaps in separation from the original sequences, making it convenient to add gaps on top of gaps. However, I have not yet tested this.

Format conversions

Tabular MSA

scikit-bio's current data structure, TabularMSA, stores gapped sequences in an alignment. This is probably the most intuitive representation, despite being highly inefficient.

['CGGTCGTAACGCGTA---CA',
 'CAG--GTAAG-CATACCTCA',
 'CGGTCGTCAC-TGTACACTA']

The suggested solution can convert from/to a tabular structure using efficient vectorized operations (assuming msa is a 2D array of bytes).

def from_tabular():
    bits = (msa == gap_char)
    idx = np.where(np.diff(bits, prepend=np.nan))[0]
    lens, gaps = np.diff(idx, append=msa.shape[1]), bits[idx]
def to_tabular():
    gaps = np.repeat(bits, lens, axis=1)
    msa = np.full(bits.shape, gap_char, dtype=np.uint8)
    msa[gaps == 0] = np.concatenate(seqs)

Indices

Biotite's Alignment object contains sequences, trace, and score. This design also separate original sequences and the alignment path ("trace"). For the trace, the documentation says:

The trace is a (m x n) ndarray with alignment length m and sequence count n. Each element of the trace is the index in the corresponding sequence. A gap is represented by the value -1.

In the aforementioned example, the trace (after transposing) is:

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, -1, -1, -1, 15, 16],
 [0, 1, 2, -1, -1, 3, 4, 5, 6, 7, -1, 8, 9, 10, 11, 12, 13, 14, 15, 16],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, -1, 10, 11, 12, 13, 14, 15, 16, 17, 18]]

This data structure facilitates efficient vectorized operations, despite being less efficient in memory space (it consumes more space than the original sequences).

It is convenient to convert between the suggested solution and the Biotite Alignment:

def to_indices():
    pos = np.repeat(1 - bits, self.lengths, axis=1)
    idx = np.cumsum(pos, axis=1, dtype=int) - 1
    indices = np.where(pos, idx, -1)
def from_indices():
    bits = indices == -1
    ...

Coordinates

Biopython's Alignment object consists of sequences and coordinates. The coordinates define the positions of segments in the original sequences. Specifically, each coordinate is the index of the first character in a segment, if this sequence contains this segment, or equal to the previous coordinate, if not.

In the aforementioned example, the coordinates are:

[[0, 3, 5, 10, 11, 15, 15, 17],
 [0, 3, 3,  8,  8, 12, 15, 17],
 [0, 3, 5, 10, 10, 14, 17, 19]])

This data structure also employs segments (the same as in the suggested solution). Therefore, it is also memory efficient (but less so than the suggested solution. The coordinates rather then lengths make it more convenient to locate individual segments, which is an advantage over the suggested solution.

It is convenient to convert between the suggested solution and the BioPython Alignment:

def to_coordinates():
    coords = np.append(np.zeros((nrow, 1)), lens * (1 - bits), axis=1).cumsum(axis=1)
def from_coordinates():
    gaps = (diff := np.diff(coords)) == 0
    lens = diff[gaps.argmin(axis=0), np.arange(ncol)]

Limitation and alternatives

The suggested solution is highly efficient when the number of sequences is small. However, its memory efficiency deleteriotes when there are many aligned sequences, because the segments (in each of which all sequences must have a unique gap status) will become highly fragmental.

  • Note: Even if all segments are 1, the memory consumption of the suggested solution (which is bits) is still at most 1/8 of the tabular MSA (which is bytes).

Separate RLE

In such scenarios, separate run-length encodings for individual sequences is a more memory efficient data structure. For example, the alignment:

seq1: CGGTCGTAACGCGTA---CA
seq2: CAG--GTAAG-CATACCTCA
seq3: CGGTCGTCAC-TGTACACTA

Can be represented as three vectors, in each of which numbers represent lengths of chars, gaps, chars, gaps, ...:

[15, 3, 2]
[3, 2, 5, 1, 9]
[10, 1, 9]

This data structure can be efficiently generated using the following code:

[np.diff(np.where(np.diff(x == gap_char, prepend=np.nan))[0], append=msa.shape[1]) for x in msa]

This data structure, despite memory efficient, may not support the efficient segment-based operations as discussed above. Therefore, it is only good for storage. Whether scikit-bio's target applications require storing large alignments in the memory is to be discussed.

Tree-guided states

If all sequences in an alignment can be related using a tree structure, then one can have one bit array at each node to represent the entire alignment path. This idea is aligned with the established algorithms for solving the multiple sequence alignment problem, particularly progressive alignment. These bit arrays are guaranteed to be not fragmental. Such a data structure may simultaneously suffice efficient memory and efficient operations (but with limitations).

I haven't prototyped this data structure. Because it is too specific, I doubt if it is the desired data structure for scikit-bio.

Summary

The new alignment path class AlignPath, structured as run-length encoding (RLE) plus bit array, will permit time and memory efficiency in applications, and serve as a central hub for conversion between various data structures.

@mortonjt
Copy link
Collaborator

mortonjt commented Apr 17, 2024

A couple of thoughts on this

Progressive alignment : it is an interesting thought to define an alignment method that itself takes alignments as input. To circumvent the runtime issues that we discussed, I wonder if is possible decouple the scoring scheme from the DP procedure itself. With normal alignment procedures, you can pass in the blosum matrix ahead of time. I wonder if you could pass in a distance matrix so that you wouldn't have to compute the match scores within the DP procedure. See below (calling this new alignment structure BitMSA, since it seems to be a bit representation of TabularMSA

def align(x : BitMSA, y : BitMSA):
        dm = dissimilarity_matrix(x, y)  # compute match scores ahead of time
        return smith_waterman(x, y, blosum=dm)

I do really like this idea, since it could potentially be used to perform progressive alignment on a very large scale. There are going to be quite a few applications that could be realized with this new general data structure, so I think it is definitely worth fleshing out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants