Efficient sequence alignment path data structure #2011

mataton · 2024-04-18T16:24:14Z

This PR will implement the sequence alignment path data structure as outlined in issue #1974.

Please complete the following checklist:

I have read the contribution guidelines.
I have documented all public-facing changes in the changelog.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.
- It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.
This pull request does not include code, documentation, or other content derived from external source(s).

Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

codecov · 2024-04-18T16:30:44Z

Codecov Report

Attention: Patch coverage is 97.75281%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 98.47%. Comparing base (541a930) to head (cdcf229).
Report is 1 commit behind head on main.

Files	Patch %	Lines
skbio/alignment/_path.py	95.47%	8 Missing and 1 partial ⚠️
skbio/alignment/_tabular_msa.py	90.90%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2011      +/-   ##
==========================================
- Coverage   98.48%   98.47%   -0.02%     
==========================================
  Files         182      184       +2     
  Lines       31057    31502     +445     
  Branches     7563     7673     +110     
==========================================
+ Hits        30586    31021     +435     
- Misses        455      464       +9     
- Partials       16       17       +1

qiyunzhu · 2024-05-22T05:55:06Z

skbio/alignment/_path.py

+                    cigar.append(str(length) + "D")
+                    idx1 += length
+                elif state == 3:
+                    cigar.append(str(length) + "P")


There is a shorter way to replace the code between lines 332 and 354:

    if state == 0:
        match_arr = ...
    else:
        cigar.append(str(length) + codes[state])
        flipped = state ^ 3
        idx1 += flipped & 1 and length
        idx2 += flipped & 2 and length

This method uses indexing and bitwise operations to avoid repeating cigar.append and idx += for each condition. It is not necessarily faster than the current solution, nor is it more elegant (bitwise operations are hard to understand!). Therefore, it is only for reference.
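For readers who want to try the idea, here is a minimal standalone sketch of the lookup-plus-bitwise approach. The function name `segments_to_cigar`, the `codes` string "MIDP", and emitting "M" for state 0 (instead of building `match_arr`) are assumptions for illustration; this is not the code in `_path.py`.

```python
def segments_to_cigar(lengths, states, codes="MIDP"):
    """Build a CIGAR string and count positions consumed in each sequence.

    Assumed pairwise encoding: bit 0 set = gap in sequence 1,
    bit 1 set = gap in sequence 2 (so 1 -> "I", 2 -> "D", 3 -> "P").
    """
    cigar = []
    idx1 = idx2 = 0
    for length, state in zip(lengths, states):
        # One lookup replaces a chain of if/elif branches.
        cigar.append(str(length) + codes[state])
        # Flip both bits: a set bit now marks a sequence that has a character.
        flipped = state ^ 3
        # `x and length` evaluates to 0 when x is 0, else to length.
        idx1 += flipped & 1 and length
        idx2 += flipped & 2 and length
    return "".join(cigar), idx1, idx2


cigar, idx1, idx2 = segments_to_cigar([3, 2, 1, 4], [0, 1, 2, 0])
print(cigar, idx1, idx2)
```

The `and` trick relies on Python's short-circuit evaluation returning one of its operands, which is compact but arguably less readable than explicit branches.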

skbio/alignment/_path.py

mortonjt

My comments have been addressed

qiyunzhu · 2024-05-24T19:31:32Z

skbio/alignment/_path.py

            indices = np.asarray(indices)
            return cls.from_bits(
-                indices == gap, [x[np.argmax(x != gap)] for x in indices]
+                indices == gap,
+                indices[np.arange(indices.shape[0]), np.argmax(indices != gap, axis=1)],


Here, indices is compared against gap twice. You can further optimize by doing it only once.

    bits = indices == gap
    starts = indices[np.arange(indices.shape[0]), np.argmin(bits, axis=1)]
    return cls.from_bits(bits, starts)
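A small NumPy check that the single-comparison form selects the same values as the two-comparison form. The gap value of -1 and the example matrix are made up for illustration.

```python
import numpy as np

gap = -1  # assumed gap marker, for illustration only
indices = np.array([
    [0, 1, -1, 2],
    [-1, -1, 5, 6],
    [3, 4, 5, -1],
])
rows = np.arange(indices.shape[0])

# Single comparison: True wherever there is a gap.
bits = indices == gap
# argmin on a boolean array returns the first False, i.e. the first non-gap column.
starts = indices[rows, np.argmin(bits, axis=1)]

# Two-comparison form from the diff, for reference.
starts_old = indices[rows, np.argmax(indices != gap, axis=1)]

print(starts, np.array_equal(starts, starts_old))
```

If a row consists entirely of gaps, both `argmin` and `argmax` return 0, so the two forms still agree and both pick the gap value itself.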

skbio/alignment/tests/test_path.py

qiyunzhu · 2024-05-29T06:55:11Z

skbio/alignment/_path.py

+        >>> from skbio.alignment import AlignPath
+        >>> path = AlignPath(lengths=[1, 2, 2, 1],
+        ...                  states=[0, 5, 2, 6],
+        ...                  gaps=[0, 0, 0])


Should gaps be starts?
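To make the arguments in this docstring example concrete, here is a NumPy-only sketch that expands `lengths` and `states` into a per-position gap matrix. It assumes that bit i of each state flags a gap in sequence i (consistent with the pairwise bit tricks discussed earlier in this thread) and that the third argument has one entry per sequence; it does not call scikit-bio itself.

```python
import numpy as np

lengths = np.array([1, 2, 2, 1])
states = np.array([0, 5, 2, 6], dtype=np.uint8)
n_seqs = 3  # one entry per sequence in the third argument, [0, 0, 0]

# Assumed layout: bit i of a state is 1 when sequence i has a gap in that segment.
seg_bits = (states[None, :] >> np.arange(n_seqs)[:, None]) & 1

# Repeat each segment column by its length: one column per alignment position.
bits = np.repeat(seg_bits, lengths, axis=1)

print(bits)
```

Under that assumption, state 5 (binary 101) marks gaps in the first and third sequences, state 2 (binary 010) in the second, and state 6 (binary 110) in the second and third.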

qiyunzhu · 2024-05-29T17:27:05Z

@mataton Thanks! Let's merge.

initial commit

a02511e

mataton added 28 commits April 18, 2024 09:31

add path function to tabularMSA and create util file

233d663

Add _util.py file

e58f266

Add basic to_cigar and from_cigar functions

656ef92

Start unit tests for PairAlignPath

29ff70c

Add ability to handle = or X to from_cigar function

7e00f4c

Add code attribution for part taken from SO

1279233

Remove print statement

c197be1

Initial version of handling match vs mismatch for to_cigar

f79479a

Split encoding into separate function

8f13744

Change input name

2a8b439

Add test data

169cb22

Start on unit tests

47ff552

Update fix_arrays function

762af17

Change from np.nan to 0 for append in fix_arrays

859750e

Numpy version of run_length_encode

020620b

Fix fix_arrays function

0eaa650

Enable from_cigar to handle strings with or without ones

31da14a

Add error handling and tests for from_bits in PairAlign

d5d9252

PairAlignPath fully covered

7488c37

Expand unit tests

61327ca

Test more than 8 seqs for from_bits

5f38384

Merge branch 'main' into alignment

8d6c66a

To_indices tests

d907935

Complete coverage for to_indices

61b8ce6

Full coverage

1a5d7b9

Merge branch 'main' into alignment

271954e

Update init file

24ddce8

Added non default gap character handling to from_tabular

4f28f68

Update CHANGELOG

93bdd8a

qiyunzhu reviewed May 21, 2024

Address most recent comments

8e1e22a

qiyunzhu reviewed May 22, 2024

Move mapping and switch to unsigned int for starts

52e0118

mortonjt approved these changes May 22, 2024

mataton added 4 commits May 22, 2024 13:04

Rename mapping and codes

e2e1eb6

Create class properties for states, starts, lengths, and shapes

99f05cf

Switch ValueError to TypeError where appropriate

e5d6cf1

Paired programming additions

a8d2b8d

qiyunzhu reviewed May 23, 2024

mataton added 2 commits May 24, 2024 11:38

Update to/from_indices functionality to handle starts

91ba504

Update to/from_coordinates functionality to handle starts

8e6c8c0

qiyunzhu reviewed May 24, 2024

mataton added 9 commits May 24, 2024 13:46

Lint tests

9e4330a

Remove unused import

b9059a0

Start on docstring examples

05db713

AlignPath docstring

98c2a6d

Add examples to docstrings

78e873c

Add example text

8f0544d

Merge branch 'main' into alignment

f5b7e5a

More examples

ca030b3

Final examples

3652358

qiyunzhu reviewed May 29, 2024

mataton added 2 commits May 29, 2024 09:50

Change gaps to starts

fd78816

Fix doctests

cdcf229

qiyunzhu merged commit c1d6c18 into scikit-bio:main May 29, 2024
29 checks passed

mataton deleted the alignment branch June 4, 2024 22:51
