-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Efficient sequence alignment path data structure #2011
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2011 +/- ##
==========================================
- Coverage 98.48% 98.47% -0.02%
==========================================
Files 182 184 +2
Lines 31057 31502 +445
Branches 7563 7673 +110
==========================================
+ Hits 30586 31021 +435
- Misses 455 464 +9
- Partials 16 17 +1 ☔ View full report in Codecov by Sentry. |
cigar.append(str(length) + "D") | ||
idx1 += length | ||
elif state == 3: | ||
cigar.append(str(length) + "P") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a shorter may that can replace the code between lines 332 to 354:
if state == 0:
match_arr = ...
else:
cigar.append(str(length) + codes[state])
flipped = state ^ 3
idx1 += flipped & 1 and length
idx2 += flipped & 2 and length
This method uses indexing and bitwise operations to avoid rewriting cigar.append
and idx +=
for each condition. It is not necessarily faster than the current solution, nor is it more elegant (bitwise operation is hard to understand!). Therefore, it is only for reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments have been addressed
indices = np.asarray(indices) | ||
return cls.from_bits( | ||
indices == gap, [x[np.argmax(x != gap)] for x in indices] | ||
indices == gap, | ||
indices[np.arange(indices.shape[0]), np.argmax(indices != gap, axis=1)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, indices
is compared against gap
twice. You can further optimize by doing it only once.
bits = indices == gap
starts = indices[np.arange(indices.shape[0]), np.argmin(bits, axis=1)]
return cls.from_bits(bits, starts)
skbio/alignment/_path.py
Outdated
>>> from skbio.alignment import AlignPath | ||
>>> path = AlignPath(lengths=[1, 2, 2, 1], | ||
... states=[0, 5, 2, 6], | ||
... gaps=[0, 0, 0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should gaps
be starts
?
@mataton Thanks! Let's merge. |
This PR will implement the sequence alignment path data structure as outlined in issue #1974.
Please complete the following checklist:
I have read the contribution guidelines.
I have documented all public-facing changes in the changelog.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the
licenses
directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.This pull request does not include code, documentation, or other content derived from external source(s).
Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.