-
Notifications
You must be signed in to change notification settings - Fork 267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Protein Vector object #2021
Protein Vector object #2021
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2021 +/- ##
==========================================
- Coverage 98.59% 98.48% -0.12%
==========================================
Files 182 182
Lines 30885 31059 +174
Branches 7512 7563 +51
==========================================
+ Hits 30452 30588 +136
- Misses 418 455 +37
- Partials 15 16 +1 ☔ View full report in Codecov by Sentry. |
I wonder if codecov doesn't like it when there are multiple files with the same name (i.e. If so, then the same logic would apply to |
@@ -246,6 +246,7 @@ class of Python's standard library. The goal of a ``sniffer`` is two-fold: to | |||
GFF3FormatError, | |||
EMBLFormatError, | |||
BIOMFormatError, | |||
EmbedFormatError, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out that this wasn't even being imported -- hinting that we don't have tests that verify any of these errors.
Raising an issue, since this could be a blind spot when creating new file formats.
Waiting on feedback regarding codecov. Otherwise, this pull request is ready for review. Major changes since #2008
|
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Hi all, thanks for the comments. After carefully scrutinizing the type signatures, I realized that there is an edge case that could complicate the incorporating of non-sequence vector objects (i.e. molecular vectors). I've created an intermediate class called I realize that this is complicating the hierarchy : |
@@ -7,10 +7,28 @@ | |||
# ---------------------------------------------------------------------------- | |||
from skbio.sequence import Protein |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should have an empty line above.
@@ -36,6 +54,25 @@ class ProteinEmbedding(SequenceEmbedding): | |||
-------- | |||
Protein | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This empty line is not needed.
has definites: True | ||
has stops: False | ||
-------------------------- | ||
0 ACDEFGHIKL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should have an empty line before the end of docstring.
@@ -7,10 +7,28 @@ | |||
# ---------------------------------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the naming convention in scikit-bio, the name of this script file can be changed from _embedding.py
into _base.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily. There is no _base.py
in the stats folder for instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. tree
and sequence
folders also don't have _base.py
. I tend to suggest that the codebase structure should be uniformized. But this can be left to a future discussion.
skbio/embedding/_embedding.py
Outdated
return data | ||
|
||
|
||
def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about to distances
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
distances
is a good idea!
skbio/embedding/_embedding.py
Outdated
embedding_vectors : iterable of EmbeddingVector objects | ||
An iterable of EmbeddingVector objects, or objects that | ||
subclass EmbeddingVector. | ||
metric : str, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metric
can be upgraded to support a user function. But it can be postponed to a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already supported, since beta_diversity
allows for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know. Given that, you may update the docstring to reflect that.
|
||
@property | ||
def embedding(self): | ||
r""" The embedding tensor. """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "tensor" the right term here? I thought that a tensor is basically an array that can be stored in GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, tensor is the correct term. Terms are supersets of matrices / vectors
skbio/embedding/_embedding.py
Outdated
|
||
def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean', | ||
validate=True): | ||
r""" Convert an iterable of EmbeddingVector objects to a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First sentence of a docstring should not exceed one line.
@mortonjt I see the logic of One thing to note, which is somehow relevant but shouldn't affect the design here: scikit-bio's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mortonjt I added some extra comments. I think I have completed a pass of the PR. Thank you!
regex_match, | ||
f"{dim_name} dimension: {shape}\n{indent}has gaps", | ||
) | ||
return rstr | ||
|
||
|
||
class Embedding(SkbioObject): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring of __int__
should be moved under the parent class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the parent class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes -- I was reading the entire _embedding.py script and noticed this.
skbio/embedding/_embedding.py
Outdated
See Also | ||
-------- | ||
Protein |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line may be skbio.sequence.Protein
, otherwise Sphinx won't render a hyperlink for it.
# protein sequence | ||
Protein(sequence) | ||
|
||
sequence = _validate_protein(sequence) | ||
super(ProteinEmbedding, self).__init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just do super().__init__(...)
. Same for other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does not work for me. I'm keeping this as is.
@@ -49,7 +67,13 @@ def __init__(self, embedding, ids, **kwargs): | |||
self._ids = np.array(ids) | |||
|
|||
def __str__(self): | |||
raise NotImplemented | |||
raise NotImplementedError("This method should be implemented by subclasses.") | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be nice to have a shape
property of the Embedding
class. But this can be done after the PR is merged.
skbio/embedding/_embedding.py
Outdated
eigvals = eigvals, | ||
proportion_explained = eigvals / eigvals.sum(), | ||
samples=pd.DataFrame( | ||
u * np.sqrt(s), index=[str(sv) for sv in embedding_vectors]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to double-check: Is u * np.sqrt(s)
the correct math here? I thought it should be u.dot(np.diag(s))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are math here is fine because diag
here is equivalent to column multiplication.
Its honestly a matter of personal preference. There are many version of SVD - either allocate eigvals to u, to v or split them in half. I personally like to split them in half for the sake of drawing biplots. But I'll go ahead and use your recommendation, since it is a more common usecase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thank you! I only knew the most basic form. If you want to adopt this, please also check that features
is correct. Btw, this case further suggests that we should have a pca
function added to the ordination
module, just to make the functionality complete and avoid re-write the current code.
@@ -302,7 +303,7 @@ class of Python's standard library. The goal of a ``sniffer`` is two-fold: to | |||
import_module("skbio.io.format.taxdump") | |||
import_module("skbio.io.format.sample_metadata") | |||
import_module("skbio.io.format.biom") | |||
import_module("skbio.io.format.embedding") | |||
import_module("skbio.io.format.embed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale of renaming embedding
and embed
. I recall that you mentioned somewhere but I forgot. "embed" is a verb in any situation, like "align", "ordinate", etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reduce confusion. Otherwise we have two different modules named the same thing.
Also .embed
is an easier file extension compared to embedding
@@ -13,6 +13,7 @@ | |||
:toctree: generated/ | |||
ProteinEmbedding | |||
ProteinVector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can consider adding the classes to skbio/__init__.py
, such that they can be used as from skbio import ProteinVector
, like other skbio classes do. That being said, I am not a big fan of that, because it causes confusion in hierarchy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not do this at this point. I agree with you, it could be confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @qiyunzhu comments should be addressed. Lets see how the tests behave
regex_match, | ||
f"{dim_name} dimension: {shape}\n{indent}has gaps", | ||
) | ||
return rstr | ||
|
||
|
||
class Embedding(SkbioObject): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the parent class?
|
||
def bytes(self): | ||
r""" Bytes representation of string encoding""" | ||
seq = np.frombuffer(str(self).encode("ascii"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is already addressed @wasade
|
||
@property | ||
def embedding(self): | ||
r""" The embedding tensor. """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, tensor is the correct term. Terms are supersets of matrices / vectors
skbio/embedding/_embedding.py
Outdated
return data | ||
|
||
|
||
def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about to distances
?
skbio/embedding/_embedding.py
Outdated
embedding_vectors : iterable of EmbeddingVector objects | ||
An iterable of EmbeddingVector objects, or objects that | ||
subclass EmbeddingVector. | ||
metric : str, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already supported, since beta_diversity
allows for this
skbio/embedding/_embedding.py
Outdated
eigvals = eigvals, | ||
proportion_explained = eigvals / eigvals.sum(), | ||
samples=pd.DataFrame( | ||
u * np.sqrt(s), index=[str(sv) for sv in embedding_vectors]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are math here is fine because diag
here is equivalent to column multiplication.
Its honestly a matter of personal preference. There are many version of SVD - either allocate eigvals to u, to v or split them in half. I personally like to split them in half for the sake of drawing biplots. But I'll go ahead and use your recommendation, since it is a more common usecase.
# protein sequence | ||
Protein(sequence) | ||
|
||
sequence = _validate_protein(sequence) | ||
super(ProteinEmbedding, self).__init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does not work for me. I'm keeping this as is.
@@ -302,7 +303,7 @@ class of Python's standard library. The goal of a ``sniffer`` is two-fold: to | |||
import_module("skbio.io.format.taxdump") | |||
import_module("skbio.io.format.sample_metadata") | |||
import_module("skbio.io.format.biom") | |||
import_module("skbio.io.format.embedding") | |||
import_module("skbio.io.format.embed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reduce confusion. Otherwise we have two different modules named the same thing.
Also .embed
is an easier file extension compared to embedding
skbio/embedding/_embedding.py
Outdated
|
||
@property | ||
def embedding(self): | ||
r""" The embedding tensor. """ | ||
return self._embedding.reshape(1, -1) | ||
return np.expand_dims(self._embedding, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverting this, since the previous version was correct
Everything passes! Shall we merge? @mortonjt |
Please complete the following checklist:
I have read the contribution guidelines.
I have documented all public-facing changes in the changelog.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the
licenses
directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.This pull request does not include code, documentation, or other content derived from external source(s).
Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.
There are a few linting issues that need to be addressed -- the major modification here was updating the embedding format to handle both protein embeddings and protein vectors (vectors here will allow for fast queries), in addition to conversion utilities discussed in #2006
This is the last pull request that would need top be merged into skbio before we start putting together the tutorials.