Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protein Vector object #2021

Merged
merged 19 commits into from
May 24, 2024
Merged

Protein Vector object #2021

merged 19 commits into from
May 24, 2024

Conversation

mortonjt
Copy link
Collaborator

@mortonjt mortonjt commented May 6, 2024

Please complete the following checklist:

  • I have read the contribution guidelines.

  • I have documented all public-facing changes in the changelog.

  • This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.

    • It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.
  • This pull request does not include code, documentation, or other content derived from external source(s).

Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

There are a few linting issues that need to be addressed -- the major modification here was updating the embedding format to handle both protein embeddings and protein vectors (vectors here will allow for fast queries), in addition to conversion utilities discussed in #2006

This is the last pull request that would need top be merged into skbio before we start putting together the tutorials.

Copy link

codecov bot commented May 6, 2024

Codecov Report

Attention: Patch coverage is 77.31481% with 49 lines in your changes are missing coverage. Please review.

Project coverage is 98.48%. Comparing base (e3a27f5) to head (a9e03cb).
Report is 1 commits behind head on main.

Current head a9e03cb differs from pull request most recent head df9d010

Please upload reports for the commit df9d010 to get more accurate results.

Files Patch % Lines
skbio/embedding/_embedding.py 55.12% 35 Missing ⚠️
skbio/io/format/embed.py 83.63% 8 Missing and 1 partial ⚠️
skbio/embedding/_protein.py 89.65% 3 Missing ⚠️
skbio/io/format/tests/test_embed.py 96.22% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2021      +/-   ##
==========================================
- Coverage   98.59%   98.48%   -0.12%     
==========================================
  Files         182      182              
  Lines       30885    31059     +174     
  Branches     7512     7563      +51     
==========================================
+ Hits        30452    30588     +136     
- Misses        418      455      +37     
- Partials       15       16       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mortonjt
Copy link
Collaborator Author

I wonder if codecov doesn't like it when there are multiple files with the same name (i.e. skbio.io.format._embedding.py and skbio.embedding._embedding). Renaming these files to confirm.

If so, then the same logic would apply to skbio.embedding._protein since there is also skbio.sequence._protein.

@@ -246,6 +246,7 @@ class of Python's standard library. The goal of a ``sniffer`` is two-fold: to
GFF3FormatError,
EMBLFormatError,
BIOMFormatError,
EmbedFormatError,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turned out that this wasn't even being imported -- hinting that we don't have tests that verify any of these errors.
Raising an issue, since this could be a blind spot when creating new file formats.

@mortonjt mortonjt changed the title WIP : Protein Vector object Protein Vector object May 13, 2024
@mortonjt
Copy link
Collaborator Author

Waiting on feedback regarding codecov. Otherwise, this pull request is ready for review.

Major changes since #2008

  1. Adding in support for ProteinVector objects. This will allow for consuming the output of pLMs like TMvec for rapid remote homology search and light-weight protein function annotation. Furthermore, there are a number of conversion mechanisms in place to allow for easy manipulation of these objects.
  2. Renamed skbio.io.format.embedding -> skbio.io.format.embed for consistency. It also helps distinguish between the file format embed and the object embedding
  3. We defined a bytes() ABC method to further simplify hdf5 conversion. That way, every class inherienting from Embedding already has bytes() defined and can write to HDF5
  4. We added additional flexibility to the HDF5 format to allow for more relaxes string length requirements. In the previous PR, the string length was required to be equal to the embedding length (which is the case with the ProteinEmbedding object). But this is not the case with the ProteinVector object and other potential downstream objects (i.e. MolecularEmbedding objects). That being said, we have enabled backwards compatibility, since the increased flexibility slightly increases the HDF5 memory footprint (by adding extra pointers).
  5. Fixed exception import that was missed in the previous pull request

skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
skbio/embedding/tests/test_embedding.py Show resolved Hide resolved
skbio/io/format/embed.py Show resolved Hide resolved
skbio/io/format/embed.py Outdated Show resolved Hide resolved
skbio/io/format/embed.py Outdated Show resolved Hide resolved
skbio/io/format/embed.py Outdated Show resolved Hide resolved
mortonjt and others added 4 commits May 14, 2024 12:50
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>
skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
skbio/embedding/_protein.py Outdated Show resolved Hide resolved
@mortonjt
Copy link
Collaborator Author

Hi all, thanks for the comments.

After carefully scrutinizing the type signatures, I realized that there is an edge case that could complicate the incorporating of non-sequence vector objects (i.e. molecular vectors).

I've created an intermediate class called EmbeddingVector that primarily serves to (1) address this future edge case and (2) enabling better type checking of vector objects (i.e. checking to make sure that all conversion methods accept iterables of objects that subclass EmbeddingVector).

I realize that this is complicating the hierarchy : Embedding -> EmbeddingVector -> SequenceVector -> ProteinVector, but this will be the last change to the hierarchy, and will avoid short-circuiting ourselves if we wanted to add in additional types in the future.

@@ -7,10 +7,28 @@
# ----------------------------------------------------------------------------
from skbio.sequence import Protein
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have an empty line above.

skbio/embedding/_protein.py Outdated Show resolved Hide resolved
skbio/embedding/_protein.py Outdated Show resolved Hide resolved
@@ -36,6 +54,25 @@ class ProteinEmbedding(SequenceEmbedding):
--------
Protein
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This empty line is not needed.

has definites: True
has stops: False
--------------------------
0 ACDEFGHIKL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have an empty line before the end of docstring.

@@ -7,10 +7,28 @@
# ----------------------------------------------------------------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the naming convention in scikit-bio, the name of this script file can be changed from _embedding.py into _base.py.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily. There is no _base.py in the stats folder for instance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. tree and sequence folders also don't have _base.py. I tend to suggest that the codebase structure should be uniformized. But this can be left to a future discussion.

return data


def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function name is bit too long. How about abbreviate distance_matrix as distmat? This should be a commonly used abbreviation. See here and here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about to distances?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distances is a good idea!

embedding_vectors : iterable of EmbeddingVector objects
An iterable of EmbeddingVector objects, or objects that
subclass EmbeddingVector.
metric : str, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metric can be upgraded to support a user function. But it can be postponed to a future PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already supported, since beta_diversity allows for this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know. Given that, you may update the docstring to reflect that.


@property
def embedding(self):
r""" The embedding tensor. """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "tensor" the right term here? I thought that a tensor is basically an array that can be stored in GPU.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, tensor is the correct term. Terms are supersets of matrices / vectors


def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean',
validate=True):
r""" Convert an iterable of EmbeddingVector objects to a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First sentence of a docstring should not exceed one line.

@qiyunzhu
Copy link
Contributor

qiyunzhu commented May 23, 2024

@mortonjt I see the logic of EmbeddingVector and I like that. I wonder if this term should be called "vector embedding" instead? Update: Never mind about this.

One thing to note, which is somehow relevant but shouldn't affect the design here: scikit-bio's Sequence only support ASCII characters, which is a big limitation (e.g., you can't easily represent methylated nucleotides), but also grants performance since all underlying data are np.uint8. In comparison, Biotite's sequence can contain any custom values (any immutable and hashable)

Copy link
Contributor

@qiyunzhu qiyunzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mortonjt I added some extra comments. I think I have completed a pass of the PR. Thank you!

skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
regex_match,
f"{dim_name} dimension: {shape}\n{indent}has gaps",
)
return rstr


class Embedding(SkbioObject):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring of __int__ should be moved under the parent class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the parent class?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes -- I was reading the entire _embedding.py script and noticed this.

See Also
--------
Protein
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line may be skbio.sequence.Protein, otherwise Sphinx won't render a hyperlink for it.

# protein sequence
Protein(sequence)

sequence = _validate_protein(sequence)
super(ProteinEmbedding, self).__init__(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just do super().__init__(...). Same for other cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does not work for me. I'm keeping this as is.

skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
@@ -49,7 +67,13 @@ def __init__(self, embedding, ids, **kwargs):
self._ids = np.array(ids)

def __str__(self):
raise NotImplemented
raise NotImplementedError("This method should be implemented by subclasses.")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be nice to have a shape property of the Embedding class. But this can be done after the PR is merged.

skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
eigvals = eigvals,
proportion_explained = eigvals / eigvals.sum(),
samples=pd.DataFrame(
u * np.sqrt(s), index=[str(sv) for sv in embedding_vectors]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double-check: Is u * np.sqrt(s) the correct math here? I thought it should be u.dot(np.diag(s)).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are math here is fine because diag here is equivalent to column multiplication.

Its honestly a matter of personal preference. There are many version of SVD - either allocate eigvals to u, to v or split them in half. I personally like to split them in half for the sake of drawing biplots. But I'll go ahead and use your recommendation, since it is a more common usecase.

Copy link
Contributor

@qiyunzhu qiyunzhu May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thank you! I only knew the most basic form. If you want to adopt this, please also check that features is correct. Btw, this case further suggests that we should have a pca function added to the ordination module, just to make the functionality complete and avoid re-write the current code.

@@ -302,7 +303,7 @@ class of Python's standard library. The goal of a ``sniffer`` is two-fold: to
import_module("skbio.io.format.taxdump")
import_module("skbio.io.format.sample_metadata")
import_module("skbio.io.format.biom")
import_module("skbio.io.format.embedding")
import_module("skbio.io.format.embed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale of renaming embedding and embed. I recall that you mentioned somewhere but I forgot. "embed" is a verb in any situation, like "align", "ordinate", etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce confusion. Otherwise we have two different modules named the same thing.
Also .embed is an easier file extension compared to embedding

@@ -13,6 +13,7 @@
:toctree: generated/
ProteinEmbedding
ProteinVector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can consider adding the classes to skbio/__init__.py, such that they can be used as from skbio import ProteinVector, like other skbio classes do. That being said, I am not a big fan of that, because it causes confusion in hierarchy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not do this at this point. I agree with you, it could be confusing.

Copy link
Collaborator Author

@mortonjt mortonjt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @qiyunzhu comments should be addressed. Lets see how the tests behave

regex_match,
f"{dim_name} dimension: {shape}\n{indent}has gaps",
)
return rstr


class Embedding(SkbioObject):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the parent class?


def bytes(self):
r""" Bytes representation of string encoding"""
seq = np.frombuffer(str(self).encode("ascii"),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is already addressed @wasade


@property
def embedding(self):
r""" The embedding tensor. """
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, tensor is the correct term. Terms are supersets of matrices / vectors

skbio/embedding/_embedding.py Show resolved Hide resolved
skbio/embedding/_embedding.py Outdated Show resolved Hide resolved
return data


def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean',
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about to distances?

embedding_vectors : iterable of EmbeddingVector objects
An iterable of EmbeddingVector objects, or objects that
subclass EmbeddingVector.
metric : str, optional
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already supported, since beta_diversity allows for this

eigvals = eigvals,
proportion_explained = eigvals / eigvals.sum(),
samples=pd.DataFrame(
u * np.sqrt(s), index=[str(sv) for sv in embedding_vectors]),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are math here is fine because diag here is equivalent to column multiplication.

Its honestly a matter of personal preference. There are many version of SVD - either allocate eigvals to u, to v or split them in half. I personally like to split them in half for the sake of drawing biplots. But I'll go ahead and use your recommendation, since it is a more common usecase.

# protein sequence
Protein(sequence)

sequence = _validate_protein(sequence)
super(ProteinEmbedding, self).__init__(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does not work for me. I'm keeping this as is.

@@ -302,7 +303,7 @@ class of Python's standard library. The goal of a ``sniffer`` is two-fold: to
import_module("skbio.io.format.taxdump")
import_module("skbio.io.format.sample_metadata")
import_module("skbio.io.format.biom")
import_module("skbio.io.format.embedding")
import_module("skbio.io.format.embed")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce confusion. Otherwise we have two different modules named the same thing.
Also .embed is an easier file extension compared to embedding


@property
def embedding(self):
r""" The embedding tensor. """
return self._embedding.reshape(1, -1)
return np.expand_dims(self._embedding, 0)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverting this, since the previous version was correct

@qiyunzhu
Copy link
Contributor

Everything passes! Shall we merge? @mortonjt

@qiyunzhu qiyunzhu merged commit 541a930 into scikit-bio:main May 24, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants