Protein Vector object #2021

mortonjt · 2024-05-06T06:16:29Z

Please complete the following checklist:

I have read the contribution guidelines.
I have documented all public-facing changes in the changelog.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied.
- It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.
This pull request does not include code, documentation, or other content derived from external source(s).

Note: This document may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

There are a few linting issues that need to be addressed -- the major modification here was updating the embedding format to handle both protein embeddings and protein vectors (vectors here will allow for fast queries), in addition to conversion utilities discussed in #2006

This is the last pull request that would need top be merged into skbio before we start putting together the tutorials.

codecov · 2024-05-06T06:20:37Z

Codecov Report

Attention: Patch coverage is 77.31481% with 49 lines in your changes are missing coverage. Please review.

Project coverage is 98.48%. Comparing base (e3a27f5) to head (a9e03cb).
Report is 1 commits behind head on main.

❗ Current head a9e03cb differs from pull request most recent head df9d010

Please upload reports for the commit df9d010 to get more accurate results.

Files	Patch %	Lines
skbio/embedding/_embedding.py	55.12%	35 Missing ⚠️
skbio/io/format/embed.py	83.63%	8 Missing and 1 partial ⚠️
skbio/embedding/_protein.py	89.65%	3 Missing ⚠️
skbio/io/format/tests/test_embed.py	96.22%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2021      +/-   ##
==========================================
- Coverage   98.59%   98.48%   -0.12%     
==========================================
  Files         182      182              
  Lines       30885    31059     +174     
  Branches     7512     7563      +51     
==========================================
+ Hits        30452    30588     +136     
- Misses        418      455      +37     
- Partials       15       16       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

mortonjt · 2024-05-13T14:36:41Z

I wonder if codecov doesn't like it when there are multiple files with the same name (i.e. skbio.io.format._embedding.py and skbio.embedding._embedding). Renaming these files to confirm.

If so, then the same logic would apply to skbio.embedding._protein since there is also skbio.sequence._protein.

mortonjt · 2024-05-13T14:37:50Z

skbio/io/__init__.py

@@ -246,6 +246,7 @@ class of Python's standard library. The goal of a ``sniffer`` is two-fold: to
    GFF3FormatError,
    EMBLFormatError,
    BIOMFormatError,
+    EmbedFormatError,


It turned out that this wasn't even being imported -- hinting that we don't have tests that verify any of these errors.
Raising an issue, since this could be a blind spot when creating new file formats.

mortonjt · 2024-05-13T15:55:53Z

Waiting on feedback regarding codecov. Otherwise, this pull request is ready for review.

Major changes since #2008

Adding in support for ProteinVector objects. This will allow for consuming the output of pLMs like TMvec for rapid remote homology search and light-weight protein function annotation. Furthermore, there are a number of conversion mechanisms in place to allow for easy manipulation of these objects.
Renamed skbio.io.format.embedding -> skbio.io.format.embed for consistency. It also helps distinguish between the file format embed and the object embedding
We defined a bytes() ABC method to further simplify hdf5 conversion. That way, every class inherienting from Embedding already has bytes() defined and can write to HDF5
We added additional flexibility to the HDF5 format to allow for more relaxes string length requirements. In the previous PR, the string length was required to be equal to the embedding length (which is the case with the ProteinEmbedding object). But this is not the case with the ProteinVector object and other potential downstream objects (i.e. MolecularEmbedding objects). That being said, we have enabled backwards compatibility, since the increased flexibility slightly increases the HDF5 memory footprint (by adding extra pointers).
Fixed exception import that was missed in the previous pull request

skbio/embedding/_embedding.py

skbio/embedding/tests/test_embedding.py

skbio/io/format/embed.py

Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>

skbio/embedding/_embedding.py

skbio/embedding/_protein.py

mortonjt · 2024-05-15T15:20:39Z

Hi all, thanks for the comments.

After carefully scrutinizing the type signatures, I realized that there is an edge case that could complicate the incorporating of non-sequence vector objects (i.e. molecular vectors).

I've created an intermediate class called EmbeddingVector that primarily serves to (1) address this future edge case and (2) enabling better type checking of vector objects (i.e. checking to make sure that all conversion methods accept iterables of objects that subclass EmbeddingVector).

I realize that this is complicating the hierarchy : Embedding -> EmbeddingVector -> SequenceVector -> ProteinVector, but this will be the last change to the hierarchy, and will avoid short-circuiting ourselves if we wanted to add in additional types in the future.

qiyunzhu · 2024-05-23T06:57:08Z

skbio/embedding/_protein.py

@@ -7,10 +7,28 @@
 # ----------------------------------------------------------------------------
 from skbio.sequence import Protein


Should have an empty line above.

skbio/embedding/_protein.py

qiyunzhu · 2024-05-23T07:01:01Z

skbio/embedding/_protein.py

@@ -36,6 +54,25 @@ class ProteinEmbedding(SequenceEmbedding):
    --------
    Protein

+


This empty line is not needed.

qiyunzhu · 2024-05-23T07:01:29Z

skbio/embedding/_protein.py

+        has definites: True
+        has stops: False
+    --------------------------
+    0 ACDEFGHIKL


Should have an empty line before the end of docstring.

qiyunzhu · 2024-05-23T07:16:29Z

skbio/embedding/_embedding.py

@@ -7,10 +7,28 @@
 # ----------------------------------------------------------------------------


Following the naming convention in scikit-bio, the name of this script file can be changed from _embedding.py into _base.py.

Not necessarily. There is no _base.py in the stats folder for instance.

I see. tree and sequence folders also don't have _base.py. I tend to suggest that the codebase structure should be uniformized. But this can be left to a future discussion.

qiyunzhu · 2024-05-23T07:18:50Z

skbio/embedding/_embedding.py

+    return data
+
+
+def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean',


This function name is bit too long. How about abbreviate distance_matrix as distmat? This should be a commonly used abbreviation. See here and here.

how about to distances?

distances is a good idea!

qiyunzhu · 2024-05-23T07:19:54Z

skbio/embedding/_embedding.py

+    embedding_vectors : iterable of EmbeddingVector objects
+        An iterable of EmbeddingVector objects, or objects that
+        subclass EmbeddingVector.
+    metric : str, optional


metric can be upgraded to support a user function. But it can be postponed to a future PR.

This is already supported, since beta_diversity allows for this

Good to know. Given that, you may update the docstring to reflect that.

qiyunzhu · 2024-05-23T07:22:43Z

skbio/embedding/_embedding.py

+
+    @property
+    def embedding(self):
+        r""" The embedding tensor. """


Is "tensor" the right term here? I thought that a tensor is basically an array that can be stored in GPU.

yes, tensor is the correct term. Terms are supersets of matrices / vectors

qiyunzhu · 2024-05-23T07:23:45Z

skbio/embedding/_embedding.py

+
+def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean',
+                                         validate=True):
+    r""" Convert an iterable of EmbeddingVector objects to a


First sentence of a docstring should not exceed one line.

qiyunzhu · 2024-05-23T07:37:21Z

@mortonjt I see the logic of EmbeddingVector and I like that. I wonder if this term should be called "vector embedding" instead? Update: Never mind about this.

One thing to note, which is somehow relevant but shouldn't affect the design here: scikit-bio's Sequence only support ASCII characters, which is a big limitation (e.g., you can't easily represent methylated nucleotides), but also grants performance since all underlying data are np.uint8. In comparison, Biotite's sequence can contain any custom values (any immutable and hashable)

qiyunzhu

@mortonjt I added some extra comments. I think I have completed a pass of the PR. Thank you!

skbio/embedding/_embedding.py

qiyunzhu · 2024-05-24T16:20:52Z

skbio/embedding/_embedding.py

+        regex_match,
+        f"{dim_name} dimension: {shape}\n{indent}has gaps",
+    )
+    return rstr


 class Embedding(SkbioObject):


The docstring of __int__ should be moved under the parent class.

This is the parent class?

Yes -- I was reading the entire _embedding.py script and noticed this.

qiyunzhu · 2024-05-24T16:28:46Z

skbio/embedding/_embedding.py

+
+        See Also
+        --------
+        Protein


This line may be skbio.sequence.Protein, otherwise Sphinx won't render a hyperlink for it.

qiyunzhu · 2024-05-24T16:30:51Z

skbio/embedding/_protein.py

-            # protein sequence
-            Protein(sequence)
-
+        sequence = _validate_protein(sequence)
        super(ProteinEmbedding, self).__init__(


I think you can just do super().__init__(...). Same for other cases.

That does not work for me. I'm keeping this as is.

skbio/embedding/_embedding.py

qiyunzhu · 2024-05-24T16:51:40Z

skbio/embedding/_embedding.py

@@ -49,7 +67,13 @@ def __init__(self, embedding, ids, **kwargs):
        self._ids = np.array(ids)

    def __str__(self):
-        raise NotImplemented
+        raise NotImplementedError("This method should be implemented by subclasses.")
+


It will be nice to have a shape property of the Embedding class. But this can be done after the PR is merged.

skbio/embedding/_embedding.py

qiyunzhu · 2024-05-24T17:10:56Z

skbio/embedding/_embedding.py

+        eigvals = eigvals,
+        proportion_explained = eigvals / eigvals.sum(),
+        samples=pd.DataFrame(
+            u * np.sqrt(s), index=[str(sv) for sv in embedding_vectors]),


Just to double-check: Is u * np.sqrt(s) the correct math here? I thought it should be u.dot(np.diag(s)).

There are math here is fine because diag here is equivalent to column multiplication.

Its honestly a matter of personal preference. There are many version of SVD - either allocate eigvals to u, to v or split them in half. I personally like to split them in half for the sake of drawing biplots. But I'll go ahead and use your recommendation, since it is a more common usecase.

I see. Thank you! I only knew the most basic form. If you want to adopt this, please also check that features is correct. Btw, this case further suggests that we should have a pca function added to the ordination module, just to make the functionality complete and avoid re-write the current code.

qiyunzhu · 2024-05-24T17:15:45Z

skbio/io/__init__.py

@@ -302,7 +303,7 @@ class of Python's standard library. The goal of a ``sniffer`` is two-fold: to
 import_module("skbio.io.format.taxdump")
 import_module("skbio.io.format.sample_metadata")
 import_module("skbio.io.format.biom")
-import_module("skbio.io.format.embedding")
+import_module("skbio.io.format.embed")


What is the rationale of renaming embedding and embed. I recall that you mentioned somewhere but I forgot. "embed" is a verb in any situation, like "align", "ordinate", etc.

To reduce confusion. Otherwise we have two different modules named the same thing.
Also .embed is an easier file extension compared to embedding

qiyunzhu · 2024-05-24T17:17:05Z

skbio/embedding/__init__.py

@@ -13,6 +13,7 @@
    :toctree: generated/

    ProteinEmbedding
+    ProteinVector


Can consider adding the classes to skbio/__init__.py, such that they can be used as from skbio import ProteinVector, like other skbio classes do. That being said, I am not a big fan of that, because it causes confusion in hierarchy.

Let's not do this at this point. I agree with you, it could be confusing.

mortonjt

Thanks @qiyunzhu comments should be addressed. Lets see how the tests behave

mortonjt · 2024-05-24T17:46:30Z

skbio/embedding/_embedding.py

+        regex_match,
+        f"{dim_name} dimension: {shape}\n{indent}has gaps",
+    )
+    return rstr


 class Embedding(SkbioObject):


This is the parent class?

mortonjt · 2024-05-24T17:47:17Z

skbio/embedding/_embedding.py

+
+    def bytes(self):
+        r""" Bytes representation of string encoding"""
+        seq = np.frombuffer(str(self).encode("ascii"),


I believe this is already addressed @wasade

mortonjt · 2024-05-24T17:48:05Z

skbio/embedding/_embedding.py

+
+    @property
+    def embedding(self):
+        r""" The embedding tensor. """


yes, tensor is the correct term. Terms are supersets of matrices / vectors

skbio/embedding/_embedding.py

mortonjt · 2024-05-24T17:53:49Z

skbio/embedding/_embedding.py

+    return data
+
+
+def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean',


how about to distances?

mortonjt · 2024-05-24T17:54:39Z

skbio/embedding/_embedding.py

+    embedding_vectors : iterable of EmbeddingVector objects
+        An iterable of EmbeddingVector objects, or objects that
+        subclass EmbeddingVector.
+    metric : str, optional


This is already supported, since beta_diversity allows for this

mortonjt · 2024-05-24T17:56:57Z

skbio/embedding/_embedding.py

+        eigvals = eigvals,
+        proportion_explained = eigvals / eigvals.sum(),
+        samples=pd.DataFrame(
+            u * np.sqrt(s), index=[str(sv) for sv in embedding_vectors]),


There are math here is fine because diag here is equivalent to column multiplication.

Its honestly a matter of personal preference. There are many version of SVD - either allocate eigvals to u, to v or split them in half. I personally like to split them in half for the sake of drawing biplots. But I'll go ahead and use your recommendation, since it is a more common usecase.

mortonjt · 2024-05-24T17:59:30Z

skbio/embedding/_protein.py

-            # protein sequence
-            Protein(sequence)
-
+        sequence = _validate_protein(sequence)
        super(ProteinEmbedding, self).__init__(


That does not work for me. I'm keeping this as is.

mortonjt · 2024-05-24T18:03:15Z

skbio/io/__init__.py

@@ -302,7 +303,7 @@ class of Python's standard library. The goal of a ``sniffer`` is two-fold: to
 import_module("skbio.io.format.taxdump")
 import_module("skbio.io.format.sample_metadata")
 import_module("skbio.io.format.biom")
-import_module("skbio.io.format.embedding")
+import_module("skbio.io.format.embed")


To reduce confusion. Otherwise we have two different modules named the same thing.
Also .embed is an easier file extension compared to embedding

mortonjt · 2024-05-24T19:06:38Z

skbio/embedding/_embedding.py


    @property
    def embedding(self):
        r""" The embedding tensor. """
-        return self._embedding.reshape(1, -1)
+        return np.expand_dims(self._embedding, 0)


reverting this, since the previous version was correct

qiyunzhu · 2024-05-24T19:56:54Z

Everything passes! Shall we merge? @mortonjt

first iteration of protein vector object

0d592db

mortonjt added 2 commits May 13, 2024 09:59

linter

98516ee

renaming files to see if it improves coverage

b11cf02

mortonjt commented May 13, 2024

View reviewed changes

mortonjt added 3 commits May 13, 2024 11:08

improving coverage

f96cbbc

updating coveragerc to determine if tests are being run

0c8ef78

revert and update docs

a7f49a4

mortonjt changed the title ~~WIP : Protein Vector object~~ Protein Vector object May 13, 2024

revert typo

2d80eeb

mortonjt mentioned this pull request May 13, 2024

Feedback about the new pull request comment codecov/feedback#8

Open

wasade reviewed May 14, 2024

View reviewed changes

mortonjt and others added 4 commits May 14, 2024 12:50

Update skbio/io/format/embed.py

a326780

Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>

Update skbio/io/format/embed.py

7d2b809

Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>

Update skbio/io/format/embed.py

608a1b6

Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>

Update skbio/io/format/embed.py

fba202d

Co-authored-by: Daniel McDonald <d3mcdonald@eng.ucsd.edu>

mortonjt commented May 14, 2024

View reviewed changes

skbio/embedding/_embedding.py Outdated Show resolved Hide resolved

Update skbio/embedding/_embedding.py

5002a50

qiyunzhu reviewed May 14, 2024

View reviewed changes

skbio/embedding/_embedding.py Outdated Show resolved Hide resolved

skbio/embedding/_embedding.py Outdated Show resolved Hide resolved

skbio/embedding/_embedding.py Outdated Show resolved Hide resolved

skbio/embedding/_protein.py Outdated Show resolved Hide resolved

mortonjt added 5 commits May 14, 2024 17:09

address @wasade comments

50d0050

address more comments

ab6fdbe

simplifying validate check

bbe10d7

adding extra check for validation in SequenceVector

c211bbe

adding svd test

a9e03cb

qiyunzhu reviewed May 23, 2024

View reviewed changes

qiyunzhu reviewed May 24, 2024

View reviewed changes

address comments

978244e

mortonjt commented May 24, 2024

View reviewed changes

reverting embedding comment

df9d010

mortonjt commented May 24, 2024

View reviewed changes

qiyunzhu merged commit 541a930 into scikit-bio:main May 24, 2024
27 checks passed

		@@ -7,10 +7,28 @@
		# ----------------------------------------------------------------------------
		from skbio.sequence import Protein

		@@ -36,6 +54,25 @@ class ProteinEmbedding(SequenceEmbedding):
		--------
		Protein

		return data


		def embedding_vectors_to_distance_matrix(embedding_vectors, metric='euclidean',

Protein Vector object #2021

Protein Vector object #2021

Conversation

mortonjt commented May 6, 2024 • edited

codecov bot commented May 6, 2024 • edited

Codecov Report

mortonjt commented May 13, 2024

Choose a reason for hiding this comment

mortonjt commented May 13, 2024

mortonjt commented May 15, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

qiyunzhu commented May 23, 2024 • edited

qiyunzhu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

qiyunzhu May 24, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mortonjt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

qiyunzhu commented May 24, 2024

mortonjt commented May 6, 2024 •

edited

codecov bot commented May 6, 2024 •

edited

qiyunzhu commented May 23, 2024 •

edited

qiyunzhu May 24, 2024 •

edited