Pfam32.0 classifier

Description

This DeepChain app predicts the protein family id of a given sequence. 🧬 🧪 🔍

The app can be found in the DeepChain App hub.

Data

The model was trained on the pfam-32.0 dataset, available through the bio-datasets API:

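from biodatasets import load_dataset

# Load pfam dataset
pfam_dataset = load_dataset("pfam-32.0", force=True)
# Keep only the targets (family_id); the sequence inputs are discarded here
_, y = pfam_dataset.to_npy_arrays(input_names=["sequence"], target_names=["family_id"])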

The dataset contains roughly 1,339k protein sequences, each with the following features:
- sequence - raw sequence feature
- sequence_name - name of the sequence
- split - original train/dev/test split
- family_id - target
- family_accession - accession associated with family_id
There are 17929 unique families, of which only 13071 are present in all splits.
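As an illustration of what "present in all splits" means, here is a minimal sketch in plain Python. The helper `families_in_all_splits` and the toy family ids are hypothetical and not part of bio-datasets; on the real data the inputs would be the family_id values grouped by the split feature.

```python
from typing import Dict, Iterable, Set


def families_in_all_splits(split_to_families: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the family ids that occur in every split."""
    family_sets = [set(fams) for fams in split_to_families.values()]
    return set.intersection(*family_sets) if family_sets else set()


# Toy data (made up): family ids observed in each split.
toy = {
    "train": ["fam_a", "fam_b", "fam_c", "fam_a"],
    "dev": ["fam_a", "fam_c"],
    "test": ["fam_a", "fam_b"],
}
common = families_in_all_splits(toy)
print(sorted(common))
```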
Mean-pooled ProtBert embeddings have been computed for the sequence feature. For compute reasons, embeddings are currently available only for the first 200 000 sequences; the rest will follow very soon.
The embeddings were computed with bio-transformers.
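To make "pooling: mean" concrete, the sketch below averages per-residue vectors into a single fixed-size vector per sequence. It is plain Python for illustration only: `mean_pool` is a hypothetical helper, not the bio-transformers API, and the toy vectors have 3 dimensions where ProtBert embeddings have 1024.

```python
from typing import List


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-size sequence vector."""
    if not residue_embeddings:
        raise ValueError("cannot pool an empty sequence")
    n_residues = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    # Sum each dimension over all residues, then divide by the sequence length.
    return [
        sum(vec[d] for vec in residue_embeddings) / n_residues
        for d in range(dim)
    ]


# Toy example: a 2-residue "protein" with 3-dimensional residue embeddings.
toy_embeddings = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
pooled = mean_pool(toy_embeddings)
print(pooled)
```

Whatever the sequence length, the pooled vector has the same dimension, which is what lets a fixed-input classifier consume it.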
The original dataset can be found here: Pfam32.0, or on Kaggle.

Model

The classifier takes the sequence embedding (a 1024-dimensional vector) as input and passes it through a dense multi-class classifier (an MLP) to predict the protein family id. The model architecture is shown below:

FamilyMLP(
  (_model): Sequential(
    (0): Linear(in_features=1024, out_features=256, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1, inplace=False)
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.1, inplace=False)
    (6): Linear(in_features=256, out_features=num_classes, bias=True)
  )
)
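A module along the following lines would print the structure shown above. This is a sketch assuming PyTorch: the constructor arguments and their defaults are inferred from the printout, and the actual source in the repository may differ. `num_classes=13071` follows the number of final family ids given in this section.

```python
import torch
from torch import nn


class FamilyMLP(nn.Module):
    """Sketch of an MLP matching the printed layer structure."""

    def __init__(self, num_classes: int, input_dim: int = 1024,
                 hidden_dim: int = 256, dropout: float = 0.1) -> None:
        super().__init__()
        self._model = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Returns raw logits, one per family id.
        return self._model(x)


model = FamilyMLP(num_classes=13071)
model.eval()  # disable dropout for inference
with torch.no_grad():
    # Batch of 4 random vectors standing in for real embeddings.
    logits = model(torch.randn(4, 1024))
n_params = sum(p.numel() for p in model.parameters())
print(logits.shape, n_params)
```

With these sizes, most of the parameters sit in the last layer (256 x 13071 weights plus 13071 biases), since the number of classes is much larger than the hidden width.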

For training, we used the provided train/dev split and removed the training samples whose classes are absent from the dev split, which leaves 13071 family ids.
The model was trained on a P100 GPU for 20 epochs and reached an accuracy of 87%.
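The class filtering described above can be sketched as follows. This is an assumption about how such a filter might look, not the repository's code: `filter_by_dev_classes` is a hypothetical helper and the sequences and family ids are toy values.

```python
from typing import List, Tuple

Sample = Tuple[str, str]  # (sequence, family_id)


def filter_by_dev_classes(train: List[Sample], dev: List[Sample]) -> List[Sample]:
    """Keep only training samples whose family_id also occurs in the dev split."""
    dev_families = {family_id for _, family_id in dev}
    return [(seq, fam) for seq, fam in train if fam in dev_families]


# Toy data; sequences and family ids are made up for illustration.
train = [("MKV", "fam_a"), ("GHT", "fam_b"), ("PLL", "fam_c")]
dev = [("MKA", "fam_a"), ("PLI", "fam_c")]

kept = filter_by_dev_classes(train, dev)
print(kept)
```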
Future improvements that should raise accuracy include:
- Higher model capacity
- Weighted loss to account for rare classes
- Better and longer training strategy
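As one possible reading of the weighted-loss idea, the sketch below computes inverse-frequency ("balanced") class weights, so that rare families count more in the loss. The weighting scheme and the helper name `balanced_class_weights` are assumptions chosen for illustration; the README does not say which scheme would be used.

```python
from collections import Counter
from typing import Dict, List


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels (made up): fam_a is common, fam_b is rare.
labels = ["fam_a"] * 6 + ["fam_b"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

In PyTorch, weights like these could be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, as a 1-D tensor ordered by class index.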

App structure

deepchain-app-pfam-32.0
- src/
  - app.py
  - DESCRIPTION.md
  - tags.json
  - Optional: requirements.txt (for extra packages)
- checkpoint/
  - family_model.pt
  - label_encoder.joblib

This app is meant to be deployed on deepchain.bio and was implemented with the following libraries:

- The main deepchain-apps package, available on PyPI.
- The bio-transformers package.
- The bio-datasets package.

Examples

compute_scores() returns one dictionary per input sequence, containing the predicted "protein_family_id":

[
  {
    'protein_family_id': 'PuR_N'
  },
  {
    'protein_family_id': 'Rrf2'
  }
]
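A result of this shape can be produced from classifier logits by taking the highest-scoring class for each sequence and mapping its index back to a family id. The sketch below is a plain-Python illustration under that assumption: `decode_predictions`, the class list, and the logit values are hypothetical, and the app itself presumably relies on the label encoder stored in checkpoint/label_encoder.joblib.

```python
from typing import Dict, List, Sequence


def decode_predictions(logits: Sequence[Sequence[float]],
                       classes: Sequence[str]) -> List[Dict[str, str]]:
    """Map each row of logits to the family id with the highest score."""
    results = []
    for row in logits:
        # Index of the largest logit in this row (argmax).
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"protein_family_id": classes[best]})
    return results


# Hypothetical class order and logits, for illustration only.
classes = ["PuR_N", "Rrf2", "other_family"]
logits = [[2.1, 0.3, -1.0], [0.2, 1.7, 0.4]]
scores = decode_predictions(logits, classes)
print(scores)
```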

Templates

Further information on DeepChain App templates can be found here.

License

Apache License Version 2.0
