theomeb/deepchain-app-pfam-32.0

Pfam32.0 classifier

Description

This is a DeepChain app that predicts the protein family id of a given sequence. 🧬 🧪 🔍

The app can be found in the DeepChain App hub.

Data

  • This app has been trained with the pfam32.0 dataset available with the bio-datasets API:
# Load pfam dataset
from biodatasets import load_dataset

pfam_dataset = load_dataset("pfam-32.0", force=True)
_, y = pfam_dataset.to_npy_arrays(input_names=["sequence"], target_names=["family_id"])
  • This dataset contains roughly 1339k protein sequences for which the following features are available:

    • sequence - raw sequence feature
    • sequence_name - name of the sequence
    • split - original train/dev/test split
    • family_id - target
    • family_accession - associated to family_id
  • There are 17929 unique families, of which only 13071 are present in all splits.

  • For the sequence feature, corresponding ProtBert (pooling: mean) embeddings have been computed. For compute reasons, only the embeddings for the first 200 000 sequences are available. The rest will follow very soon.

  • This app used bio-transformers to compute these embeddings.

  • The original dataset can be found here: Pfam32.0, or on Kaggle.
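The family counts above (unique families versus families present in all splits) come down to a set intersection over the per-split label sets. A minimal sketch on toy data follows; the helper `families_in_all_splits` and the toy records (including the made-up family `Fam_X`) are hypothetical and are not part of the app or of the bio-datasets API.

```python
from collections import defaultdict


def families_in_all_splits(family_ids, splits):
    """Return the set of family ids that occur in every split."""
    per_split = defaultdict(set)
    for family_id, split in zip(family_ids, splits):
        per_split[split].add(family_id)
    if not per_split:
        return set()
    # Families shared by all splits = intersection of the per-split sets.
    return set.intersection(*per_split.values())


# Toy stand-in for the `family_id` and `split` columns (hypothetical values).
family_ids = ["PuR_N", "PuR_N", "PuR_N", "Rrf2", "Rrf2", "Fam_X"]
splits = ["train", "dev", "test", "train", "dev", "train"]

common = families_in_all_splits(family_ids, splits)
print(len(set(family_ids)), "unique families,", len(common), "present in all splits")
```

On the real dataset, the same function could be applied to the `family_id` and `split` arrays returned by the loader.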

Model

Architecture

  • The classifier takes the sequence embedding (a 1024-dimensional vector) as input and passes it through a dense multi-class classifier to predict the protein family id. The model architecture is shown below:
FamilyMLP(
  (_model): Sequential(
    (0): Linear(in_features=1024, out_features=256, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1, inplace=False)
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.1, inplace=False)
    (6): Linear(in_features=256, out_features=num_classes, bias=True)
  )
)
  • For training, we used the provided train/dev split and filtered out training samples whose classes were not present in the dev split. This leaves 13071 family ids.
  • The model has been trained on a P100 GPU for 20 epochs and reached an accuracy of 87%.
  • In the future, accuracy could be improved with changes such as:
    • Higher model capacity
    • Weighted loss to account for rare classes
    • Better and longer training strategy
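The filtering step and the proposed weighted loss can be sketched on toy labels. This is an illustration only, not the app's training code: `filter_to_dev_classes` and `inverse_frequency_weights` are hypothetical helpers, and inverse-frequency weighting is one common choice assumed here, since the README does not say which weighting is planned.

```python
from collections import Counter


def filter_to_dev_classes(train_samples, dev_labels):
    """Keep only training samples whose label also appears in the dev split."""
    dev_classes = set(dev_labels)
    return [(x, y) for x, y in train_samples if y in dev_classes]


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy data (hypothetical): sequences are replaced by short placeholder strings.
train = [("seq1", "PuR_N"), ("seq2", "PuR_N"), ("seq3", "PuR_N"),
         ("seq4", "Rrf2"), ("seq5", "Fam_X")]
dev_labels = ["PuR_N", "Rrf2"]

kept = filter_to_dev_classes(train, dev_labels)
weights = inverse_frequency_weights([y for _, y in kept])
print(len(kept), weights)
```

Weights of this kind could then be passed, in class-index order, as the `weight` tensor of `torch.nn.CrossEntropyLoss`.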

App structure

  • deepchain-app-pfam-32.0
    • src/
      • app.py
      • DESCRIPTION.md
      • tags.json
      • Optional: requirements.txt (for extra packages)
    • checkpoint/
      • family_model.pt
      • label_encoder.joblib

This app is meant to be deployed on deepchain.bio and has been implemented thanks to the following libraries:

Examples

compute_scores() returns, for each sequence, a dictionary containing the predicted "protein_family_id":

[
  {
    'protein_family_id': 'PuR_N'
  },
  {
    'protein_family_id': 'Rrf2'
  }
]
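
How per-class scores might be turned into this output can be sketched as follows. The app itself presumably relies on the model in family_model.pt and the encoder in label_encoder.joblib; here a plain list of class names and hand-written scores stand in for both, and `format_predictions` is a hypothetical helper rather than the app's actual code.

```python
def format_predictions(scores, class_names):
    """Map each row of per-class scores to a {'protein_family_id': name} dict."""
    results = []
    for row in scores:
        # Index of the highest-scoring class for this sequence.
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"protein_family_id": class_names[best]})
    return results


class_names = ["PuR_N", "Rrf2"]      # stand-in for the label encoder's classes
scores = [[2.1, -0.3], [0.4, 1.7]]   # stand-in for model outputs, one row per sequence

predictions = format_predictions(scores, class_names)
print(predictions)
```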

Templates

Further information on DeepChain App templates can be found here.

License

Apache License Version 2.0
