This is a DeepChain app that predicts the protein family id of a given sequence. 🧬 🧪 🔍
The app can be found in the DeepChain App hub.
- This app has been trained with the pfam32.0 dataset available with the bio-datasets API:
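from biodatasets import load_dataset

# Load pfam dataset
pfam_dataset = load_dataset("pfam-32.0", force=True)
# Keep only the targets (family ids); the raw sequences are discarded here
_, y = pfam_dataset.to_npy_arrays(input_names=["sequence"], target_names=["family_id"])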
- This dataset contains roughly 1339k protein sequences for which the following features are available:
  - sequence: raw sequence feature
  - sequence_name: name of the sequence
  - split: original train/dev/test split
  - family_id: target
  - family_accession: associated with family_id
- There are 17929 unique families, of which only 13071 are present in all splits.
- For the sequence feature, corresponding ProtBert (pooling: mean) embeddings have been computed. For compute reasons, only the embeddings for the first 200 000 sequences are available. The rest will follow very soon.
- This app used bio-transformers to compute these embeddings.
- The original dataset can be found here: Pfam32.0, or on Kaggle.
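The figure of 13071 families present in all splits comes down to a set intersection over the per-split family ids. The following is a minimal sketch of that computation on toy data; the helper name `families_in_all_splits` and the `(split, family_id)` row layout are assumptions made for illustration and are not part of the bio-datasets API.

```python
from collections import defaultdict


def families_in_all_splits(rows):
    """Return the set of family ids that occur in every split found in `rows`.

    `rows` is an iterable of (split, family_id) pairs (hypothetical layout).
    """
    families_by_split = defaultdict(set)
    for split, family_id in rows:
        families_by_split[split].add(family_id)
    if not families_by_split:
        return set()
    return set.intersection(*families_by_split.values())


# Toy rows, made up for illustration (not taken from Pfam).
toy_rows = [
    ("train", "fam_a"), ("train", "fam_b"), ("train", "fam_c"),
    ("dev", "fam_a"), ("dev", "fam_b"),
    ("test", "fam_a"), ("test", "fam_c"),
]

shared = families_in_all_splits(toy_rows)
all_families = {family_id for _, family_id in toy_rows}
print(len(all_families), len(shared))  # unique families, families shared by all splits
```

On the real dataset, the same idea applied to the split and family_id features should give the two counts quoted above.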
- The classifier takes the sequence embeddings (1024-dim vector) as input and uses a dense multi-class classifier to predict the protein family id. The model architecture can be found below:
FamilyMLP(
  (_model): Sequential(
    (0): Linear(in_features=1024, out_features=256, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.1, inplace=False)
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.1, inplace=False)
    (6): Linear(in_features=256, out_features=num_classes, bias=True)
  )
)
- For training, we used the provided train/dev split and filtered out training samples whose classes were not present in the dev split. This leaves 13071 family ids.
- The model has been trained on a P100 GPU for 20 epochs and reached an accuracy of 87%.
- In the future, we expect to reach better accuracy with improvements such as:
  - Higher model capacity
  - Weighted loss to account for rare classes
  - Better and longer training strategy
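To put "model capacity" in concrete terms, the parameter count of the architecture printed above can be worked out by hand: each Linear layer has in_features × out_features weights plus out_features biases. The sketch below does this arithmetic, assuming num_classes equals the 13071 family ids kept for training; the helper names are illustrative only.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of trainable parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def mlp_params(layer_sizes):
    """Total parameters of an MLP given its consecutive layer widths."""
    return sum(
        linear_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


num_classes = 13071  # assumption: the final number of family ids reported above
sizes = [1024, 256, 256, num_classes]

per_layer = [linear_params(a, b) for a, b in zip(sizes, sizes[1:])]
total = mlp_params(sizes)
print(per_layer, total)  # [262400, 65792, 3359247] 3687439
```

Under this assumption, roughly 91% of the parameters sit in the output layer, so widening the hidden layers is what would change the capacity of the feature-extracting part of the network.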
- deepchain-app-pfam-32.0
  - src/
    - app.py
    - DESCRIPTION.md
    - tags.json
    - Optional: requirements.txt (for extra packages)
  - checkpoint/
    - family_model.pt
    - label_encoder.joblib
This app is meant to be deployed on deepchain.bio and has been implemented thanks to the following libraries:
- The main deepchain-apps package, which can be found on PyPI.
- The bio-transformers package.
- The bio-datasets package.
compute_scores() returns a dictionary for each sequence with the predicted "protein_family_id":
[
  {
    'protein_family_id': 'PuR_N'
  },
  {
    'protein_family_id': 'Rrf2'
  }
]
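As a rough illustration of how such an output could be assembled, the sketch below picks the highest-scoring class per sequence and maps it to a family name. It is not the app's actual code: the real app presumably decodes labels with the encoder stored in checkpoint/label_encoder.joblib, whereas here a plain list of class names, made-up scores, and the helpers `argmax` and `format_predictions` are hypothetical stand-ins.

```python
def argmax(scores):
    """Index of the largest value in a list of scores."""
    return max(range(len(scores)), key=scores.__getitem__)


def format_predictions(class_indices, class_names):
    """Map predicted class indices to one dictionary per sequence."""
    return [{"protein_family_id": class_names[i]} for i in class_indices]


# Hypothetical label set and classifier scores for two sequences.
# "Other" is a made-up class used only to pad the example.
class_names = ["PuR_N", "Rrf2", "Other"]
logits = [[2.1, 0.3, -1.0], [0.2, 1.7, 0.5]]

predictions = format_predictions([argmax(row) for row in logits], class_names)
print(predictions)
# [{'protein_family_id': 'PuR_N'}, {'protein_family_id': 'Rrf2'}]
```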
Further information on DeepChain App templates can be found here.
Apache License Version 2.0