Protein-Transformer

Implement, train, tune, and evaluate a transformer model for antibody classification with this step-by-step code.


Read on Medium · Read on Substack · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

This project provides a step-by-step guide to implementing a transformer model for protein data, covering training, hyperparameter tuning, and evaluation.

Highlights

  • Hands-on Transformer Implementation: Follow along with code to build a transformer-based antibody classifier.
  • Optimize Performance: Explore hyperparameter tuning techniques to improve the model's accuracy.
  • Evaluation: Assess the model's generalization ability and gain insights into its performance on a hold-out test dataset.
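
The highlights above map onto a small PyTorch model. Below is a minimal sketch of what a transformer-based antibody classifier can look like, wired up with the default hyperparameters from the train.py table later in this README; the class and module names are illustrative (not the repo's actual code), and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn

class AntibodyClassifier(nn.Module):
    """Sketch of a transformer encoder for binary antibody classification."""

    def __init__(self, vocab_size=25, embedding_dim=64, num_layers=8,
                 num_heads=2, ffn_dim=128, dropout=0.05, num_classes=2):
        super().__init__()
        # Token id 0 is reserved for padding in this sketch.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads,
            dim_feedforward=ffn_dim, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(embedding_dim, num_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.encoder(self.embedding(tokens))  # (batch, seq_len, embedding_dim)
        x = x.mean(dim=1)                         # mean-pool over sequence positions
        return self.head(x)                       # (batch, num_classes) logits

model = AntibodyClassifier()
logits = model(torch.randint(1, 25, (4, 32)))     # 4 sequences of length 32
print(logits.shape)  # torch.Size([4, 2])
```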

(back to top)

Built With

  • Python
  • PyTorch
  • Ray
  • scikit-learn
  • Pandas
  • NumPy
  • Typer

(back to top)

Getting Started

Clone the repo:

git clone https://github.com/naity/protein-transformer.git

Prerequisites

The requirements.txt file lists the Python packages required to run the scripts. Install them with:

pip install -r requirements.txt

(back to top)

Usage

In this project, we will implement, train, optimize, and evaluate a transformer-based model for antibody classification. The data has been preprocessed and formatted as a binary classification problem with a balanced number of samples in each class. The processed datasets are stored in the data/ directory: bcr_train.parquet is used for training and tuning, while bcr_test.parquet is the hold-out test dataset. For details on the preprocessing steps, please refer to the notebooks/bcr_preprocessing.ipynb notebook.
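
The parquet files store amino-acid sequences alongside binary labels, so before training each sequence must be mapped to integer tokens. A minimal sketch of one plausible encoding scheme (the repo's actual vocabulary, special tokens, and padding strategy may differ):

```python
# Illustrative amino-acid vocabulary: the 20 standard residues; 0 is padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_IDS = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode(sequence, max_len=64):
    """Map a protein sequence to fixed-length integer tokens, padding with 0."""
    ids = [TOKEN_IDS[aa] for aa in sequence[:max_len]]
    return ids + [0] * (max_len - len(ids))

tokens = encode("EVQLVESGGGLVQPGG")  # start of a typical antibody heavy chain
print(tokens[:5])  # [4, 18, 14, 10, 18]
```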

1. Running the train.py Script

See the table below for key parameters when running the train.py script. For a full list of options, run:

python protein_transformer/train.py --help

Parameter        Description                                               Default
--run-id         Unique name for the training run                          None (Required)
--dataset-loc    Path to the dataset in parquet format                     None (Required)
--val-size       Proportion of the dataset for validation                  0.15
--embedding-dim  Dimensionality of token embeddings                        64
--num-layers     Number of Transformer encoder layers                      8
--num-heads      Number of attention heads in the encoder                  2
--ffn-dim        Dimensionality of the feed-forward layer in the encoder   128
--dropout        Dropout probability for regularization                    0.05
--batch-size     Number of samples per batch for each worker               32
--lr             Learning rate for the optimizer                           2e-5
--num-epochs     Number of epochs for training                             20

For example, to execute the training script with default parameters and store the results under a run ID named train01, use the following command:

python protein_transformer/train.py --run-id train01 --dataset-loc data/bcr_train.parquet

Upon completion, the script stores training results in the runs/train01 directory by default. This includes the model arguments, the best-performing model (selected by validation loss), training and validation loss records, and validation metrics for each epoch. The metrics, saved in runs/train01/results.csv, include the following:

Accuracy: 0.727
AUC score: 0.851
Precision: 0.734
Recall: 0.727
F1-score: 0.725
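
Metrics of this kind can be computed with scikit-learn, which the project builds on. A sketch on dummy predictions, not the repo's actual evaluation code (the averaging mode is an assumption):

```python
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score, f1_score)

# Dummy labels, predicted classes, and predicted probabilities for class 1.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.3, 0.8, 0.6]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"AUC score: {roc_auc_score(y_true, y_prob):.3f}")
print(f"Precision: {precision_score(y_true, y_pred, average='weighted'):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred, average='weighted'):.3f}")
print(f"F1-score:  {f1_score(y_true, y_pred, average='weighted'):.3f}")
```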

2. Running the tune.py Script

See the table below for key parameters when running the tune.py script. For a full list of options, run:

python protein_transformer/tune.py --help

Parameter        Description                                        Default
--run-id         Unique name for the hyperparameter tuning run      None (Required)
--dataset-loc    Absolute path to the dataset in parquet format     None (Required)
--val-size       Proportion of the dataset for validation           0.15
--num-classes    Number of final output dimensions                  2
--batch-size     Number of samples per batch for each worker        32
--num-epochs     Number of epochs for training (per trial)          30
--num-samples    Number of trials for tuning                        100
--gpu-per-trial  Number of GPUs to allocate per trial               0.2
  • Note: The --dataset-loc parameter must be specified as an absolute path.

For example, to initiate the tuning process with default parameters and store the results under a run ID named tune01, execute the tune.py script from the project root directory:

python protein_transformer/tune.py --run-id tune01 --dataset-loc /home/ytian/github/protein-transformer/data/bcr_train.parquet

By default, it will execute 100 trials with different parameter combinations, running each trial for up to 30 epochs. Ray Tune stops unpromising trials early, allowing efficient exploration of the hyperparameter space and focusing resources on better-performing configurations. It tracks the results of each trial, and upon completion, the best-performing model (by validation loss) is saved in the runs/tune01 directory by default. Tuning logs, including the results of each trial, are stored in the same runs/tune01 directory for easy access and analysis.
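
A Ray Tune run of this shape pairs a search space with an early-stopping scheduler. A sketch of such a configuration, assuming the ASHA scheduler; the search space shown here is hypothetical, built from the hyperparameters exposed by train.py, and may not match the repo's actual choices:

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

# Hypothetical search space over hyperparameters exposed by train.py.
search_space = {
    "embedding_dim": tune.choice([32, 64, 128]),
    "num_layers": tune.choice([2, 4, 8]),
    "num_heads": tune.choice([2, 4]),
    "dropout": tune.uniform(0.0, 0.3),
    "lr": tune.loguniform(1e-5, 1e-3),
}

# ASHA terminates unpromising trials early, capping each trial at 30 epochs.
scheduler = ASHAScheduler(metric="val_loss", mode="min",
                          max_t=30, grace_period=3)
```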

3. Running the evaluate.py Script

See the table below for key parameters when running the evaluate.py script. For a full list of options, run:

python protein_transformer/evaluate.py --help

Parameter      Description                                                 Default
--run-dir      Path to the output directory for a training or tuning run   None (Required)
--dataset-loc  Path to the test dataset in parquet format                  None (Required)
--batch-size   Number of samples per batch                                 64

For example, to evaluate the best model from the tune01 run on the hold-out test dataset, execute the following command from the command line:

python protein_transformer/evaluate.py --run-dir runs/tune01 --dataset-loc /home/ytian/github/protein-transformer/data/bcr_test.parquet

Upon completion, the script saves the test metrics to a file named test_metrics.json inside the run directory passed to evaluate.py, for example:

Accuracy: 0.761
AUC score: 0.837
Precision: 0.761
Recall: 0.761
F1-score: 0.761
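
Since test_metrics.json is plain JSON, downstream analysis can read it directly. A sketch that writes an illustrative file and reads it back; the key names are an assumption and may not match the script's actual output:

```python
import json
from pathlib import Path

run_dir = Path("runs/tune01")  # directory passed via --run-dir
run_dir.mkdir(parents=True, exist_ok=True)

# Illustrative contents; evaluate.py writes the real values.
sample = {"accuracy": 0.761, "auc": 0.837, "precision": 0.761,
          "recall": 0.761, "f1": 0.761}
(run_dir / "test_metrics.json").write_text(json.dumps(sample, indent=2))

metrics = json.loads((run_dir / "test_metrics.json").read_text())
print(f"Test accuracy: {metrics['accuracy']:.3f}")  # Test accuracy: 0.761
```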

(back to top)

Roadmap

  • Data Processing
  • Model Implementation
  • Training
  • Hyperparameter Tuning
  • Evaluation

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the Apache License. See LICENSE.txt for more information.

(back to top)

Contact

ytiancompbio · @yuan_tian

(back to top)

Acknowledgments

(back to top)