Protein-Transformer

Implement, train, tune, and evaluate a transformer model for antibody classification with this step-by-step code.


Read on Medium · Read on Substack · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

This project provides a step-by-step guide to implementing a transformer model for protein data, covering training, hyperparameter tuning, and evaluation.

Highlights

  • Hands-on Transformer Implementation: Follow along with code to build a transformer-based antibody classifier.
  • Optimize Performance: Explore hyperparameter tuning techniques to improve the model's accuracy.
  • Evaluation: Assess the model's generalization ability and gain insights into its performance on a hold-out test dataset.
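
The highlights above map onto a small PyTorch model. Below is a minimal sketch of what a transformer-based antibody classifier can look like, wired up with the default hyperparameters from the train.py table later in this README; the class and module names are illustrative (not the repo's actual code), and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn

class AntibodyClassifier(nn.Module):
    """Sketch of a transformer encoder for binary antibody classification."""

    def __init__(self, vocab_size=25, embedding_dim=64, num_layers=8,
                 num_heads=2, ffn_dim=128, dropout=0.05, num_classes=2):
        super().__init__()
        # Token id 0 is reserved for padding in this sketch.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim, nhead=num_heads,
            dim_feedforward=ffn_dim, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(embedding_dim, num_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.encoder(self.embedding(tokens))  # (batch, seq_len, embedding_dim)
        x = x.mean(dim=1)                         # mean-pool over sequence positions
        return self.head(x)                       # (batch, num_classes) logits

model = AntibodyClassifier()
logits = model(torch.randint(1, 25, (4, 32)))     # 4 sequences of length 32
print(logits.shape)  # torch.Size([4, 2])
```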

(back to top)

Built With

  • Python
  • PyTorch
  • Ray
  • scikit-learn
  • Pandas
  • NumPy
  • Typer

(back to top)

Getting Started

Clone the repo:

git clone https://github.com/naity/protein-transformer.git

Prerequisites

The requirements.txt file lists the Python packages required to run the scripts. Install them with:

pip install -r requirements.txt

(back to top)

Usage

In this project, we will implement, train, optimize, and evaluate a transformer-based model for antibody classification. The data has been preprocessed and formatted as a binary classification problem with a balanced number of samples in each class. The processed datasets are stored in the data/ directory: bcr_train.parquet is used for training and tuning, while bcr_test.parquet is the hold-out test dataset. For details on the preprocessing steps, please refer to the notebooks/bcr_preprocessing.ipynb notebook.
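
The parquet files store amino-acid sequences alongside binary labels, so before training each sequence must be mapped to integer tokens. A minimal sketch of one plausible encoding scheme (the repo's actual vocabulary, special tokens, and padding strategy may differ):

```python
# Illustrative amino-acid vocabulary: the 20 standard residues; 0 is padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_IDS = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode(sequence, max_len=64):
    """Map a protein sequence to fixed-length integer tokens, padding with 0."""
    ids = [TOKEN_IDS[aa] for aa in sequence[:max_len]]
    return ids + [0] * (max_len - len(ids))

tokens = encode("EVQLVESGGGLVQPGG")  # start of a typical antibody heavy chain
print(tokens[:5])  # [4, 18, 14, 10, 18]
```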

1. Running the train.py Script

See the table below for key parameters when running the train.py script. For a full list of options, run:

python protein_transformer/train.py --help

Parameter        Description                                               Default
--run-id         Unique name for the training run                          None (Required)
--dataset-loc    Path to the dataset in parquet format                     None (Required)
--val-size       Proportion of the dataset for validation                  0.15
--embedding-dim  Dimensionality of token embeddings                        64
--num-layers     Number of Transformer encoder layers                      8
--num-heads      Number of attention heads in the encoder                  2
--ffn-dim        Dimensionality of the feed-forward layer in the encoder   128
--dropout        Dropout probability for regularization                    0.05
--batch-size     Number of samples per batch for each worker               32
--lr             Learning rate for the optimizer                           2e-5
--num-epochs     Number of epochs for training                             20

For example, to execute the training script with default parameters and store the results under a run ID named train01, use the following command:

python protein_transformer/train.py --run-id train01 --dataset-loc data/bcr_train.parquet

Upon completion, the script stores training results in the runs/train01 directory by default. This includes the model arguments, the best-performing model (selected by validation loss), training and validation loss records, and validation metrics for each epoch. The metrics, saved in runs/train01/results.csv, include the following:

Accuracy: 0.727
AUC score: 0.851
Precision: 0.734
Recall: 0.727
F1-score: 0.725
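
Metrics of this kind can be computed with scikit-learn, which the project builds on. A sketch on dummy predictions, not the repo's actual evaluation code (the averaging mode is an assumption):

```python
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score, f1_score)

# Dummy labels, predicted classes, and predicted probabilities for class 1.
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.3, 0.8, 0.6]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"AUC score: {roc_auc_score(y_true, y_prob):.3f}")
print(f"Precision: {precision_score(y_true, y_pred, average='weighted'):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred, average='weighted'):.3f}")
print(f"F1-score:  {f1_score(y_true, y_pred, average='weighted'):.3f}")
```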

2. Running the tune.py Script

See the table below for key parameters when running the tune.py script. For a full list of options, run:

python protein_transformer/tune.py --help

Parameter        Description                                        Default
--run-id         Unique name for the hyperparameter tuning run      None (Required)
--dataset-loc    Absolute path to the dataset in parquet format     None (Required)
--val-size       Proportion of the dataset for validation           0.15
--num-classes    Number of final output dimensions                  2
--batch-size     Number of samples per batch for each worker        32
--num-epochs     Number of epochs for training (per trial)          30
--num-samples    Number of trials for tuning                        100
--gpu-per-trial  Number of GPUs to allocate per trial               0.2
  • Note: The --dataset-loc parameter must be specified as an absolute path.

For example, to initiate the tuning process with default parameters and store the results under a run ID named tune01, execute the tune.py script from the project root directory:

python protein_transformer/tune.py --run-id tune01 --dataset-loc /home/ytian/github/protein-transformer/data/bcr_train.parquet

By default, it will execute 100 trials with different parameter combinations, running each trial for up to 30 epochs. Ray Tune stops unpromising trials early, allowing efficient exploration of the hyperparameter space and focusing resources on better-performing configurations. It tracks the results of each trial, and upon completion, the best-performing model (by validation loss) is saved in the runs/tune01 directory by default. Tuning logs, including the results of each trial, are stored in the same runs/tune01 directory for easy access and analysis.
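
A Ray Tune run of this shape pairs a search space with an early-stopping scheduler. A sketch of such a configuration, assuming the ASHA scheduler; the search space shown here is hypothetical, built from the hyperparameters exposed by train.py, and may not match the repo's actual choices:

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

# Hypothetical search space over hyperparameters exposed by train.py.
search_space = {
    "embedding_dim": tune.choice([32, 64, 128]),
    "num_layers": tune.choice([2, 4, 8]),
    "num_heads": tune.choice([2, 4]),
    "dropout": tune.uniform(0.0, 0.3),
    "lr": tune.loguniform(1e-5, 1e-3),
}

# ASHA terminates unpromising trials early, capping each trial at 30 epochs.
scheduler = ASHAScheduler(metric="val_loss", mode="min",
                          max_t=30, grace_period=3)
```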

3. Running the evaluate.py Script

See the table below for key parameters when running the evaluate.py script. For a full list of options, run:

python protein_transformer/evaluate.py --help

Parameter      Description                                                 Default
--run-dir      Path to the output directory for a training or tuning run   None (Required)
--dataset-loc  Path to the test dataset in parquet format                  None (Required)
--batch-size   Number of samples per batch                                 64

For example, to evaluate the best model from the tune01 run on the hold-out test dataset, execute the following command from the command line:

python protein_transformer/evaluate.py --run-dir runs/tune01 --dataset-loc /home/ytian/github/protein-transformer/data/bcr_test.parquet

Upon completion, the script saves the test metrics to a file named test_metrics.json inside the run directory passed to evaluate.py, for example:

Accuracy: 0.761
AUC score: 0.837
Precision: 0.761
Recall: 0.761
F1-score: 0.761
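
Since test_metrics.json is plain JSON, downstream analysis can read it directly. A sketch that writes an illustrative file and reads it back; the key names are an assumption and may not match the script's actual output:

```python
import json
from pathlib import Path

run_dir = Path("runs/tune01")  # directory passed via --run-dir
run_dir.mkdir(parents=True, exist_ok=True)

# Illustrative contents; evaluate.py writes the real values.
sample = {"accuracy": 0.761, "auc": 0.837, "precision": 0.761,
          "recall": 0.761, "f1": 0.761}
(run_dir / "test_metrics.json").write_text(json.dumps(sample, indent=2))

metrics = json.loads((run_dir / "test_metrics.json").read_text())
print(f"Test accuracy: {metrics['accuracy']:.3f}")  # Test accuracy: 0.761
```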

(back to top)

Roadmap

  • Data Processing
  • Model Implementation
  • Training
  • Hyperparameter Tuning
  • Evaluation

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

(back to top)

License

Distributed under the Apache License. See LICENSE.txt for more information.

(back to top)

Contact

ytiancompbio · @yuan_tian

(back to top)

Acknowledgments

(back to top)