Module to auto evaluate FLIP datasets via bio-trainer

J-SNACKKB/autoeval

AutoEval

This repository contains AutoEval, a module for fast and easy evaluation of FLIP benchmarking tasks. It uses biotrainer to train the task-specific models, and bio-embeddings or custom embedders to embed proteins.

Basic usage is a single command:

python run-autoeval.py scl_mixed_soft residues_to_class ./results --embedder prottrans_t5_xl_u50

where

  • scl_mixed_soft indicates the task and the split to be evaluated,
  • residues_to_class is the protocol used for the task,
  • ./results is the output directory,
  • and --embedder prottrans_t5_xl_u50 is the embedder from bio-embeddings to be used.

The different options are summarized below.

Installation and running

  1. Make sure you have poetry installed:
curl -sSL https://install.python-poetry.org/ | python3 - --version 1.4.2
  2. Install dependencies and biotrainer via poetry:
# In the base directory:
poetry install
# Optional: Add bio-embeddings to compute embeddings
poetry install --extras "bio-embeddings"
# You can also install all extras at once
poetry install --all-extras

To run AutoEval:

  • with Poetry:
# Option 1:
poetry run autoeval DATASET_SPLIT PROTOCOL WORKING_DIR [...]

# Option 2:
autoeval DATASET_SPLIT PROTOCOL WORKING_DIR [...]

The provided run-autoeval.py script can also be used.

  • with Docker:
# Build
docker build -t autoeval .
# Run
docker run --rm \
    -v "$(pwd)/examples/docker":/mnt \
    -v bio_embeddings_weights_cache:/root/.cache/bio_embeddings \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    biotrainer:latest /mnt/config.yml

Options

| Parameter | Usage |
|---|---|
| split | Name of the split, e.g. aav_des_mut. The available splits are listed at the end of this file. |
| protocol | Task-specific training protocol to use, from those available in biotrainer: residue_to_class, residues_to_class, sequence_to_class and sequence_to_value. |
| working_dir | Path to the working directory. |
| -e / --embedder | Embedder to use if different from the one in the default configuration. It can be one of those available in bio-embeddings, e.g. esm1b, or a custom embedder (see details here). |
| -f / --embeddingsfile | Path to a file containing precomputed embeddings, if available. |
| -m / --model | Model to use if different from the one in the default configuration. It should be one of those available in biotrainer, e.g. FNN or CNN. |
| -c / --config | Config file to use instead of the one provided in configsbank for the indicated split. |
| -mins / --minsize | Only use proteins of at least the given minimum length. |
| -maxs / --maxsize | Only use proteins of at most the given maximum length. |
| -mask / --mask | If set, use the masks in the file mask.fasta from the split to filter the residues. It also accepts a path to a different masks file. |
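
When AutoEval is launched from another script, the options above can be assembled programmatically. The following is a minimal sketch, not part of AutoEval: build_autoeval_command is a hypothetical helper, it assumes the autoeval entry point is on the PATH, and the value 1000 for --maxsize is only illustrative. The --mask flag is left out because it can be given with or without a path.

```python
import shlex
from typing import List, Optional


def build_autoeval_command(
    split: str,
    protocol: str,
    working_dir: str,
    embedder: Optional[str] = None,
    embeddings_file: Optional[str] = None,
    model: Optional[str] = None,
    config: Optional[str] = None,
    min_size: Optional[int] = None,
    max_size: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for the autoeval CLI (hypothetical helper)."""
    # Positional arguments, in the order shown in the usage line.
    cmd = ["autoeval", split, protocol, working_dir]
    # Optional flags are appended only when a value was given.
    optional = [
        ("--embedder", embedder),
        ("--embeddingsfile", embeddings_file),
        ("--model", model),
        ("--config", config),
        ("--minsize", min_size),
        ("--maxsize", max_size),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_autoeval_command(
    "scl_mixed_soft", "residues_to_class", "./results",
    embedder="prottrans_t5_xl_u50",
    max_size=1000,  # illustrative value, not a recommended setting
)
# Print the command as a single shell-safe string.
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues with paths that contain spaces.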

Default configurations

For every task, the configuration from the original reference is used by default (defined in the configsbank folder). To use a different one, either change the input arguments of AutoEval, or copy and modify the provided configuration and pass it with --config NEW_CONFIG.yml.

| Dataset | Type of task | Recommended pLM embeddings | Recommended model | Reference | Available in configsbank |
|---|---|---|---|---|---|
| AAV | sequence_to_value | - | FNN | [Dallago 2021] | ⚠️ |
| GB1 | sequence_to_value | - | FNN | [Dallago 2021] | ⚠️ |
| Meltome | sequence_to_value | - | FNN | [Dallago 2021] | ⚠️ |
| SCL | residues_to_class | ProtT5 (ProtT5-XL-UniRef50) | LightAttention | [Stärk 2021] | |
| Bind | residue_to_class | ProtT5 (ProtT5-XL-UniRef50) | CNN | [Littmann 2021] | |
| SAV | sequence_to_class | ProtT5 (ProtT5-XL-U50) | FNN | [Marquet 2021] | ⚠️ |
| Secondary Structure | residue_to_class | ProtT5 (ProtT5-XL-U50) | CNN | - | |
| Conservation | residue_to_class | ProtT5 (ProtT5-XL-U50) | CNN | [Marquet 2021] | |

Availability semaphore:

  • ✅: The configuration available in configsbank is as close as possible to the best configuration in the reference.
  • ⚠️: The best configuration is not possible due to, e.g., a lack of features in biotrainer. The best possible alternative is the one available.
  • ❌: Not available in configsbank. Some cases can still be used, at the user's own responsibility.
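
Because residue_to_class and residues_to_class differ by a single letter, it can help to check the protocol against the table before starting a run. The sketch below copies the "Type of task" and "Recommended model" columns into a dictionary. It is an illustration only: the names RECOMMENDED and protocol_matches are hypothetical, the keys are the dataset prefixes used in split names, and AutoEval itself may perform its own validation.

```python
# Recommended protocol and model per dataset, copied from the table above.
# Keys are the dataset prefixes used in split names (e.g. "scl" in scl_mixed_soft).
RECOMMENDED = {
    "aav": {"protocol": "sequence_to_value", "model": "FNN"},
    "gb1": {"protocol": "sequence_to_value", "model": "FNN"},
    "meltome": {"protocol": "sequence_to_value", "model": "FNN"},
    "scl": {"protocol": "residues_to_class", "model": "LightAttention"},
    "bind": {"protocol": "residue_to_class", "model": "CNN"},
    "sav": {"protocol": "sequence_to_class", "model": "FNN"},
    "secondary_structure": {"protocol": "residue_to_class", "model": "CNN"},
    "conservation": {"protocol": "residue_to_class", "model": "CNN"},
}


def protocol_matches(dataset: str, protocol: str) -> bool:
    """Return True if `protocol` is the one listed for `dataset` in the table."""
    if dataset not in RECOMMENDED:
        raise KeyError(f"Unknown dataset: {dataset!r}")
    return RECOMMENDED[dataset]["protocol"] == protocol


# SCL is a per-protein task, so it uses residues_to_class (plural).
print(protocol_matches("scl", "residues_to_class"))
# Bind is a per-residue task, so residues_to_class would be a mismatch.
print(protocol_matches("bind", "residues_to_class"))
```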

Available splits

Splits are referenced with the pattern dataset_split. For example, the split seven_vs_many of the dataset aav is referenced as aav_seven_vs_many.

| Dataset | Splits |
|---|---|
| AAV (aav_*) | des_mut, mut_des, one_vs_many, two_vs_many, seven_vs_many, low_vs_high, sampled |
| Meltome (meltome_*) | mixed_split, human, human_cell |
| GB1 (gb1_*) | one_vs_rest, two_vs_rest, three_vs_rest, low_vs_high, sampled |
| SCL (scl_*) | mixed_soft, mixed_hard, human_soft, human_hard, balanced, mixed_vs_human_2 |
| Bind (bind_*) | one_vs_many, two_vs_many, from_publication, one_vs_sm, one_vs_mn, one_vs_sn |
| SAV (sav_*) | mixed, human, only_savs |
| Secondary Structure (secondary_structure_*) | sampled |
| Conservation (conservation_*) | sampled |
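
The dataset_split pattern can be checked before a run with a small lookup. This is a minimal sketch based only on the table above. SPLITS and parse_split_name are hypothetical names, not part of AutoEval, and the list needs updating if splits are added.

```python
from typing import Tuple

# Dataset prefixes and their splits, copied from the table above.
SPLITS = {
    "aav": ["des_mut", "mut_des", "one_vs_many", "two_vs_many",
            "seven_vs_many", "low_vs_high", "sampled"],
    "meltome": ["mixed_split", "human", "human_cell"],
    "gb1": ["one_vs_rest", "two_vs_rest", "three_vs_rest",
            "low_vs_high", "sampled"],
    "scl": ["mixed_soft", "mixed_hard", "human_soft", "human_hard",
            "balanced", "mixed_vs_human_2"],
    "bind": ["one_vs_many", "two_vs_many", "from_publication",
             "one_vs_sm", "one_vs_mn", "one_vs_sn"],
    "sav": ["mixed", "human", "only_savs"],
    "secondary_structure": ["sampled"],
    "conservation": ["sampled"],
}


def parse_split_name(name: str) -> Tuple[str, str]:
    """Split a name like 'aav_seven_vs_many' into (dataset, split)."""
    # Dataset and split names both contain underscores, so match against
    # the known dataset prefixes instead of splitting on "_".
    for dataset in sorted(SPLITS, key=len, reverse=True):
        prefix = dataset + "_"
        if name.startswith(prefix):
            split = name[len(prefix):]
            if split in SPLITS[dataset]:
                return dataset, split
            raise ValueError(f"Unknown split {split!r} for dataset {dataset!r}")
    raise ValueError(f"Unknown dataset in split name {name!r}")


print(parse_split_name("aav_seven_vs_many"))
print(parse_split_name("secondary_structure_sampled"))
```

Matching on known prefixes rather than on the first underscore is what lets a name such as secondary_structure_sampled resolve correctly.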
