ZairaChem: Automated ML-based (Q)SAR

ZairaChem is the first library of Ersilia's family of tools devoted to providing out-of-the-box machine learning solutions for biomedical problems. In this case, we have focused on (Q)SAR models. (Q)SAR models take chemical structures as input and give as output predicted properties, typically pharmacological properties such as bioactivity against a certain target.

Both Ersilia and Zaira are cities described in Italo Calvino's book 'Invisible Cities' (1972). Ersilia is a "trading city" where inhabitants stretch strings from the corners of the houses to establish the relationships that sustain the life of the city. When the strings become too numerous, they rebuild Ersilia elsewhere, and their network of relationships remains. Zaira is a "city of memories". It contains its own past written in every corner, scratched in every pole, window and bannister.

Installation

Clone the repository to your local system:

git clone https://github.com/ersilia-os/zaira-chem.git
cd zaira-chem

From the terminal, run the installation script:

bash install_linux.sh

By default, a Conda environment named zairachem will be created. Activate it:

conda activate zairachem

Usage

ZairaChem is run from the command line. To learn more about the ZairaChem commands, run the help command:

zairachem --help

Quick start

ZairaChem expects a comma- or tab-separated file containing two columns: a "smiles" column with the molecules in SMILES format and an "activity" column with the activity values.
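As an illustration of the expected layout, the following minimal sketch writes a small comma-separated file with the two required columns. The molecules, the activity labels, the file name my_input.csv, and the helper write_input_file are all made up for this example and are not part of ZairaChem.

```python
import csv

# Hypothetical molecules with made-up activity labels (illustration only).
rows = [
    ("CCO", 0),
    ("c1ccccc1", 1),
    ("CC(=O)Oc1ccccc1C(=O)O", 0),
]


def write_input_file(path, rows):
    """Write a two-column CSV with the 'smiles' and 'activity' headers."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["smiles", "activity"])
        writer.writerows(rows)


# The file name is an assumption chosen for this example.
write_input_file("my_input.csv", rows)
```

A file written this way has "smiles,activity" as its first line, followed by one molecule per line.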

To get started, let's load an example classification task from Therapeutic Data Commons.

zairachem example --file_name input.csv

This file can be split into train and test sets.

zairachem split -i input.csv

The command above will generate two files in your working directory, named train.csv and test.csv. By default, the train:test ratio is 80:20.
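To make the 80:20 ratio concrete, here is a minimal sketch of a plain random split. It assumes a simple shuffle, which may differ from the strategy ZairaChem actually uses. The function split_rows, the seed, and the placeholder rows are hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and cut them into a train part and a test part."""
    shuffled = list(rows)
    # A seeded generator keeps the shuffle reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers stand in for SMILES strings (illustration only).
rows = [("mol_%d" % i, i % 2) for i in range(10)]
train, test = split_rows(rows)
print(len(train), len(test))  # 8 2
```

With 10 input rows, 8 end up in the train part and 2 in the test part.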

Fit

You can train a model as follows:

zairachem fit -i train.csv -m model

This command runs the full ZairaChem pipeline and produces a model folder with processed data, model checkpoints, and reports. If no cut-off is specified for a classification task, ZairaChem establishes an internal cut-off to separate Category 0 from Category 1. The output always reports the probability of a molecule being Category 1. Alternatively, you can set your preferred cut-off with the following command:

zairachem fit -i train.csv -c 0.1 -d low -m model

Here, '-c' sets the cut-off on the activity values and '-d' specifies the direction. If '-d' is set to 'low', values <= c are labeled 1; if it is set to 'high', values >= c are labeled 1.
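The cut-off rule can be sketched in a few lines of Python. This only illustrates the rule as described above and is not ZairaChem's own code; the function binarize and the activity values are hypothetical.

```python
def binarize(values, cutoff, direction):
    """Label each activity value as 1 or 0 according to the cut-off rule."""
    if direction == "low":
        # Values at or below the cut-off are labeled 1.
        return [1 if v <= cutoff else 0 for v in values]
    if direction == "high":
        # Values at or above the cut-off are labeled 1.
        return [1 if v >= cutoff else 0 for v in values]
    raise ValueError("direction must be 'low' or 'high'")


activities = [0.05, 0.1, 0.5, 2.0]  # made-up activity values
print(binarize(activities, 0.1, "low"))   # [1, 1, 0, 0]
print(binarize(activities, 0.1, "high"))  # [0, 1, 1, 1]
```

Note that a value exactly equal to the cut-off is labeled 1 in both directions.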

Predict

You can then run predictions on the test set:

zairachem predict -i test.csv -m model -o test

ZairaChem will run predictions using the checkpoints stored in model and store results in the test directory. Several performance plots will be generated alongside prediction outputs.

Additional Information

For further technical details, please read the ZairaChem page of the Ersilia gitbook, which describes each major step in the ZairaChem pipeline. The corresponding publication for the ZairaChem pipeline is available here.

Citation

If you use ZairaChem, please cite us:

@article{Turon2023,
  author = {Turon, G. and Hlozek, J. and Woodland, J.G. and others},
  title = {First fully-automated AI/ML virtual screening cascade implemented at a drug discovery centre in Africa},
  journal = {Nat Commun},
  volume = {14},
  pages = {5736},
  year = {2023},
  doi = {10.1038/s41467-023-41512-2},
  url = {https://doi.org/10.1038/s41467-023-41512-2}
}

About us

Learn about the Ersilia Open Source Initiative!
