Scripts to calculate fingerprints and similarity matrices for natural product databases.
- Data: Contains the COCONUT database, its preprocessed version and dataset statistics.
- FPs: Contains all calculated fingerprints as numpy arrays, stored via pickle (see the loading sketch after this list).
- Scripts: Contains all scripts and utility functions used to calculate fingerprints and reproduce the results.
- Results: Contains all the results from the analysis.
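As a quick illustration of the FPs layout, the sketch below loads one pickled fingerprint file into memory. The file name `ecfp.pkl` and the assumption that each pickle holds a single numpy array are hypothetical; see the FPs README for the actual file names and array layout.

```python
# Minimal sketch (hypothetical file name): load one pickled fingerprint file.
# See the FPs README for the actual file names and array layout.
import pickle

with open("../FPs/ecfp.pkl", "rb") as f:   # path relative to the Scripts folder
    fps = pickle.load(f)

print(type(fps))                            # expected: numpy.ndarray, one fingerprint per row
print(getattr(fps, "shape", None))
```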
All necessary Python packages can be installed via conda from the environment.yml file.
git clone https://github.com/dahvida/NP_Fingerprints
conda env create --name np_fp --file=environment.yml
conda activate np_fp
cd NP_Fingerprints/Scripts
Additionally, you need to install Java 11 to compute the CDK / jCompoundMapper fingerprints. All scripts must be executed from the Scripts folder. Before running them, make sure to unzip COCONUT_DB.zip in Data and the Java tools (.jar archives) in Scripts/FP_calc.
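If you prefer to script the extraction step, the snippet below is a minimal sketch that assumes you are in the repository root and that the Java tools ship as .zip archives in Scripts/FP_calc; running unzip from a shell works just as well.

```python
# Sketch of the unzip step, run from the repository root.
# Assumes the Java tools are distributed as .zip archives in Scripts/FP_calc.
from pathlib import Path
from zipfile import ZipFile

ZipFile("Data/COCONUT_DB.zip").extractall("Data")
for archive in Path("Scripts/FP_calc").glob("*.zip"):
    ZipFile(archive).extractall(archive.parent)
```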
To clean the raw COCONUT database (accessed from https://github.com/reymond-group/Coconut-TMAP-SVM), use cleanup_script.py. The preprocessed dataset will be saved in Data, along with class and Murcko scaffold statistics:
python3 cleanup_script.py
To calculate a set of fingerprints for the preprocessed version of the COCONUT database, use fp_script.py. You must specify which set of fingerprints to calculate via the --FP_type flag; options are "rdkit", "minhash", "cdk" and "jmap". All fingerprints will be stored as pickled numpy arrays in the FPs folder.
python3 fp_script.py --FP_type rdkit
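If you want all four fingerprint sets, one option is to call the script once per documented --FP_type value; the loop below is a convenience sketch rather than part of the repository.

```python
# Convenience sketch: compute every fingerprint set by calling fp_script.py
# once per documented --FP_type option.
import subprocess

for fp_type in ["rdkit", "minhash", "cdk", "jmap"]:
    subprocess.run(["python3", "fp_script.py", "--FP_type", fp_type], check=True)
```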
To generate the fingerprint correlation matrices, use sim_search_script.py. The script samples 50 batches of 10,000 compounds each to reduce the computational load of running all pairwise comparisons on 129k compounds. The output will be saved in the Results folder. The default arguments are the same as those employed in our study; otherwise, they can be modified via the appropriate flags, e.g.:
python3 sim_search_script.py --sample_size 5000 --n_cores 4
To generate the classification analysis results, use clf_script.py. By default, the script runs 20 Bayesian hyperparameter optimization iterations and evaluates performance on the test set over 5 replicates. As before, these parameters can be modified via the appropriate flags, e.g.:
python3 clf_script.py --opt_iters 10 --n_replicates 50
Finally, to get more information on the arguments that can be passed to each command-line tool and the steps employed in each script, you can use the --help flag:
python3 clf_script.py --help
More details on which fingerprints are available, which results are saved where, and so on can be found in the README file of each folder.
Please refer to the original publication: https://doi.org/10.1186/s13321-024-00830-3