Pathway-Activity Likelihood Analysis and Metabolite Annotation for Untargeted Metabolomics Using Probabilistic Modeling

Motivation: Untargeted metabolomics comprehensively characterizes small molecules and elucidates activities of biochemical pathways within a biological sample. Despite computational advances, interpreting the collected measurements and determining their biological role remains a challenge.

Results: To interpret measurements, we present an inference-based approach, termed Probabilistic modeling for Untargeted Metabolomics Analysis (PUMA). Our approach captures the metabolomics measurements and the biological network of the sample under study in a generative model and uses stochastic sampling to compute posterior probability distributions. PUMA predicts the likelihood of pathways being active, and then derives probabilistic annotations, which assign chemical identities to measurements. Unlike prior pathway analysis tools that analyze differentially active pathways, PUMA defines a pathway as active if the likelihood that the pathway generated the observed measurements is above a user-defined threshold. Because there are no “ground truth” metabolomics datasets in which all measurements are annotated and pathway activities are known, PUMA is validated on synthetic datasets that are designed to mimic cellular processes. On average, PUMA outperforms pathway enrichment analysis by 8%. PUMA is also applied to two case studies, where it suggests many biologically meaningful pathways as active. Annotation results were in agreement with those obtained using other tools that utilize additional information in the form of spectral signatures. Importantly, PUMA annotates many measurements, suggesting 23 chemical identities for metabolites that were previously only identified as isomers, and a significant number of additional putative annotations over spectral database lookups. For an experimentally validated 50-compound dataset, annotations using PUMA yielded 0.833 precision and 0.676 recall.

Keywords: machine learning; inference; untargeted metabolomics; biological network; metabolic model

Ramtin Hosseini, Neda Hassanpour, Liping Liu, and Soha Hassoun (Ramtin.Hosseini@tufts.edu) "Pathway-Activity Likelihood Analysis and Metabolite Annotation for Untargeted Metabolomics using Probabilistic Modeling", Metabolites 2020, 10, 183.

Link: https://www.mdpi.com/2218-1989/10/5/183

Getting Started

Python 3.5.6 is used for development. We recommend installing packages using Anaconda as follows:
conda env create --name PUMA --file environment.yml
conda activate PUMA

Dataset

Two examples are provided:
CHO_cell: Chinese Hamster Ovary (CHO) cell example provided in the paper
human_urine: A human urine example, see reference in the paper

How to run the code?

Run the code with: python run_puma.py

Methods

Illustrative Example

Illustrative example of uncertainty when mapping measurements to metabolites and pathways. Pathways (ovals) are associated with metabolites (circles), which in turn are associated with measurements (squares). White circles represent non-measured metabolites with membership in one or more pathways. Blue circles represent measured metabolites that have multiple-pathway memberships (multiple-pathway membership is assumed but not shown for j3 and j4). The red circle represents a metabolite that has membership in only one pathway. Measurement w5 uniquely maps to j13, which uniquely maps to Pathway 2, while all other measurements map to multiple metabolites, as shown by solid or dotted lines.

Workflow

Comparison of a workflow to collect and interpret observations (A), and a generative model that captures a biological process (B).

The Generative Model

Graphical representation of the generative model. To avoid drawing all I pathways, J metabolites, and K masses in the graph, we use ‘plate’ notation: one representative node is drawn per variable, and these nodes are enclosed in a plate (rectangular box). The number of instances of each enclosed variable is indicated by the fixed constant in the lower right corner of the box. Random variables of the model (a, z, m, w) are shown as circles. The variable m has a deterministic relationship with z. The shaded circle, labelled w, represents an observed random variable. μ, λ, and γ are parameters of the model.
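The structure in the figure can be illustrated by forward sampling. The sketch below is not taken from the repository: the function name sample_puma_like, the Bernoulli and noisy-OR forms, the roles given to μ, λ, and γ, and the noise parameter are all assumptions chosen to match the caption (a and z random, m deterministic given z, w observed).

```python
import random

def sample_puma_like(mu, lam, gamma, mass_of, n_masses, noise=0.01, rng=None):
    """Draw one sample (a, z, m, w) from a PUMA-style generative model.

    All distributional choices here are assumptions for illustration:
    mu[i][j]   -- probability that active pathway i generates metabolite j
    lam[i]     -- prior probability that pathway i is active
    gamma      -- probability that a mass that is present is observed
    mass_of[j] -- index of the mass bin of metabolite j
    noise      -- hypothetical probability of observing an absent mass
    """
    rng = rng or random.Random()
    n_pathways, n_metabolites = len(mu), len(mu[0])

    # a_i: pathway activity, one Bernoulli draw per pathway
    a = [int(rng.random() < lam[i]) for i in range(n_pathways)]

    # z_j: metabolite presence via a noisy-OR over the active pathways
    z = []
    for j in range(n_metabolites):
        p_absent = 1.0
        for i in range(n_pathways):
            if a[i]:
                p_absent *= 1.0 - mu[i][j]
        z.append(int(rng.random() < 1.0 - p_absent))

    # m_k: deterministic given z, a mass is present if any metabolite
    # in that mass bin is present
    m = [0] * n_masses
    for j, present in enumerate(z):
        if present:
            m[mass_of[j]] = 1

    # w_k: observed measurement, a noisy readout of m_k
    w = [int(rng.random() < (gamma if m[k] else noise)) for k in range(n_masses)]
    return a, z, m, w
```

Inference runs in the opposite direction: given the observed w, stochastic sampling is used to estimate the posterior over a and z. That step is not shown here.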

Results

Model Validation

AUC for ROC curves for the synthetic datasets under different assumptions regarding pathway activity and metabolite generation. (A,C,E) PUMA. (B,D,F) Enrichment ratio.
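For readers unfamiliar with the metric, the following is a minimal sketch of how an AUC can be computed from predicted activity scores and known labels on a synthetic dataset. The function name roc_auc is hypothetical and the pairwise (rank-based) formulation is one standard way to obtain the area under the ROC curve; it is not necessarily how the figures in the paper were produced.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0);
    ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Here scores would be the per-pathway values being compared (for example posterior probabilities or enrichment ratios) and labels the ground-truth pathway activities of the synthetic dataset.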

Case Study: Chinese Hamster Ovary (CHO) Cell

Probabilities of Pathway Activities

Probability of pathway activities as computed by PUMA vs. enrichment ratios for the CHO cell. Each data point is marked as either statistically enriched (red) or not statistically enriched (blue), based on a Fisher’s exact test p-value threshold of 0.05.
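As a point of comparison, the enrichment baseline can be sketched as follows. This README does not state how the enrichment ratio or the test were computed, so the definitions below are assumptions: a one-sided Fisher’s exact test for over-representation (the upper tail of the hypergeometric distribution) and an enrichment ratio defined as the observed hit fraction in the pathway divided by the hit fraction in the whole network. The function and argument names are hypothetical.

```python
from math import comb

def enrichment_p_value(hits_in_pathway, pathway_size, total_hits, universe_size):
    """One-sided Fisher's exact test for over-representation.

    Probability of seeing at least `hits_in_pathway` of the `total_hits`
    putatively detected metabolites inside a pathway of `pathway_size`
    metabolites, when hits are drawn at random from `universe_size`.
    """
    max_k = min(pathway_size, total_hits)
    denom = comb(universe_size, total_hits)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, total_hits - k)
        for k in range(hits_in_pathway, max_k + 1)
    )
    return tail / denom

def enrichment_ratio(hits_in_pathway, pathway_size, total_hits, universe_size):
    """Assumed definition: hit fraction in the pathway over the overall hit fraction."""
    return (hits_in_pathway / pathway_size) / (total_hits / universe_size)

# A pathway would be coloured as statistically enriched when p < 0.05.
p = enrichment_p_value(5, 5, 5, 10)
is_enriched = p < 0.05
```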

Probabilities of Metabolite Annotations

Evaluation of PUMA in Overcoming Uncertainty in Annotation

Experiments on synthetic datasets

PUMA performance in overcoming the uncertainty of multiple assignments. Average recall, precision, and accuracy for different experiments on the synthetic dataset, assuming that a fraction of 0.3 of the pathways is active. The x-axis corresponds to the fraction of active metabolites, and the y-axis shows correctly identified pathways with multiple candidates. (A) Original 𝜏 matrix, mapping metabolites to the corresponding mass bin; (B) 𝜏 matrix modified based on ground truth; (C) 𝜏 matrix modified based on random selection of metabolites.

PUMA performance in overcoming the uncertainty of multiple assignments. Average recall, precision, and accuracy for different experiments on the synthetic dataset, assuming that a fraction of 0.5 of the pathways is active. The x-axis corresponds to the fraction of active metabolites, and the y-axis shows correctly identified pathways with multiple candidates. (A) Original 𝜏 matrix; (B) 𝜏 matrix modified based on ground truth; (C) 𝜏 matrix modified based on random selection of metabolites.

PUMA performance in overcoming the uncertainty of multiple assignments. Average recall, precision, and accuracy for different experiments on the synthetic dataset, assuming that a fraction of 0.7 of the pathways is active. The x-axis corresponds to the fraction of active metabolites, and the y-axis shows correctly identified pathways with multiple candidates. (A) Original 𝜏 matrix; (B) 𝜏 matrix modified based on ground truth; (C) 𝜏 matrix modified based on random selection of metabolites.
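The three metrics reported in these figures can be sketched as below. This is an illustration under stated assumptions, not the evaluation code from the repository: the function names call_present and annotation_metrics are hypothetical, items are called present when their posterior probability reaches a user-chosen threshold, and both the predicted and the true sets are assumed to be subsets of the candidate set.

```python
def call_present(posteriors, threshold=0.5):
    """Return the items whose posterior probability reaches the threshold."""
    return {item for item, prob in posteriors.items() if prob >= threshold}

def annotation_metrics(predicted, truth, candidates):
    """Precision, recall, and accuracy of a predicted set against ground truth.

    `candidates` is the full set of items that could have been predicted;
    `predicted` and `truth` are assumed to be subsets of it.
    """
    predicted, truth, candidates = set(predicted), set(truth), set(candidates)
    tp = len(predicted & truth)               # correctly called present
    fp = len(predicted - truth)               # called present, actually absent
    fn = len(truth - predicted)               # missed
    tn = len(candidates - predicted - truth)  # correctly left out
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(candidates) if candidates else 0.0
    return precision, recall, accuracy
```

For example, posteriors could map each candidate metabolite to its posterior probability of being present, and truth would be the set of metabolites actually generated in the synthetic dataset.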

Case Study: Human Urinary Sample

Probabilities of Pathway Activities

Probability of pathway activities as computed by PUMA vs. enrichment ratios for the human urine sample. Each data point is marked as either statistically enriched (red) or not statistically enriched (blue), based on a Fisher’s exact test p-value threshold of 0.05.
