A workflow for metabolite identification and accurate profiling in multidimensional LC-IM-MS-DIA measurements. DOI: 10.5281/zenodo.

EMSL-Computing/PeakDecoder

PeakDecoder

PeakDecoder is a machine learning-based metabolite identification algorithm for multidimensional mass spectrometry measurements incorporating liquid chromatography (LC) and ion mobility spectrometry (IM) separations, and collecting extensive fragmentation spectra with data-independent acquisition (DIA) methods. The algorithm learns to distinguish true co-elution and co-mobility from raw data and calculates metabolite identification error rates.

Workflow

  • Step-1, Feature finding and fragment ion deconvolution: data is processed in untargeted mode (using MS-DIAL) to extract all precursor ion features (MS1) and their respective deconvoluted fragment ions (pseudo MS2) based on co-elution and co-mobility. The alignment (Peak ID matrix, msp format) and all peak lists (txt, centroid) should be exported from MS-DIAL.
  • Step-2, Target and decoy generation: a preliminary training set is generated by using the detected and deconvoluted peak-groups as targets and producing their corresponding decoys.
  • Step-3, Targeted data extraction for training: targeted data extraction is performed (using Skyline) to extract the precursor and fragment ion signals for the training set from all the LC-IM-MS runs and export their XIC metrics. The Skyline report should include the required XIC metrics: area, height, mass error, FWHM (LC), RT, expected RT, expected CCS.
  • Step-4, Machine learning training: an SVM classifier is trained using multiple scores calculated from the XIC metrics of the training set. Before training, filtering for high-quality fragments is applied to keep high-quality peak-groups as targets (i.e., based on various thresholds for metrics of precursor and at least 3 fragments: S/N, mass error, RT difference to precursor, and FWHM difference to precursor) and their corresponding decoys in the final training set. The model learns to distinguish true and false co-elution and co-mobility, independently of the features’ metabolite identity.
  • Step-5, Targeted data extraction for inference: targeted data extraction (TDX) is performed to extract the signals of the query set of metabolites in the library from all the LC-IM-MS runs and export their XIC metrics.
  • Step-6, Machine learning inference: the trained model is used to determine the PeakDecoder score of the query set of metabolites and estimate a false discovery rate (FDR). Results can be filtered using the PeakDecoder score corresponding to the estimated FDR threshold from a table with pairs of values (FDR, PeakDecoder score) automatically generated after training (file PeakDecoder-FDR-thresholds_[dataset].csv).
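
To make the filtering in Step-6 more concrete, the following is a minimal sketch of how a table of (score cutoff, estimated FDR) pairs could be built from target and decoy scores and then used to filter results. It assumes the common target-decoy estimate (decoys passing the cutoff divided by targets passing the cutoff), which may differ from the exact calculation PeakDecoder uses. The function names, the `score` key, and the example values are hypothetical and are not part of the PeakDecoder code or its output files.

```python
def fdr_table(target_scores, decoy_scores):
    """Build (score_cutoff, estimated_fdr) pairs, highest cutoff first.

    Assumption: FDR at a cutoff is estimated as
    (#decoys with score >= cutoff) / (#targets with score >= cutoff).
    """
    table = []
    # Every distinct target score is tried as a candidate cutoff,
    # so at least one target always passes and the ratio is defined.
    for cutoff in sorted(set(target_scores), reverse=True):
        n_targets = sum(s >= cutoff for s in target_scores)
        n_decoys = sum(s >= cutoff for s in decoy_scores)
        table.append((cutoff, min(1.0, n_decoys / n_targets)))
    return table


def score_threshold(table, max_fdr):
    """Return the lowest cutoff whose estimated FDR is <= max_fdr, or None."""
    passing = [cutoff for cutoff, fdr in table if fdr <= max_fdr]
    return min(passing) if passing else None


def filter_hits(hits, threshold):
    """Keep only hits (dicts with a hypothetical 'score' key) at or above the threshold."""
    return [hit for hit in hits if hit["score"] >= threshold]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    targets = [0.95, 0.90, 0.80, 0.60, 0.40]
    decoys = [0.70, 0.50, 0.30, 0.20]
    table = fdr_table(targets, decoys)
    cutoff = score_threshold(table, max_fdr=0.05)
    hits = [{"name": "metabolite_A", "score": 0.92},
            {"name": "metabolite_B", "score": 0.55}]
    print(cutoff, filter_hits(hits, cutoff))
```

In practice the threshold would be read from the generated PeakDecoder-FDR-thresholds_[dataset].csv file rather than recomputed; the sketch only illustrates the relationship between a score cutoff and the estimated FDR.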

Data

The three subfolders contain the input and output files needed to run the PeakDecoder steps for the synthetic biology datasets:

  • Asper: Aspergillus pseudoterreus and Aspergillus niger strains
  • Pput: Pseudomonas putida strains
  • Rhodo: Rhodosporidium toruloides strains

Contact

aivett.bilbao@pnnl.gov

Reference

If you use PeakDecoder or any portion of this code, please cite: Bilbao et al. "PeakDecoder enables machine learning-based metabolite annotation and accurate profiling in multidimensional mass spectrometry measurements". Nature Communications. https://doi.org/10.1038/s41467-023-37031-9
