JosieHong/awesome-mass-spectrometry-ml

Awesome Machine Learning in Small Molecules Mass Spectrometry

Mass spectrometry, also called mass spec, is an analytical technique that is used to measure the mass-to-charge ratio of ions. The results are presented as a mass spectrum, a plot of intensity as a function of the mass-to-charge ratio.

from Wikipedia
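
To make the definition above concrete, here is a minimal sketch of a mass spectrum held as a list of (m/z, intensity) peaks, scaled to its base peak and binned into a fixed-length vector. The function names, the 1.0 m/z bin width, and the example peaks are all illustrative assumptions, not taken from any dataset or tool in this list.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def normalize_to_base_peak(peaks: List[Peak]) -> List[Peak]:
    """Scale intensities so the most intense peak (the base peak) equals 1.0."""
    base = max(intensity for _, intensity in peaks)
    return [(mz, intensity / base) for mz, intensity in peaks]


def bin_spectrum(peaks: List[Peak], max_mz: float = 200.0,
                 bin_width: float = 1.0) -> List[float]:
    """Sum intensities into fixed-width m/z bins, giving a fixed-length vector."""
    n_bins = int(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:  # peaks beyond max_mz are dropped
            vector[index] += intensity
    return vector


# Made-up peaks for illustration only (not from any real compound).
example_peaks = [(41.04, 120.0), (69.07, 480.0), (69.40, 20.0), (121.10, 960.0)]
normalized = normalize_to_base_peak(example_peaks)
vector = bin_spectrum(normalized)
```

Binned vectors like this are one common input format for learning models; many methods instead keep the raw peak list to avoid losing m/z resolution.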

This list is continually updated with machine-learning papers and code related to small-molecule mass spectrometry. Please note that awesome lists are curations of the best, not everything. Contributions are always welcome!

Contents

Databases

(Quantum-related) molecular properties:

  • OC20 & OC22: The Open Catalyst Project focuses on using AI to find new renewable energy storage catalysts, releasing the OC20 and OC22 datasets with 1.3 million molecular relaxations from 260 million DFT calculations for research support.
  • QM9: This dataset includes the computed geometric, energetic, electronic, and thermodynamic properties of 134,000 stable small organic molecules composed of CHONF.
  • GEOM: This dataset features 37 million molecular conformations for over 450,000 molecules, generated using advanced sampling and semi-empirical density functional theory (DFT).
  • MD17 & MD22: The MD22 benchmark dataset includes molecular dynamics trajectories of seven biomolecular and supramolecular systems, with atom counts ranging from 42 to 370, sampled at 400-500 K with 1 fs resolution, and energy and forces calculated using PBE+MBD theory.
  • PCQM4Mv2: PCQM4Mv2 is a quantum chemistry dataset derived from the PubChemQC project, focusing on the ML task of predicting DFT-calculated HOMO-LUMO energy gaps of molecules using their 2D graphs, a significant task due to the expense of obtaining 3D equilibrium structures.
  • MoleculeNet: MoleculeNet is a benchmark for testing machine learning methods on molecular properties. It features over 700,000 compounds from multiple databases, is integrated into the DeepChem package, and evaluates model performance using metrics such as AUC-ROC, AUC-PRC, RMSE, and MAE.
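
The two regression metrics named in the MoleculeNet entry, MAE and RMSE, can be sketched in a few lines. This is a plain-Python illustration with made-up values; it is not the DeepChem implementation, and the function names are assumptions.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: square root of the mean squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Illustrative values only, e.g. a predicted property versus its reference value.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
```

RMSE penalizes large individual errors more heavily than MAE, so the two values diverge when a few predictions are far off.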

MS/MS:

  • NIST23: The NIST MS/MS Library 2023 is a collection of MS/MS spectra and search software. It contains 2,374,064 MS/MS spectra from 399,267 small molecules.
  • MoNA: MoNA currently contains 2,061,612 mass spectral records from experimental and in-silico libraries, as well as from user contributions.
  • GNPS: GNPS is a web-based mass spectrometry ecosystem that aims to be an open-access knowledge base for the community-wide organization and sharing of raw, processed, or annotated fragmentation mass spectrometry data (MS/MS).

Retention time:

  • SMRT: SMRT is an experimentally acquired reverse-phase chromatography retention time dataset covering up to 80,038 small molecules.
  • RepoRT: RepoRT currently contains 373 datasets, 8,809 unique compounds, and 88,325 retention time entries measured on 49 different chromatographic columns using various eluents, flow rates, and temperatures.

Collision cross section:

  • AllCCS: This collection includes more than 5,000 experimental CCS records and approximately 12 million calculated CCS values for over 1.6 million small molecules.

Papers

Survey papers

Small molecular representation learning

Molecular representation learning models are categorized, according to the information embedded in the model, as point-based (or quantum-based), graph-based, and sequence-based methods. Because graph-based methods are so numerous, they are further divided into self-supervised and supervised learning approaches. Note that the difference between point-based (or quantum-based) and graph-based methods is whether bonds (i.e., edges) are included in the encoding.
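
The distinction between the point-based and graph-based views can be sketched on a toy molecule. The coordinates below are approximate, illustrative values for a water-like geometry, and the dictionary layouts and function names are assumptions made for this example; real models use far richer features.

```python
import math
from typing import Dict, List, Tuple

# Toy water-like molecule; coordinates are illustrative values in angstroms.
atoms: List[str] = ["O", "H", "H"]
coords: List[Tuple[float, float, float]] = [
    (0.000, 0.000, 0.000),
    (0.958, 0.000, 0.000),
    (-0.240, 0.927, 0.000),
]
bonds: List[Tuple[int, int]] = [(0, 1), (0, 2)]


def point_encoding(atoms, coords):
    """Point-based view: atom types plus pairwise distances, no bond information."""
    n = len(atoms)
    distances = [[math.dist(coords[i], coords[j]) for j in range(n)]
                 for i in range(n)]
    return {"atoms": atoms, "distances": distances}


def graph_encoding(atoms, bonds):
    """Graph-based view: atoms are nodes and bonds are explicit edges."""
    adjacency: Dict[int, List[int]] = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return {"atoms": atoms, "adjacency": adjacency}


points = point_encoding(atoms, coords)
graph = graph_encoding(atoms, bonds)
```

In the point-based encoding the two hydrogens are related only through their distance, while the graph-based encoding records that each is bonded to the oxygen and not to the other hydrogen.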

Discussions on pre-trained models

Point-based (or quantum-based) methods

Graph-based methods

Self-Supervised Learning:

Supervised Learning

Other Related Works

Sequence-based methods

Mass spectrometry-related properties prediction

Tandem mass spectra prediction

Retention time prediction

Collision cross section prediction

Mass spectra representation learning and matching

Chemical formula prediction from mass spectra

Machine learning in small-molecule chromatography

Mass spectrometry is often coupled with chromatographic techniques, such as GC-MS (gas chromatography-mass spectrometry) or LC-MS (liquid chromatography-mass spectrometry). In these combined techniques, the chromatographic method separates the compounds, and then the mass spectrometer analyzes each separated compound for identification and quantification.

Related awesome lists
