kjappelbaum/awesome-chemistry-datasets

awesome-chemistry-datasets

Contributions are very welcome - please follow the guidelines and the Code of Conduct.

text datasets

  • BC5CDR: 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions (named entity recognition).
  • BioCreative V: The BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions.
  • BioRxiv XML - Bulk access to the full text of bioRxiv articles for the purposes of text and data mining (TDM) is available via a dedicated Amazon S3 resource.
  • ChemTables: 788 chemical patent tables with labels of their content type. Built for semantic classification of table type. Licensed under CC BY NC 3.0.
  • Elsevier Corpus: A corpus of 40,001 open access (OA) CC-BY articles from across Elsevier’s journals, presented as the first cross-discipline research dataset at this scale to support NLP and ML research.
  • Europe PMC - Bulk download of full text and SI of > 5 million articles.
  • IUPAC Gold Book
  • LibreText: Open-access chemistry textbook.
  • MedRxiv XML - Text and data mining is possible via dedicated Amazon S3 resource.
  • NLM literature archive: NLM LitArch (NLM Literature Archive) is a digital archive for books, documents, and articles in the fields of life science, medicine, and healthcare at the National Institutes of Health. Also accessible via NCBI bookshelf. See also NLMChem, a manually annotated full-text resource on chemicals in the biomedical literature. It contains 150 full-text journal articles, selected to be rich in chemical mentions and to be articles where human annotation was expected to be most valuable.
  • OpenStax: Free textbooks, including Chemistry 2e, which is released under CC-BY 4.0.
  • PubChemSTM: 281K chemical structure and text pairs
  • PubMed central: free full-text archive
  • PubMed: abstracts and outlinks
  • PubMedQA: answer research questions with yes/no/maybe using abstracts (1k expert labeled, 61.2k unlabeled and 211.3k artificially generated QA instances).
  • S2ORC: The Semantic Scholar Open Research Corpus. 81.1M English-language academic papers spanning many academic disciplines (largest publicly-available collection of machine-readable academic text). Released under CC BY-NC 4.0.

structures

  • COCONUT: an open source project for Natural Products (NPs) storage, search, and analysis.
  • Crystallography Open Database: open-access collection of crystal structures of organic, inorganic, metal-organic compounds and minerals, excluding biopolymers. They also derived SMILES for some compounds.
  • Enamine HTS collection: 1 930 980 diverse screening compounds (37 billion molecules in 2D and 4.5 billion in 3D)
  • GDB: enumeration of molecules according to simple (feasibility and stability) rules
  • GNPS: mass spectrometry database with focus on natural products, contains untargeted (unlabelled) data.
  • MoNA: mass spectrometry database of real and predicted spectra for known compounds.
  • nCov-Group Data Repository: SMILES, fingerprints, descriptors, and images of millions of compounds.
  • nmrshiftdb2: a database for organic structures and their nuclear magnetic resonance (NMR) spectra.
  • zinc20: ZINC20 library prepared for Deep Docking-accelerated virtual screening
  • zinc22: commercially-available compounds for virtual screening

molecular activity prediction benchmark datasets

  • MPCD: a benchmark for molecular activity prediction, including 9 low-sample size and narrow-scaffold (LSSNS) inhibitor datasets and 30 higher-sample size and mixed-scaffold (HSSMS) inhibitor datasets. Each dataset is visualised by TMAP.
  • MoleculeACE: a benchmark (the 30 HSSMS datasets in MPCD) for evaluating the predictive performance of machine learning models on activity cliff compounds.
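
Activity cliffs are pairs of compounds that are structurally very similar but differ strongly in potency. The following is a minimal sketch of that idea, not the procedure used by MPCD or MoleculeACE. The similarity threshold (0.9), the potency fold-change threshold (10), the function names, and the toy data are all assumptions chosen for illustration; plain Python sets stand in for real molecular fingerprints.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_cliff_pairs(compounds, sim_threshold=0.9, fold_threshold=10.0):
    """Return name pairs that are similar in structure but far apart in potency.

    compounds: list of (name, feature_set, potency) tuples, where potency is a
    positive number such as an IC50 in nM. Both thresholds are illustrative
    assumptions, not values taken from any benchmark.
    """
    pairs = []
    for (n1, f1, p1), (n2, f2, p2) in combinations(compounds, 2):
        fold_change = max(p1, p2) / min(p1, p2)
        if tanimoto(f1, f2) >= sim_threshold and fold_change >= fold_threshold:
            pairs.append((n1, n2))
    return pairs


# Hypothetical toy data: integer feature sets stand in for fingerprints.
toy = [
    ("cpd_a", set(range(20)), 5.0),
    ("cpd_b", set(range(19)) | {99}, 800.0),
    ("cpd_c", set(range(10)), 6.0),
]
print(find_cliff_pairs(toy))
```

Here only cpd_a and cpd_b share enough features and differ enough in potency to count as a cliff pair under the assumed thresholds.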

ml structure-property benchmark datasets

  • ACNet: a benchmark for Activity Cliff Prediction, 400K Matched Molecular Pairs (MMPs) against 190 targets, including over 20K MMP-cliffs and 380K non-AC MMPs from ChEMBL (version 28).
  • Aquasoldb: Curation of nine open source datasets on aqueous solubility. The authors also assigned reliability groups.
  • BigSolDB: Molecular solubility in organic solvents and water over a wide range of temperatures. It contains 830 unique molecules and 138 unique solvents. Temperatures range from 243.15 K to 403.15 K. Published in this paper.
  • BindingDB: molecular recognition database, contains 2.6M data for 1.1M Compounds and 8.10K Targets (Feb 2023)
  • ChEBI-20: 33,010 molecule-description pairs (for molecule captioning task)
  • ESol: Water solubility data (log solubility in mol per litre) for common organic small molecules.
  • Flashpoint: Sun et al. collected a dataset of the flashpoints of 10575 molecules from academic papers, the Gelest chemical catalogue, the DIPPR database, Lange's Handbook of Chemistry, the Hazardous Chemicals Handbook, and the PubChem database.
  • FreeSolv: Experimental and Calculated Small Molecule Hydration Free Energies
  • Harvard OPV: "experimental photovoltaic data from the literature, and corresponding quantum-chemical calculations performed over a range of geometries, each with quantum chemical results using a variety of density functionals and basis sets"
  • Hydrogen Storage Materials Database: data on hydrogen storage materials (information such as chemical formula and hydrogen capacity)
  • ILThermo: thermodynamic and transport properties of pure ionic liquids and mixtures of them.
  • Leffingwell Odor Dataset: 3523 molecules associated with expert-labeled odor descriptors from the Leffingwell PMP 2001 database
  • Limiting activity coefficients: for different solvent/solute pairs, used to train a SMILES-based transformer.
  • Lipophilicity: Experimental results of octanol/water distribution coefficient (logD at pH 7.4).
  • MD simulated monomer properties: density, cohesive energy, thermal expansion, heat of vaporization, compressibility, radius of gyration, glass transition, and diffusion constant for 410 monomers
  • MoleculeNet - Benchmark suite that contains multiple datasets listed here
  • OCHEM: On Feb 17 2023, OCHEM contained 3774118 records for 689 properties (with at least 50 records) collected from 20609 sources (the user is granted a Creative Commons CC-BY (version 4.0) license to data submitted).
  • Papyrus: A large scale curated dataset aimed at bioactivity predictions. Contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with smaller datasets.
  • Photoswitch Dataset: Curated dataset of 405 photoswitch molecules.
  • QM Datasets: QM7, QM7b, QM8, QM9, MD Trajectories
  • SolProp: Database of 1 million solvent/solute COSMO-RS calculations and 10145 experimental solvation free energies (originally published as part of this paper).
  • SOMAS: Experimental and calculated solubilities for small molecules. Originally proposed for the design of redox-flow batteries.
  • Therapeutic Data Commons: ML tasks that cover small molecules and biologics, including antibodies, peptides, miRNAs, and gene editing therapies. Original data can be found here.
  • ThermoML Archive: experimental thermophysical and thermochemical property data (in ThermoML XML format)
  • LIT-PCBA: A dataset for virtual screening and machine learning. It contains 15 target sets, 7761 actives, and 382674 unique inactives selected from high-confidence PubChem Bioassay data.
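
Several entries above report values in units that often need converting before use, such as log solubility in mol per litre (ESol) or temperatures in kelvin (BigSolDB). A minimal sketch of those conversions follows. It assumes the logarithm is base 10, and the function names are hypothetical helpers, not part of any of the listed datasets.

```python
def logs_to_mol_per_litre(log_s: float) -> float:
    """Convert log solubility to mol/L, assuming a base-10 logarithm."""
    return 10.0 ** log_s


def logs_to_g_per_litre(log_s: float, molar_mass_g_per_mol: float) -> float:
    """Convert log solubility to g/L using the compound's molar mass."""
    return logs_to_mol_per_litre(log_s) * molar_mass_g_per_mol


def kelvin_to_celsius(temperature_k: float) -> float:
    """Convert a temperature from kelvin to degrees Celsius."""
    return temperature_k - 273.15


# The BigSolDB temperature range quoted above, in degrees Celsius.
low_c = kelvin_to_celsius(243.15)
high_c = kelvin_to_celsius(403.15)
print(round(low_c, 2), round(high_c, 2))
```

Always check each dataset's own documentation for the logarithm base and units before applying a conversion like this.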

Target identification data

  • Open Targets: a large-scale resource that uses human genetics and genomics data for systematic drug target identification and prioritization.
  • Probes & Drugs Portal: an interactive, open data resource for chemical biology. Overview of libraries of bioactive compounds (e.g., ChEMBL, Guide to PHARMACOLOGY), including commercial screening libraries.

Pharmacology & ADME & Metabolism

  • SIDER dataset: The SIDER Side Effect Resource aggregates dispersed public information on drug side effects, since no such resource existed in machine-readable form despite the importance of research on drugs and their effects. Its creation is related to the paper by Campillos, Kuhn et al. (Science, 2008, 321(5886):263-6) on the use of side effects for drug target prediction. Released under CC BY-NC-SA 4.0.
  • Cell Effective Permeability (Caco-2) dataset: a dataset by Wang et al. of permeability measurements in a human colon epithelial cancer cell line (Caco-2), which is used to simulate the absorption of drugs through intestinal tissue.
  • Clinical Trials: single zip file containing all study records (in XML) available on ClinicalTrials.gov
  • Drug–Drug–Interaction (DDI): MedLine abstracts on drug-drug interactions as well as documents describing drug-drug interactions from the DrugBank database.
  • Drug Indications Database (DID): a dataset of structured drug-indication relations. It is intended to facilitate the building of practical, comprehensive, integrated drug ontologies.
  • EPA CompTox: a widely used resource for chemistry, toxicity, and exposure information for hundreds of thousands of chemicals, including, but not limited to, chemical properties, environmental fate and transport, hazard, in vitro to in vivo extrapolation (IVIVE), exposure, and bioactivity (each dataset has its own license).
  • Guide to PHARMACOLOGY: an expert-curated resource of ligand-activity-target relationships. It includes activity data even for entries with an unknown bioactivity value (under CC BY-SA 4.0).
  • KD-DTI: Drug-target-interaction triplets (12K training samples, 1K validation samples and 1.1K test samples). See paper.
  • KEGG PATHWAY Database (KEGG): a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
  • LOTUS: harmonization, curation, validation and open dissemination of 750,000+ referenced structure-organism pairs (relationships between molecular structures and the living organisms from which they were identified).
  • MetXBioDB Metabolite Biotransformations: a comprehensive collection of biotransformation reactions and metabolite information from the BioTransformer database. It includes the transformation and metabolism of metabolites.
  • ONSIDES: A resource of adverse drug effects extracted from FDA structured product labels.
  • PAMPA Permeability and NCATS dataset: data from a commonly employed assay for evaluating drug permeability across the cellular membrane, intended to help in ADME prediction.
  • PsychonautWiki: catalog of mind-altering substances
  • QSAR datasets - Meta-QSAR (phase I & II): Data (extracted from ChEMBL) used in Olier et al. Meta-QSAR: a large-scale application of meta-learning to drug design and discovery.
  • The Human Metabolome Database (HMDB): a freely available electronic database containing detailed information about small molecule metabolites found in the human body.
  • The Metabolism and Transport Database: a cheminformatics and bioinformatics resource that contains curated data related to human small molecule metabolism and transport.

reactions

  • USPTO: Reactions extracted by text-mining from United States patents published between 1976 and September 2016.
  • RDB7: Computational dataset with atom-mapped SMILES, barrier heights, and reaction enthalpies calculated at CCSD(T)-F12, which is known to be very accurate. Geometries are identified via the growing string method in this paper while the high-quality energies are computed in this paper.
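
Reaction datasets such as the ones above are commonly distributed as reaction SMILES, where reactants, agents, and products are separated by `>` and individual molecules by `.`. The sketch below splits such a string into its three parts. The function name and the example reaction are illustrative assumptions, and the code does not handle any extra annotation fields that a particular file may append after the SMILES string.

```python
def split_reaction_smiles(rxn: str) -> dict:
    """Split 'reactants>agents>products' into lists of molecule SMILES."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got: {rxn!r}")
    # Molecules within each part are separated by '.'; drop empty strings
    # so that a missing agents field yields an empty list.
    reactants, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


# Hypothetical example: esterification of acetic acid with ethanol.
example = split_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
print(example)
```

Atom-mapped SMILES pass through unchanged, because the map numbers sit inside the bracketed atoms and never contain `>` or `.`.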

high-throughput screening data

  • Dreher-Doyle: yields and conditions for 3955 Pd-catalysed Buchwald–Hartwig C–N cross-couplings
  • Perera: yields and conditions for 5760 Pd-catalysed Suzuki-Miyaura C-C cross-couplings

eln data

related list

License

CC0