This dataset contains coarse-grained (CG) mappings of 1206 organic molecules with fewer than 25 heavy atoms. Each molecule was downloaded from the PubChem database as a SMILES string. Each molecule was assigned to two annotators so that the human agreement between CG mappings could be compared. The downloaded SMILES were hand-mapped using a web app developed by the White Lab. A third person reviewed the completed annotations to identify and remove unreasonable mappings (e.g., one-bead mappings) that did not follow the given guidelines. As a result, the current database contains an average of 1.68 annotations per molecule (16% of the original two annotations per molecule were removed).
Images of the generated mappings are given in the figures. These images were generated using RDKit.
We store each molecular graph structure and its mapping in JSON format. Here is an example (some content is omitted, and the text after each # is an explanatory annotation):
{
"cgnodes": [[0,1,2], [3,4,5]], # [[fg_id...], [fg_id...]],
"nodes": [
{
"cg":2, # cg group_id (starts with 0)
"element":"C", # atom type
"id":0 # fg id
},
{...}
],
"edges": [
{
"source":0, # from fg_id
"target":1, # to fg_id
"bondtype": 1.0 # bond type (1.0, 1.5, 2.0, 3.0)
},
{...}
],
"smiles": "C[Si](C)O[Si](C)(CCl)F"
}
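The format above can be read with the Python standard library alone. The following is a minimal sketch, not part of the dataset tooling: the record in SAMPLE is a hypothetical ethanol mapping written in the documented schema (it is not taken from the dataset), the function names load_mapping, is_consistent and cg_edges are illustrative, and the code assumes that a node's "cg" value is the index of its bead in "cgnodes".

```python
import json
from collections import defaultdict

# Hypothetical record in the documented schema (NOT taken from the dataset).
SAMPLE = """
{
  "cgnodes": [[0, 1], [2]],
  "nodes": [
    {"cg": 0, "element": "C", "id": 0},
    {"cg": 0, "element": "C", "id": 1},
    {"cg": 1, "element": "O", "id": 2}
  ],
  "edges": [
    {"source": 0, "target": 1, "bondtype": 1.0},
    {"source": 1, "target": 2, "bondtype": 1.0}
  ],
  "smiles": "CCO"
}
"""


def load_mapping(text):
    """Parse one record; return it with bead membership and atom adjacency."""
    record = json.loads(text)
    # Group fine-grained atom ids by the CG bead they are assigned to.
    beads = defaultdict(list)
    for node in record["nodes"]:
        beads[node["cg"]].append(node["id"])
    # Undirected adjacency of the fine-grained molecular graph.
    adjacency = defaultdict(set)
    for edge in record["edges"]:
        adjacency[edge["source"]].add(edge["target"])
        adjacency[edge["target"]].add(edge["source"])
    return record, dict(beads), dict(adjacency)


def is_consistent(record, beads):
    """Check that per-node "cg" labels agree with the "cgnodes" lists.

    Assumption: bead i in "cgnodes" corresponds to "cg" == i.
    """
    cgnodes = record["cgnodes"]
    if len(beads) != len(cgnodes):
        return False
    return all(
        sorted(beads.get(i, [])) == sorted(atoms)
        for i, atoms in enumerate(cgnodes)
    )


def cg_edges(record):
    """Return bead pairs that are joined by at least one atomic bond."""
    cg_of = {node["id"]: node["cg"] for node in record["nodes"]}
    pairs = set()
    for edge in record["edges"]:
        a, b = cg_of[edge["source"]], cg_of[edge["target"]]
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    return pairs


record, beads, adjacency = load_mapping(SAMPLE)
print(beads, is_consistent(record, beads), cg_edges(record))
```

For a real file, the same functions can be applied to the text returned by reading that file. Running is_consistent first is a cheap way to find out whether the index assumption above holds for a given record.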
Based on this Human-annotated Mappings (HAM) dataset, we propose the Deep Supervised Graph Partitioning Model (DSGPM) (see the citation below) for predicting CG mappings of unseen molecular graphs. The DSGPM code is available at https://github.com/rochesterxugroup/DSGPM .
Please cite our paper if you use our dataset:
@article{li_2020_chem_sci,
author = {Li, Zhiheng and Wellawatte, Geemi P. and Chakraborty, Maghesree and Gandhi, Heta A. and Xu, Chenliang and White, Andrew D.},
journal = {Chemical Science},
title = {Graph Neural Network Based Coarse-Grained Mapping Prediction},
year = {2020}
}