Skip to content

Latest commit

 

History

History
324 lines (235 loc) · 15.7 KB

ch07_graph.asciidoc

File metadata and controls

324 lines (235 loc) · 15.7 KB

Chapter 7: Assessing similarity using graph structures

jupyter

A graph is data consisting of nodes (vertices) and edges (branches) that indicate the connection between nodes. The chemical structure can be represented by this graph. In other words, we can represent atoms in a graph with nodes and bonds as edges.

In general, fingerprints like those introduced in Chapter 6 are often used to evaluate the similarity between molecules, but there is also a method to evaluate similarity using a graph structure. The MCS (Maximum Common Substructure) introduced next refers to the common substructure of the target molecule set. The more common substructures, the more similar their molecules are.

Classification by major skeleton (MCS)

Maximum Common Substructure (MCS) is the largest common substructure in a given group of chemical structures. RDKit provides a module called rdFMCS for MCS search.

This time, we will use the file cdk2.sdf provided in rdkit as sample data for MCS search. RDConfig.RDDocsDir is a variable that represents the directory of sample data, and there is a file called cdk2.sdf under the Books/data/ directory, so set the file path with the os.path.join method. Note that os.path.join is a python built-in module for absorbing differences in os paths.

import os
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import rdFMCS
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
filepath = os.path.join(RDConfig.RDDocsDir, 'Book', 'data', 'cdk2.sdf')
mols = [mol for mol in Chem.SDMolSupplier(filepath)]
# 構造を確認します
Draw.MolsToGridImage(mols[:7], molsPerRow=5)
compounds

Acquires MCS using the loaded molecule. With RDKit, you can specify multiple options for how to get MCS. The following shows an example of each option.

  1. Default

  2. Any atom can be used (as long as there is an order of structure and bond)

  3. The bond order may be any (for example, benzene and cyclohexane have the same MCS)

result1 = rdFMCS.FindMCS(mols[:7])
mcs1 = Chem.MolFromSmarts (result1.smartsString)
mcs1
print(result1.smartsString)
#[#6]1:[#7]:[#6](:[#7]:[#6]2:[#6]:1:[#7]:[#6]:[#7]:2)-[#7]
MCS01
result2 = rdFMCS.FindMCS(mols[:7], atomCompare=rdFMCS.AtomCompare.CompareAny)
mcs2 = Chem.MolFromSmarts(result2.smartsString)
mcs2
print(result2.smartsString)
#[#6]-,:[#6]-,:[#6]-[#6]-[#8,#7]-[#6]1:[#7]:[#6](:[#7]:[#6]2:[#6]:1:[#7]:[#6]:[#7]:2)-[#7]
MCS02
result3 = rdFMCS.FindMCS(mols[:7], bondCompare=rdFMCS.BondCompare.CompareAny)
mcs3 = Chem.MolFromSmarts(result3.smartsString)
mcs3
print(result3.smartsString)
#[#6]1:[#7]:[#6](:[#7]:[#6]2:[#6]:1:[#7]:[#6]:[#7]:2)-[#7]
MCS03

In RDKit, Fraggle Similarity is implemented as an algorithm to quantify similarity based on MCS. By using this, clustering and analysis based on similarity can be performed.

from rdkit.Chem.Fraggle import FraggleSim
sim, match = FraggleSim.GetFraggleSimilarity(mols[0], mols[1])
print(sim, match)
#0.925764192139738 *C(C)C.*COc1nc(N)nc2[nH]cnc12
match_st = Chem.MolFromSmiles(match)
match_st
FraggleSimilarity

Thus, FraggleSimilarity returns similarities and matched substructures. It is often closer to the feeling of a chemist than the similarity using ECFP. Please refer to the reference link for details.

Reference link

Matched Molecular Pair and Matched Molecular Series

jupyter

At the structural optimization stage of drug discovery research, how to convert the starting compound (lead compound) is very important, but as the stage progresses, which structural conversion affects the activity and physical properties It is also very important to carry out a retrospective analysis of what it has exerted.

Matched Molecular Pair (MMP) is a pair of molecules that differ only in the partial structure of some of the two molecules but are otherwise identical. As an example, chlorobenzene and fluorobenzene are MMPs because they differ only in Cl and F groups. By analyzing a large number of changes in the characteristics of such pairs, you can grasp the trend of substituent conversion. This is called Matched Molecular Pair Analyisis (MMPA). By performing MMPA on large-scale data, it is possible to extract universal rules for property changes caused by substituent changes. If you understand these rules, you will be able to proceed efficiently with structural optimization.

Here, we analyze MMP using RDKit/Contrib/MMPA[mmpa] provided in Contrib of RDKit.

Move to Contrib/mmpa under the RDKit installation location and execute the python script sequentially.

python rfrag.py <name of the File you want to implement the MMPA> #save file name of the data that was fragmented
# For example
# python rfrag.py <data/sample.smi >data/sample_fragmented.txt

python indexing.py <can be in the previous command Fragment file > MMP_ output file.CSV
# eg
# python index.py <data/sample_fragmented.txt >data/mmp.csv

Executing the above command will generate a csv file of molecule A, molecule B, ID of molecule A, ID of molecule B, SMIRKS of converted structure, and common part structure (context). MMPA can be performed by linking activity and physical properties based on this data.

Note
SMIRKS is a method to express conversion of molecules by string notation like SMILES.

A method called Matched Molecular Series (MMS) has also been proposed as an extension of MMP. Although MMP is a pair of molecules, MMS is a list of this pair as a group of three or more with common structure.

I will actually make MMS. The following example uses data from Factor Xa in ChEMBL. For the implementation of MMS, we use the code of the presentation by Noel O’Byle’s RDKit UGM.

Let’s actually make an MMS. In the following example, Factor Xa data was downloaded from ChEMBL and used as an example. For the implementation of MMS, we use the code of the presentation by Noel O’Byle’s RDKit UGM .

First, loading the library to be used, loading the data, and desalting using SaltRemover.

import sys
import os
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdMMPA
from rdkit.Chem import RDConfig
from rdkit.Chem import rdBase
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import SaltRemover
mmpapath = os.path.join(RDConfig.RDContribDir, 'mmpa')
sys.path.append(mmpapath)
df = pd.read_csv('Chembl_FXa.txt', sep='\t')
remover = SaltRemover.SaltRemover()
mols = []
for i, smi in enumerate(df.CANONICAL_SMILES):
    try:
        mol = Chem.MolFromSmiles(smi)
        mol.SetProp('CMPD_CHEMBLID', df.CMPD_CHEMBLID[i])
        mol = remover.StripMol(mol)
        mols.append(mol)
    except:
        print(smi)

Then, import the mmpa rfrag registered in RDKit contrib, and divide the molecule into fragments.

import rfrag
rfragdata = []
for i, smi in enumerate(df.CANONICAL_SMILES):
    try:
        out = rfrag.fragment_mol(smi, df.CMPD_CHEMBLID[i])
        rfragdata.append(out)
    except:
        print(smi, df.CMPD_CHEMBLID[i])

Define a function to create an MMS. The code is almost the same as that described in the UGM document, but I changed the reading destination from a file to a list in order to do all processing on Jupyter.

Here is an overview of the MMS creation process.

  1. Cut each molecule according to a certain rule (cut by rotatable bond etc.)

  2. Cut fragments create a dictionary of keys, store the fragments of molecules with the same key in the dictionary value

By repeating the above process, molecules with common scaffold can be organized. Molecules that are grouped in a common scaffold will be molecules that have different non-scaffold substituents.

What is a scaffold?

In drug discovery, there is a stage of structural optimization at the stage before preclinical studies, in which the major non-skeleton part of the compound is converted briefly into a balanced property suitable for drugs.

This main skeleton is called a scaffold. For example, in this patent, the part except R is fixed and this main skeleton is called a scaffold.

scaffold
from collections import namedtuple

Frag = namedtuple( 'Frag', ['id', 'scaffold', 'rgroup'] )

class Series():
    def __init__( self ):
        self.rgroups = []
        self.scaffold = ""

def getFrags(rfrags):
    frags = []
    for lines in rfrags:
        for line in lines:
            broken = line.rstrip().split(",")
            if broken[2]: # single cut
                continue
            smiles = broken[-1].split(".")
            mols = [Chem.MolFromSmiles( smi ) for smi in smiles]
            numAtoms = [mol.GetNumAtoms() for mol in mols]
            if len(numAtoms) < 2:
                continue
            if numAtoms[0] > 5 and numAtoms[1] < 12:
                frags.append(Frag(broken[1], smiles[0], smiles[1]))
            if numAtoms[1] > 5 and numAtoms[0] < 12:
                frags.append(Frag(broken[1], smiles[1], smiles[0]))
    frags.sort(key=lambda x:(x.scaffold, x.rgroup))
    return frags

def getSeries(frags):
    oldfrag = Frag(None, None, None)
    series = Series()
    for frag in frags:
        if frag.scaffold != oldfrag.scaffold:
            if len(series.rgroups) >= 2:
                series.scaffold = oldfrag.scaffold
                yield series
            series = Series()
        series.rgroups.append((frag.rgroup, frag.id))
        oldfrag = frag
    if len(series.rgroups) >= 2:
        series.scaffold = oldfrag.scaffold
        yield series

We are ready to make an MMS. Visualize only data that has four or more substituent conversions for the same scaffold.

frags = getFrags(rfragdata)
series = getSeries(frags)
series =[i for i in series]
from IPython.display import display
for s in series[:50]:
    mols = [Chem.MolFromSmiles(s.scaffold)]
    ids = ['scaffold']
    for r in s.rgroups:
        rg = Chem.MolFromSmiles(r[0])
        mols.append(rg)
        ids.append(r[1])
    if len(mols) > 5:
        display(Draw.MolsToGridImage(mols, molsPerRow=5, legends=ids))
        print("########")
MMS

Five scaffolds for MMS were displayed for the scaffold.

Note
Activity prediction can also be performed using this MMS.

Visualize MMP networks using Cytoscape

Warning
This content is beyond the content of the introductory, so please skip if you are not interested.

MMP can be thought of as a graph structure that uses pre-conversion and post-conversion information as nodes and conversion rules as edges. This graph structure can be intuitively understood by using network visualization tools such as Cytoscape.

In addition to the MMPA introduced earlier, RDKit has another project called mmpdb. It is provided as a command line tool group and database system, so it has the feature of being easy to manage in the long run. In this section, we introduce the visualization of MMP using mmpdb and Cytoscape.

Cytoscape installation

Cytoscape is an open source network visualization software widely used in various scenes. You can display the structure network by using the compound structure display plug-in.

Installation is as easy as downloading the corresponding OS installer from the download site and installing according to the instructions.

When installation is complete, launch Cytoscape and install the Chemviz2 plug-in for drawing compound structures. The procedure is easy, select chemviz2 from Apps → App Manager and install it.

AppManager

create a gml file from mmpdb

The data to be used this time are 151 compounds of <Inhibition of recombinant GSK3-beta> J. Med. Chem. (2008) 51: 2062-2077 . In principle, MMPA does not use HTS-like search data but scaffolds such as structure optimization.

I will put the flow of the command. SMILES text and activity and property data need to be registered separately in the database.

$ mmpdb fragment smiles.txt -o CHEMBL930273.fragments     # fragmentation
$ mmpdb index CHEMBL930273.fragments -o CHEMBL930273.db   # make db
$ mmpdb loadprops -p act.txt CHEMBL930273.db              # load properties

After that we will create a gml file for reading by Cytoscape, but this is beyond the scope of this document and will be omitted. If you are interested, you may want to read the code directly, but the flow is as follows.

  1. Make a gml file using mmpdb and python-igraph

  2. Read gml file by Cytoscape

  3. Assign attributes to each parameter in Cytoscape to make it easier to understand visually

    1. Corresponds to the physical value of the node size

    2. Corresponds to the active color of the edge color

    3. Draw a structure with chemviz2 plugin and paste it to a node

Interpretation

Let’s look at the MMP network. MMP with little difference in activity is solidified in the upper left. In the lower right, red edges (a large difference in activity) are observed. MMPs are also called Activity Cliffs, even if such small substituent changes produce large activity differences. It is important not to overlook such changes in activity, as Activity Cliff is generally a breakthrough in drug discovery projects.

MMPN MMPN

It has been found that the substitution of the OH group with the MeO group causes the loss of activity when we actually confirm what substitution has been made.

Since MMP alone can simply know the facts like this, I searched for a complex crystal structure of the analogue in order to consider it a little deeper. Then , a complex of GDB3β and a similar compound was found as PDBID:5OY4.

MMPN MMPN

If you replace the OH group with the MeO group, it will likely hit the wall of the pocket. In other words, this Activity Cliff is considered to be caused by steric hindrance of ligand and protein.