Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for non-deterministic CIP designation bug #1027

Merged
merged 5 commits into from
Dec 6, 2023
Merged

Conversation

tylerperyea
Copy link
Contributor

This PR is to correct a CIP bug in CDK. CDK's default CIP rule works well, but has a bug in the recursive part where, if there's a tie between 2 ligands based on atomic number/mass, it will then get all sub-ligands and compare them. The issue is that each set of sub-ligands is sorted by a mass+atomic number priority ONLY, and each sub-ligand from the parent ligand is compared only to its matched counterpart, not to all potentially higher priority ligands. This can lead to wrongly assigning CIP designations.

Consider this case below (4S)-2,3,4,5-tetramethyloctane:

  
     3'    5'
     M     C
     |  4  | 
 M-C-C-[C]-C-C-C-M
   | 3  :  5
   M    M
        4'
 
 M = Methyl
 C = Carbon
 : = Dashed bond
 # = Locant 

The 4 postion [C] is a stereo center with S configuration (as demonstrated with the : dashed bond to the methyl group below). The carbon atoms at 3, 4 and 5 positions are tied for priority based on first pass CIP rules (atom number and mass). However, 3 and 5 are quickly seen as higher priority than 4' in the first tie-break as 4' has no sub-ligands. 3 and 5, however, both have 2 sub-ligands. 3 has [3',2] and 5 has [5',6] as sub-ligands. The next step CDK CIP rules do is to ORDER these sub-ligands and then compare them. In this case, we need to order [3',2] and [5',6] individually. For [3',2], this is done by comparing only the ATOMIC NUMBER+MASS between 3' and 2, and since they are both carbon atoms, this will result in a tie. This means that if the starting order of [3',2] was [3',2] it will remain that way. If the starting order is [2,3'] it will remain [2,3']. Since the input order of the ligands is based on read-order (order of bonds in the bond table in the CTAB, for example), this means the ordering of the sub-ligands is effectively non-deterministic. 2 is certainly higher priority than 3' in full CIP rules, and 6 is higher priority than 5' as well. In the above cases there are 4 ways the ligand ordering can go:

 Possibility 1: both sub-ligands in right order
    3 vs 5 =>
    [2,3'] vs [6,5'] => 
    2 vs 6 => 2 is higher priority
    therefore 3 is higher priority (correct)
 
 Possibility 2: 3 sub-ligands wrong order, 5 right order
    3 vs 5 =>
    [3',2] vs [6,5'] => 
    3' vs 6 => 6 is higher priority
    therefore 5 is higher priority (incorrect)
 
 Possibility 3: 3 sub-ligands RIGHT order, 5 wrong order
    3 vs 5 =>
    [2,3'] vs [5',6] => 
    2 vs 5' => 2 is higher priority
    therefore 3 is higher priority (correct)
 
 Possibility 4: both sub-ligands in WRONG order
    3 vs 5 =>
    [3',2] vs [5',6] => 
    3' vs 5' => TIE, go to next
    2 vs 6 => 2 is higher priority
    therefore 3 is higher priority (correct)

Notice here that 3 of the 4 possibilities above will give the correct CIP ordering. This is indeed typical. The bug only applies in circumstances where 2 or more "principal" ligands are tied for CIP priority based on atom number and atomic weight AND those 2 ligands both have sub-ligands (non-hydrogens) AND at least 2 sub-ligands of those ligands are tied for atomic weight and atomic mass. Even still, the majority of the time the criteria is met, CDK will still produce the correct CIP labeling. Nevertheless, wrong stereo assignments are very possible and have been observed fairly regularly due to this bug.

The fix present in this PR is simply to have sub-ligands sorted by the full CIP rules. A test was added which permutes the order of the bonds in the case above, and ensures that no matter what order the bonds come in, the CIP designations are always the same. This test did not pass before the fix.

@johnmay
Copy link
Member

johnmay commented Dec 5, 2023

Hi thanks for the patch but we should deprecate this code since https://github.com/SiMolecule/centres provides a much more complete implementation of the rules. I'll take a look but the current CDK implementation was lacking in many areas.

Paper - https://chemrxiv.org/engage/chemrxiv/article-details/60c73e14337d6ca46ae26282.

@johnmay
Copy link
Member

johnmay commented Dec 5, 2023

The patch broke compilation, looks like you would need to add the import for the SDF reader?

@tylerperyea
Copy link
Contributor Author

Thanks for pointing to the more complete CIP rules. That is certainly what the molwitch-cdk should use in the future.

As for the build failing ... that's odd ... the pom.xml adds the ctab dependency in the test scope so that shouldn't be an issue ...

The build error I see is this:

org.openscience.cdk.io.MDLReader ERROR: Error while parsing line 4:   6  6  0  0  0  0            999 V2000 -> This file must be read with the MDLV2000Reader.
org.openscience.cdk.io.iterator.IteratingSDFReader ERROR: Error while reading next molecule: String index out of range: 3
org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 13:   1  2  1  0  0  0  0 -> Atom property ZZC is illegal in STRICT mode
org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 2[8](https://github.com/cdk/cdk/actions/runs/7100792671/job/19328651406?pr=1027#step:4:9):   [9](https://github.com/cdk/cdk/actions/runs/7100792671/job/19328651406?pr=1027#step:4:10) 12  1  0  0  0  0 -> Sgroups must first be defined by a STY property
org.openscience.cdk.config.IsotopeFactory ERROR: Could not find major isotope for: [10](https://github.com/cdk/cdk/actions/runs/7100792671/job/19328651406?pr=1027#step:4:11)6
org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 5:    -0.2891    0.8673    0.0000 Sg  1  0  0  0  0  0  0  0  0  0  0  0 -> Unstable use of mass delta on Sg please use M  ISO
org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 17:   1  8  1  0  0  0  0 -> Invalid atom index in bond block in line 17:   1  8  1  0  0  0  0
org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 62:   0  0  1  0  0  0  0 -> Invalid atom index in bond block in line 62:   0  0  1  0  0  0  0
org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 28:   9 [12](https://github.com/cdk/cdk/actions/runs/7100792671/job/19328651406?pr=1027#step:4:13)  1  0  0  0  0 -> Unknown SCN type (expected: HH, HT, or EU) was ??
org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 6:     3.0739    0.[15](https://github.com/cdk/cdk/actions/runs/7100792671/job/19328651406?pr=1027#step:4:16)50    0.0000 D   1  0  0  0  0  0  0  0  0  0  0  0 -> invalid symbol: D
org.openscience.cdk.io.MDLV[20](https://github.com/cdk/cdk/actions/runs/7100792671/job/19328651406?pr=1027#step:4:21)00Writer ERROR: Bond at idx=0 is not supported by Molfile, bond=QUADRUPLE
org.openscience.cdk.io.MDLV2000Writer ERROR: Bond at idx 0 was an unspecific aromatic bond which should only be used for queries in Molfiles. These can be written if desired by enabling the option 'WriteAromaticBondTypes'.
org.openscience.cdk.io.cml.CMLCoreModule ERROR: Could not find crystal unit cell parameters
org.openscience.cdk.io.cml.CMLCoreModule ERROR: Could not find crystal unit cell parameters
org.openscience.cdk.io.FormatFactory ERROR: cannot parse chemical file formatFor input string: "C(#N)c1cc(c([nH]c1=O)C)c1ccncc1"
org.openscience.cdk.io.PMPReader ERROR: Cannot construct PMP object type: Anchor
org.openscience.cdk.io.PMPReader ERROR: Cannot construct PMP object type: Anchor
org.openscience.cdk.dict.DictionaryDatabase ERROR: Could not read dictionary org/openscience/cdk/dict/data/elements.owl
Error: The operation was canceled.

None of that is related to the test code or change I've added, so it's pretty surprising. Is it possible that these ERRORs are just verbose logging that don't actually represent a failed compilation or failed test?

Can you try the compilation again and see if it finishes on its own if not cancelled?

@tylerperyea
Copy link
Contributor Author

Wait, no, this is my fault. I copied and pasted from a local branch too hastily and didn't add the imports. Let me fix that.

@tylerperyea
Copy link
Contributor Author

Should work now

@johnmay johnmay merged commit f9dd51c into cdk:main Dec 6, 2023
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants