-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for non-deterministic CIP designation bug #1027
Conversation
Hi thanks for the patch but we should deprecate this code since https://github.com/SiMolecule/centres provides a much more complete implementation of the rules. I'll take a look but the current CDK implementation was lacking in many areas. Paper - https://chemrxiv.org/engage/chemrxiv/article-details/60c73e14337d6ca46ae26282. |
The patch broke compilation, looks like you would need to add the import for the SDF reader? |
Thanks for pointing to the more complete CIP rules. That is certainly what the molwitch-cdk should use in the future. As for the build failing ... that's odd ... the pom.xml adds the ctab dependency in the test scope so that shouldn't be an issue ... The build error I see is this:
None of that is related to the test code or change I've added, so it's pretty surprising. Is it possible that these ERRORs are just verbose logging that don't actually represent a failed compilation or failed test? Can you try the compilation again and see if it finishes on its own if not cancelled? |
Wait, no, this is my fault. I copied and pasted from a local branch too hastily and didn't add the imports. Let me fix that. |
Should work now |
This PR is to correct a CIP bug in CDK. CDK's default CIP rule works well, but has a bug in the recursive part where, if there's a tie between 2 ligands based on atomic number/mass, it will then get all sub-ligands and compare them. The issue is that each set of sub-ligands is sorted by a mass+atomic number priority ONLY, and each sub-ligand from the parent ligand is compared only to its matched counterpart, not to all potentially higher priority ligands. This can lead to wrongly assigning CIP designations.
Consider this case below (4S)-2,3,4,5-tetramethyloctane:
The 4 postion [C] is a stereo center with S configuration (as demonstrated with the : dashed bond to the methyl group below). The carbon atoms at 3, 4 and 5 positions are tied for priority based on first pass CIP rules (atom number and mass). However, 3 and 5 are quickly seen as higher priority than 4' in the first tie-break as 4' has no sub-ligands. 3 and 5, however, both have 2 sub-ligands. 3 has [3',2] and 5 has [5',6] as sub-ligands. The next step CDK CIP rules do is to ORDER these sub-ligands and then compare them. In this case, we need to order [3',2] and [5',6] individually. For [3',2], this is done by comparing only the ATOMIC NUMBER+MASS between 3' and 2, and since they are both carbon atoms, this will result in a tie. This means that if the starting order of [3',2] was [3',2] it will remain that way. If the starting order is [2,3'] it will remain [2,3']. Since the input order of the ligands is based on read-order (order of bonds in the bond table in the CTAB, for example), this means the ordering of the sub-ligands is effectively non-deterministic. 2 is certainly higher priority than 3' in full CIP rules, and 6 is higher priority than 5' as well. In the above cases there are 4 ways the ligand ordering can go:
Notice here that 3 of the 4 possibilities above will give the correct CIP ordering. This is indeed typical. The bug only applies in circumstances where 2 or more "principal" ligands are tied for CIP priority based on atom number and atomic weight AND those 2 ligands both have sub-ligands (non-hydrogens) AND at least 2 sub-ligands of those ligands are tied for atomic weight and atomic mass. Even still, the majority of the time the criteria is met, CDK will still produce the correct CIP labeling. Nevertheless, wrong stereo assignments are very possible and have been observed fairly regularly due to this bug.
The fix present in this PR is simply to have sub-ligands sorted by the full CIP rules. A test was added which permutes the order of the bonds in the case above, and ensures that no matter what order the bonds come in, the CIP designations are always the same. This test did not pass before the fix.