Kekulization / hydrogen adder problem #969

Open
JonasSchaub opened this issue Apr 14, 2023 · 3 comments

Comments

@JonasSchaub
Contributor

Dear CDK developers,

When I try to parse this MOL file taken from the Supernatural II natural products database, SupernaturalII_SN00236617.txt (the file extension had to be changed to .txt to attach it here), with the following code...

ClassLoader tmpClassLoader = this.getClass().getClassLoader();
FileReader tmpFileReader = new FileReader(tmpClassLoader.getResource("SupernaturalII_SN00236617.mol").getFile());
MDLV2000Reader tmpMolReader = new MDLV2000Reader(tmpFileReader);
IAtomContainer tmpMolecule = DefaultChemObjectBuilder.getInstance().newInstance(IAtomContainer.class);
tmpMolecule = tmpMolReader.read(tmpMolecule);
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(tmpMolecule);
CDKHydrogenAdder tmpHAdder = CDKHydrogenAdder.getInstance(SilentChemObjectBuilder.getInstance());
tmpHAdder.addImplicitHydrogens(tmpMolecule);
Kekulization.kekulize(tmpMolecule);

... I get the following error:

org.openscience.cdk.exception.CDKException: Cannot assign Kekulé structure without randomly creating radicals.

The structure encoded in the file looks like this (with bond type 4 ("aromatic") in the ring and the positive charge of the nitrogen atom encoded in the properties block):
image

I would have thought that the structure could be kekulized to this:
image

What do you think about this? Would this be valid?

If I try to generate the unique SMILES code prior to the last line, I get this error:

org.openscience.cdk.exception.CDKException: Cannot write Kekulé SMILES output due to aromatic bond with unset bond order - molecule should be Kekulized

First, I thought the problem was in the kekulization routine. But when I do this (below) instead of kekulizing...

SmilesGenerator tmpSmiGen = new SmilesGenerator(SmiFlavor.Unique | SmiFlavor.UseAromaticSymbols);
System.out.println(tmpSmiGen.create(tmpMolecule));

... I get this SMILES code as output:

OC1CCC(NC1C)CCCCCCCCCCc2cc(c3CCC[nH+]3c2)CCCCCCCCCCC4NC(C)C(O)CC4

This makes me think that the problem is actually in the hydrogen adding. The hydrogen on the charged nitrogen atom was not in the original MOL file (if I read it correctly), so it must have been added by tmpHAdder.addImplicitHydrogens(tmpMolecule).
With this hydrogen, the nitrogen is pentavalent instead of tetravalent, which is not correct, in my opinion. No wonder kekulization fails on this structure, I would say. But without the addition of implicit hydrogen atoms, I cannot work with the molecule. If I try to generate the unique SMILES code (with lower-case letter encoding for aromatic atoms) without adding hydrogens first, I get this error:

java.lang.NullPointerException: One or more atoms had an undefined number of implicit hydrogens.
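To illustrate why the extra hydrogen would make kekulization impossible, here is a minimal sketch in plain Java. It is not CDK's Kekulization code, only a toy backtracking search for a perfect matching over the six-membered aromatic ring. The class name, the atom numbering (0 to 5, with the nitrogen at index 4), and the `needsDoubleBond` flags are all assumptions made for this illustration. A nitrogen cation with three ring neighbours plus a hydrogen already has four single bonds, so it is flagged as unable to take a double bond.

```java
import java.util.Arrays;

/** Toy kekulizer: assigns double bonds by searching for a perfect matching. */
final class ToyKekulizer {

    /**
     * Tries to give every atom flagged in needsDoubleBond exactly one double
     * bond along the given aromatic bonds. Returns partner[i] = j for each
     * double bond i=j, or null if no valid assignment exists.
     */
    static int[] kekulize(int atomCount, int[][] aromaticBonds, boolean[] needsDoubleBond) {
        int[] partner = new int[atomCount];
        Arrays.fill(partner, -1);
        return match(0, aromaticBonds, needsDoubleBond, partner) ? partner : null;
    }

    private static boolean match(int atom, int[][] bonds, boolean[] needs, int[] partner) {
        // Skip atoms that take no double bond or already have one.
        while (atom < partner.length && (!needs[atom] || partner[atom] != -1)) {
            atom++;
        }
        if (atom == partner.length) {
            return true; // every atom that needs a double bond has one
        }
        for (int[] bond : bonds) {
            int other;
            if (bond[0] == atom) {
                other = bond[1];
            } else if (bond[1] == atom) {
                other = bond[0];
            } else {
                continue; // bond does not involve this atom
            }
            if (!needs[other] || partner[other] != -1) {
                continue; // neighbour cannot accept a double bond
            }
            partner[atom] = other;
            partner[other] = atom;
            if (match(atom + 1, bonds, needs, partner)) {
                return true;
            }
            partner[atom] = -1; // undo and try the next neighbour
            partner[other] = -1;
        }
        return false;
    }

    public static void main(String[] args) {
        // Six-membered aromatic ring; index 4 stands for the charged nitrogen.
        int[][] ring = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 0}};

        boolean[] withoutExtraH = {true, true, true, true, true, true};
        boolean[] withExtraH = {true, true, true, true, false, true};

        System.out.println("N without H: " + (kekulize(6, ring, withoutExtraH) != null));
        System.out.println("N with H:    " + (kekulize(6, ring, withExtraH) != null));
    }
}
```

With the nitrogen available, the six ring atoms pair up into three double bonds. With the nitrogen saturated by the added hydrogen, five carbon atoms remain and an odd number of atoms cannot be paired, which would be consistent with the "without randomly creating radicals" error above.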

One additional note: RDKit seems to have no problem with this structure and kekulizes it as I would expect.
And one more note: It seems like the MOL file was created with an earlier version of CDK. But I cannot tell you more about it because the file is basically "legacy data" archived from a database that is not available anymore today.

What is your opinion on all this? Is this a bug (in kekulization or implicit hydrogen adding)? Or am I doing/seeing something wrong here? Would you have a potential fix for me?

If you would like, I could also supply you with more molecules/MOL files with the same issue.

Any help would be much appreciated!

@johnmay
Member

johnmay commented Apr 14, 2023

Aromatic bonds in molfiles are query features, so you get back a query molecule (or at least one with query bonds) when you try to read it. MDL never allowed aromatic bonds on input, so it’s undefined what the hydrogen count is. Nitrogens are particularly problematic. Without aromatic bonds you will get hydrogens added automatically by the reader (because there is a fixed answer).
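The ambiguity can be made concrete with a small arithmetic sketch in plain Java. It is not how the CDK reader or CDKHydrogenAdder computes hydrogens. The class and method names are hypothetical, and it assumes a simplified valence model in which the implicit hydrogen count is the target valence minus the bond order sum, and each atom gets at most one double bond out of its aromatic bonds.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Lists the hydrogen counts an atom could have when some bonds are aromatic. */
final class ImplicitHydrogenRange {

    /**
     * @param targetValence     assumed valence of the atom (e.g. 3 for N, 4 for N+)
     * @param fixedBondOrderSum sum of the orders of all non-aromatic bonds
     * @param aromaticBondCount number of bonds of type 4 on this atom
     * @return every non-negative hydrogen count compatible with the inputs
     */
    static SortedSet<Integer> possibleHydrogenCounts(
            int targetValence, int fixedBondOrderSum, int aromaticBondCount) {
        SortedSet<Integer> counts = new TreeSet<>();
        // Assumption: at most one aromatic bond on the atom ends up double.
        int maxDoubles = Math.min(1, aromaticBondCount);
        for (int doubles = 0; doubles <= maxDoubles; doubles++) {
            // Every aromatic bond counts at least 1; a double one counts 2.
            int bondOrderSum = fixedBondOrderSum + aromaticBondCount + doubles;
            int hydrogens = targetValence - bondOrderSum;
            if (hydrogens >= 0) {
                counts.add(hydrogens);
            }
        }
        return counts;
    }
}
```

For a neutral nitrogen with two aromatic bonds, `possibleHydrogenCounts(3, 0, 2)` yields both 0 (pyridine-like) and 1 (pyrrole-like). For the charged nitrogen in this issue, `possibleHydrogenCounts(4, 1, 2)` also yields both 0 and 1. With explicit single and double bonds, for example `possibleHydrogenCounts(3, 3, 0)`, only one answer remains.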

You are somewhat right that adding hydrogens with the CDK atom typing should fix it, and I’ll look into what is happening. I think the query bond object is messing things up.

The short answer is you should not have aromatic bonds in a molfile unless you only want to use it to do a substructure search.
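For screening a batch of legacy files for this situation before handing them to a toolkit, a minimal sketch in plain Java could look like the following. It is not part of CDK. The class and method names are hypothetical, and it assumes well-formed V2000 files with the fixed-column layout (counts line on line 4, bond type in columns 7 to 9 of each bond line).

```java
import java.util.ArrayList;
import java.util.List;

/** Scans the bond block of a V2000 molfile for query bond types (4 to 8). */
final class MolfileQueryBondScan {

    /** Returns the 1-based indices of bonds whose type field is 4 or higher. */
    static List<Integer> findQueryBonds(List<String> lines) {
        // Line 4 is the counts line: columns 1-3 atom count, 4-6 bond count.
        String counts = lines.get(3);
        int atomCount = Integer.parseInt(counts.substring(0, 3).trim());
        int bondCount = Integer.parseInt(counts.substring(3, 6).trim());

        List<Integer> queryBonds = new ArrayList<>();
        int firstBondLine = 4 + atomCount; // 0-based index into lines
        for (int i = 0; i < bondCount; i++) {
            String bondLine = lines.get(firstBondLine + i);
            // Columns 7-9 hold the bond type; 4 means "aromatic".
            int bondType = Integer.parseInt(bondLine.substring(6, 9).trim());
            if (bondType >= 4) {
                queryBonds.add(i + 1);
            }
        }
        return queryBonds;
    }

    public static void main(String[] args) {
        // Hypothetical three-atom fragment, not taken from the attached file.
        List<String> molfile = List.of(
            "",
            "  hypothetical example",
            "",
            "  3  2  0  0  0  0  0  0  0  0999 V2000",
            "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
            "    1.2000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
            "    2.4000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
            "  1  2  4  0",
            "  2  3  1  0",
            "M  END");
        System.out.println("Query bonds: " + findQueryBonds(molfile));
    }
}
```

Files for which the returned list is non-empty could then be set aside for separate handling rather than being treated as fully specified structures.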

@johnmay
Member

johnmay commented Apr 14, 2023

Please also note that RDKit is not the arbiter of what is “correct” behaviour. In this case MDL/Symyx/Accelrys/BIOVIA define how molfiles should be read.

@JonasSchaub
Contributor Author

> The short answer is you should not have aromatic bonds in a molfile unless you only want to use it to do a substructure search.
> Please also note that RDKit is not the arbiter of what is “correct” behaviour. In this case MDL/Symyx/Accelrys/BIOVIA define how molfiles should be read.

I totally agree! But well, I am just trying to work with the data that was given to me here, sorry.

Thank you for looking into this matter.
