Kekulization / hydrogen adder problem #969

JonasSchaub · 2023-04-14T14:52:16Z

Dear CDK developers,

when I try to parse this MOL file taken from the Supernatural II natural products database, SupernaturalII_SN00236617.txt, with the following code (file ending had to be changed to .txt for pasting it here)...

ClassLoader tmpClassLoader = this.getClass().getClassLoader();
FileReader tmpFileReader = new FileReader(tmpClassLoader.getResource("SupernaturalII_SN00236617.mol").getFile());
MDLV2000Reader tmpMolReader = new MDLV2000Reader(tmpFileReader);
IAtomContainer tmpMolecule = DefaultChemObjectBuilder.getInstance().newInstance(IAtomContainer.class);
tmpMolecule = tmpMolReader.read(tmpMolecule);
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(tmpMolecule);
CDKHydrogenAdder tmpHAdder = CDKHydrogenAdder.getInstance(SilentChemObjectBuilder.getInstance());
tmpHAdder.addImplicitHydrogens(tmpMolecule);
Kekulization.kekulize(tmpMolecule);

... I get the following error:

org.openscience.cdk.exception.CDKException: Cannot assign Kekulé structure without randomly creating radicals.

The structure encoded in the file looks like this (with bond type 4 ("aromatic") in the ring and the positive charge of the nitrogen atom encoded in the properties block):

I would have thought that the structure could be kekulized to this:

What do you think about this? Would this be valid?

If I try to generate the unique SMILES code prior to the last line, I get this error:

org.openscience.cdk.exception.CDKException: Cannot write Kekulé SMILES output due to aromatic bond with unset bond order - molecule should be Kekulized

First, I thought the problem was in the kekulization routine. But when I do this (below) instead of kekulizing...

SmilesGenerator tmpSmiGen = new SmilesGenerator(SmiFlavor.Unique | SmiFlavor.UseAromaticSymbols);
System.out.println(tmpSmiGen.create(tmpMolecule));

... I get this SMILES code as output:

OC1CCC(NC1C)CCCCCCCCCCc2cc(c3CCC[nH+]3c2)CCCCCCCCCCC4NC(C)C(O)CC4

This makes me think that the problem is actually in the hydrogen adding. The explicit hydrogen connected to the charged nitrogen atom was not in the original MOL file (if I read it correctly), so it must have been added by tmpHAdder.addImplicitHydrogens(tmpMolecule).
As a result, the nitrogen is now pentavalent instead of tetravalent, which is not correct in my opinion, so it is no wonder that kekulization fails on this structure. But without adding implicit hydrogen atoms, I cannot work with the molecule: if I try to generate the unique SMILES code (with lower-case letters for aromatic atoms) without adding hydrogens first, I get java.lang.NullPointerException: One or more atoms had an undefined number of implicit hydrogens.

One additional note: RDKit seems to have no problem with this structure and kekulizes it as I would expect.
And one more note: It seems like the MOL file was created with an earlier version of CDK. But I cannot tell you more about it because the file is basically "legacy data" archived from a database that is not available anymore today.

What is your opinion on all this? Is this a bug (in kekulization or implicit hydrogen adding)? Or am I doing/seeing something wrong here? Would you have a potential fix for me?

If you would like, I could also supply you with more molecules/MOL files with the same issue.

Any help would be much appreciated!


johnmay · 2023-04-14T19:29:37Z

Aromatic bonds in molfiles are query features, so you get back a query molecule (or at least one with query bonds) when you try to read it. MDL never allowed aromatic bonds on input, so it’s undefined what the hydrogen count is - nitrogens are particularly problematic. Without aromatic bonds you will get hydrogens added automatically by the reader (because there is a fixed answer).

You are somewhat right that adding hydrogens with the CDK atom typing should fix it, and I’ll look into what is happening - I think the query bond object is messing things up.

The short answer is you should not have aromatic bonds in a molfile unless you only want to use it to do a substructure search.
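
For legacy files like this one, a quick pre-check for bond type 4 can flag the affected records before they reach the reader. The following is a minimal sketch in plain Java (no CDK), assuming a single V2000 record with the standard fixed-width layout (counts line on line 4, bond type in columns 7-9 of each bond line). The class name AromaticBondScan, the method findAromaticBonds and the three-atom EXAMPLE record are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class AromaticBondScan {

    // Hand-written three-atom V2000 record (hypothetical); bond 1 uses type 4.
    static final String EXAMPLE = String.join("\n",
        "example",
        "  hand-written",
        "",
        "  3  2  0  0  0  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        "    1.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        "    2.0000    0.0000    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0",
        "  1  2  4  0",
        "  2  3  1  0",
        "M  END");

    /** Returns the 1-based positions of bond-block lines whose bond type is 4. */
    static List<Integer> findAromaticBonds(String molfile) {
        String[] lines = molfile.split("\\R", -1);
        // Line 4 is the counts line: columns 1-3 = atom count, 4-6 = bond count.
        String counts = lines[3];
        int atomCount = Integer.parseInt(counts.substring(0, 3).trim());
        int bondCount = Integer.parseInt(counts.substring(3, 6).trim());

        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < bondCount; i++) {
            // The bond block starts right after the atom block.
            String bondLine = lines[4 + atomCount + i];
            // Columns 7-9 hold the bond type; 4 means "aromatic" (a query type).
            int type = Integer.parseInt(bondLine.substring(6, 9).trim());
            if (type == 4) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAromaticBonds(EXAMPLE)); // prints [1]
    }
}
```

A non-empty result only says that the record contains query-style aromatic bonds. It does not decide how hydrogens or bond orders should be assigned for them.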

johnmay · 2023-04-14T19:31:54Z

Please also note RDKit is not the arbiter of what is “correct” behaviour - in this case MDL/Symyx/Accelrys/BIOVIA define how molfiles should be read.

JonasSchaub · 2023-04-17T13:13:09Z

> The short answer is you should not have aromatic bonds in a molfile unless you only want to use it to do a substructure search.
>
> Please also note RDKit is not the arbiter of what is “correct” behaviour - in this case MDL/Symyx/Accelrys/BIOVIA define how molfiles should be read.

I totally agree! But well, I am just trying to work with the data that was given to me here, sorry.

Thank you for looking into this matter.
