Kekulization / hydrogen adder problem #969
Comments
Aromatic bonds in molfiles are query features, so you get back a query molecule (or at least one with query bonds) when you read such a file. MDL never allowed aromatic bonds on input, so the hydrogen count is undefined; nitrogens are particularly problematic. Without aromatic bonds, the reader adds hydrogens automatically (because there is a fixed answer). You are somewhat right that adding hydrogens with the CDK atom typing should fix it, and I'll look into what is happening. I think the query bond object is messing things up. The short answer is that you should not have aromatic bonds in a molfile unless you only want to use it for a substructure search.
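To check up front whether a file is affected, one option is to scan the V2000 bond block for bond type 4 before handing the file to a reader. The following is only a minimal sketch using the Java standard library, not CDK code: the class name `AromaticBondScan`, both method names, and the three-atom sample molfile are made up for illustration, and it assumes a well-formed V2000 file with fixed-width columns.

```java
import java.util.ArrayList;
import java.util.List;

class AromaticBondScan {

    /**
     * Returns the 1-based indices of bonds whose V2000 bond type is 4
     * ("aromatic", a query bond type). Assumes fixed-width V2000 columns.
     */
    static List<Integer> aromaticBondIndices(String molfile) {
        String[] lines = molfile.split("\r?\n", -1);
        if (lines.length < 4) {
            throw new IllegalArgumentException("not a V2000 molfile: header too short");
        }
        // Counts line is the 4th line: columns 1-3 atom count, 4-6 bond count.
        String counts = lines[3];
        int atomCount = Integer.parseInt(counts.substring(0, 3).trim());
        int bondCount = Integer.parseInt(counts.substring(3, 6).trim());

        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < bondCount; i++) {
            // Bond lines follow the atom block; columns 7-9 hold the bond type.
            String bondLine = lines[4 + atomCount + i];
            int bondType = Integer.parseInt(bondLine.substring(6, 9).trim());
            if (bondType == 4) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    /** Hypothetical three-atom sample, not the structure from this issue. */
    static String sampleMolfile() {
        return String.join("\n",
            "",
            "  sketch",
            "",
            "  3  2  0  0  0  0  0  0  0  0999 V2000",
            "    0.0000    0.0000    0.0000 C   0  0",
            "    1.0000    0.0000    0.0000 C   0  0",
            "    2.0000    0.0000    0.0000 N   0  0",
            "  1  2  4  0",
            "  2  3  1  0",
            "M  END");
    }

    public static void main(String[] args) {
        System.out.println("Aromatic (type 4) bonds: " + aromaticBondIndices(sampleMolfile()));
    }
}
```

A non-empty result would mean the file carries query bonds and may need special handling (or re-export in Kekulé form) before hydrogens are added.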
Please also note that RDKit is not the arbiter of what is “correct” behaviour. In this case MDL/Symyx/Accelrys/BIOVIA define how molfiles should be read.
I totally agree! But well, I am just trying to work with the data that was given to me here, sorry. Thank you for looking into this matter.
Dear CDK developers,
when I try to parse this MOL file taken from the Supernatural II natural products database, SupernaturalII_SN00236617.txt, with the following code (file ending had to be changed to .txt for pasting it here)...
... I get the following error:
The structure encoded in the file looks like this (with bond type 4 ("aromatic") in the ring and the positive charge of the nitrogen atom encoded in the properties block):
I would have thought that the structure could be kekulized to this:
What do you think about this? Would this be valid?
If I try to generate the unique SMILES code prior to the last line, I get this error:
First, I thought the problem was in the kekulization routine. But when I do this (below) instead of kekulizing...
... I get this SMILES code as output:
This makes me think that the problem is actually in the hydrogen adding. The explicit hydrogen connected to the charged nitrogen atom was not in the original MOL file (if I see it correctly) but must have been added in
tmpHAdder.addImplicitHydrogens(tmpMolecule);
This way, the nitrogen is now pentavalent and not tetravalent anymore, which is not correct, in my opinion. "No wonder", kekulization fails on this structure, I would say. But without the addition of implicit hydrogen atoms, I cannot work with the molecule. If I try to generate the unique SMILES code (with lower-case letter encoding for aromatic atoms) without adding hydrogens first, I get
java.lang.NullPointerException: One or more atoms had an undefined number of implicit hydrogens
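The valence argument above can be written out as simple arithmetic. This is only a sketch under stated assumptions, not how CDK computes hydrogen counts: it assumes the charged nitrogen is modelled as tetravalent and has three heavy-atom neighbours (one ring double bond, one ring single bond, and one substituent in a Kekulé form), and the class and method names are hypothetical.

```java
class NitrogenValenceSketch {

    /**
     * Implicit hydrogen count under a simple model:
     * allowed valence minus the sum of Kekulé bond orders, never below zero.
     */
    static int implicitHydrogens(int allowedValence, int bondOrderSum) {
        return Math.max(0, allowedValence - bondOrderSum);
    }

    public static void main(String[] args) {
        int allowedValence = 4;        // assumption: N+ modelled as tetravalent
        int bondOrderSum = 2 + 1 + 1;  // assumption: ring double + ring single + substituent

        int expectedH = implicitHydrogens(allowedValence, bondOrderSum);
        int valenceWithExtraH = bondOrderSum + 1; // what one wrongly added hydrogen does

        System.out.println("Implicit H expected on N+: " + expectedH);
        System.out.println("Valence after one added H: " + valenceWithExtraH);
    }
}
```

Under these assumptions the nitrogen needs no hydrogen at all, and a single added hydrogen pushes it to a valence of five, which would leave no valid Kekulé structure to assign.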
One additional note: RDKit seems to have no problem with this structure and kekulizes it as I would expect.
And one more note: It seems like the MOL file was created with an earlier version of CDK. But I cannot tell you more about it because the file is basically "legacy data" archived from a database that is not available anymore today.
What is your opinion on all this? Is this a bug (in kekulization or implicit hydrogen adding)? Or am I doing/seeing something wrong here? Would you have a potential fix for me?
If you would like, I could also supply you with more molecules/MOL files with the same issue.
Any help would be much appreciated!