Molecule search unable to find some CAS number. #76

Closed
chatelak opened this issue Oct 16, 2014 · 3 comments

Comments

@chatelak

Just for information: I don't know if we can really do something to fix it, but this way it will at least be documented.
This is just to point out some strange behavior in the molecule search tool.

Some CAS numbers are simply not recognized. As @connie suggested, this is because they are not listed in NIST, and the web tool is based on the NIST database.
I found three different cases:
First list: (not working, but these do not exist on NIST, so it makes sense)
2143-69-3
1981-80-2
6067-68-1
15552-77-9
67152-18-5
108179-96-0
86181-68-2
687-97-4
2810-61-9
63707-54-0
309966-76-5
Most of them are radicals; I don't know whether that is why their representations are hard to find in the literature. I finally found most of them thanks to their names and "CAS #" entries in Burcat's database.

Second list: (not recognized at first, then found after drawing the molecule)
2143-58-0 (exists on NIST but no representation given)
436-51-4 (does not exist on NIST)

Third list:
53561-65-2 (exists on NIST but is not found by the tool)

The behavior of the second list is really strange:
These species were not recognized at first, but after I gave the adjacency list by hand, the tool displayed some information about the molecule instead of an error (as it did for the first list). I was very surprised to find my CAS number in that information, and of course after that the CAS number was recognized. After seeing this for the first molecule, I tried entering a CAS number three times before drawing the molecule by hand, and I saw the same behavior a second time with 436-51-4.

The third list arises simply because we are not dynamically linked to the NIST database, so it is not really a problem either.

@rwest
Member

rwest commented Oct 16, 2014

When you enter stuff in the form we try to interpret it as a SMILES string, and if that fails we assume it's a name, requiring a database lookup, which we do using
http://cactus.nci.nih.gov/chemical/structure/

We get the SMILES string via
http://cactus.nci.nih.gov/chemical/structure/67-56-1/SMILES
where 67-56-1 is replaced by whatever you are searching for.

You can find a bit more info about how it was resolved if you visit
http://cactus.nci.nih.gov/chemical/structure/67-56-1/SMILES/xml

For example
http://cactus.nci.nih.gov/chemical/structure/methanol/SMILES/xml
shows it matches both the name_by_opsin and the name_by_cir resolvers.

If you want to force it to use a specific resolver you can request it like this
http://cactus.nci.nih.gov/chemical/structure/67-56-1/SMILES/xml?resolver=cas_number
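The URL patterns above can be sketched in Python roughly as follows. This is only an illustration, not the web tool's actual code: the names `build_url` and `fetch_smiles` are made up for this example, and the only URL shapes it produces are the ones shown in the links above.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def build_url(identifier, representation="SMILES", xml=False, resolver=None):
    """Build a lookup URL of the same shape as the examples above.

    Hypothetical helper: identifier may be a CAS number or a name.
    """
    url = f"{BASE_URL}/{quote(identifier, safe='')}/{representation}"
    if xml:
        # The /xml suffix returns details about how the identifier was resolved.
        url += "/xml"
    if resolver:
        # Force a specific resolver, e.g. "cas_number".
        url += "?" + urlencode({"resolver": resolver})
    return url


def fetch_smiles(identifier, timeout=10):
    """Return the SMILES string for an identifier, or None if the lookup fails.

    Assumes a failed lookup surfaces as an HTTP/network error (an OSError).
    """
    try:
        with urlopen(build_url(identifier), timeout=timeout) as response:
            return response.read().decode("utf-8").strip()
    except OSError:
        return None


print(build_url("67-56-1"))
print(build_url("67-56-1", xml=True, resolver="cas_number"))
```

Calling `fetch_smiles("67-56-1")` needs network access, so whether it returns a SMILES string or `None` depends on the service being reachable and knowing that identifier.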

There is no algorithm to get from CAS numbers to species, so if it's not in the database that http://cactus.nci.nih.gov/ used, then there's not a lot we can do. You could try contacting them to ask where they got their CAS numbers and if they ever update, but I'm guessing the species you list are not "new discoveries", so being out of date is probably not the issue.
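One thing that can be checked without any database is whether a CAS number is well-formed: its last digit is a checksum over the preceding digits. The sketch below (the function name `is_valid_cas` is made up for this example) only catches typos; passing the check says nothing about whether a database actually contains the number.

```python
import re

# CAS numbers have the form NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Return True if the string is formatted like a CAS number and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("67-56-1"))  # methanol, the example used above
```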

If you have CAS numbers for your species then you must have gotten them from somewhere. If that place can give you the SMILES string, use that, because we can interpret any (valid) SMILES without a database lookup. InChI would also work; it is still resolved via http://cactus.nci.nih.gov/chemical/structure/ but should be robust because it is algorithmic and doesn't require a database hit.

Summary: use InChI or SMILES, not CAS numbers, whenever possible, because CAS numbers need a big database and are not unique (although I know a lot of the NIST kinetics database uses CAS numbers...)

@chatelak
Author

You are right, these are not new discoveries.

All my molecules came from a thermo.dat file, and I was very surprised to get a CAS number for almost every species. So, as you can imagine, there is nothing else (no InChI or SMILES) provided to better identify the molecules.

Thanks for the explanation of how the form works. I will run some more tests with the cactus website to see if I can reproduce the second list's behavior.

I thought the CAS # was the safest way to describe a molecule, but as you showed me here, it is not as robust as an algorithmic approach. I think I will put InChI or SMILES in my thermo.dat files from now on.
Many thanks for the advice.

@jonwzheng
Contributor

Closing stale issue. BTW, see #258 - there is a somewhat more robust API for accessing CAS numbers.
The best way to search by CAS number is CAS SciFinder, but API access is restricted under our educational license. You can still use CAS SciFinder to manually search for your molecule of interest and get its SMILES string, however.
