Update atom mapping and reactant detection logic #21

Open
dswigh opened this issue Apr 8, 2023 · 1 comment
Labels
documentation Improvements or additions to documentation wontfix This will not be worked on

dswigh commented Apr 8, 2023

Atom mapping in the USPTO dataset was done with Indigo more than six years ago (https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873), and better atom-mapping tools have been created since, e.g. RXNMapper (https://onlinelibrary.wiley.com/doi/10.1002/minf.202100138). Although RXNMapper may be better than Indigo, the benchmarking study linked above may be slightly misleading about how much better it is, because the benchmark dataset was specifically curated to include very difficult reactions. Both tools are likely to perform very well on 'easy' reactions, so on a more realistic dataset that contains both easy and hard reactions, the difference in mapping performance will likely be smaller.

With a better atom mapping, it may also be possible to expand the scope of reactant detection in a reaction string, e.g. by detecting previously unmapped atoms in the product, finding these atoms among the agents, and then moving those agents to the reactants.
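
A minimal sketch of that reactant-detection idea, assuming the reaction has already been re-mapped so that contributing agents now carry atom-map numbers. The function name `promote_mapped_agents` and the regex-based parsing are illustrative assumptions, not code from this repository; a real implementation would probably parse the SMILES with a cheminformatics toolkit, and would need to decide how to treat counter-ions that are written as separate fragments.

```python
import re

# Matches the atom-map number inside a bracket atom, e.g. "[CH3:18]" -> "18".
MAP_RE = re.compile(r"\[[^\[\]]*:(\d+)\]")


def map_numbers(smiles):
    """Return the set of non-zero atom-map numbers found in a SMILES string."""
    return {int(n) for n in MAP_RE.findall(smiles) if int(n) != 0}


def promote_mapped_agents(rxn):
    """Move agents that share a map number with the product to the reactants.

    `rxn` is a reaction SMILES of the form "reactants>agents>products".
    """
    reactants, agents, products = rxn.split(">")
    product_maps = map_numbers(products)

    kept, promoted = [], []
    for mol in agents.split("."):
        if not mol:
            continue
        # An agent that contributes a mapped atom to the product is a reactant.
        if map_numbers(mol) & product_maps:
            promoted.append(mol)
        else:
            kept.append(mol)

    new_reactants = [m for m in reactants.split(".") if m] + promoted
    return ".".join(new_reactants) + ">" + ".".join(kept) + ">" + products
```

For example, with a hypothetical re-mapped reaction `Br[CH2:1][CH3:2]>[C-:3]#[N:4].O>[N:4]#[C:3][CH2:1][CH3:2]`, the cyanide fragment would be moved to the reactant side while water stays among the agents.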

RXNMapper is quite a heavy programme and would take many hours to run on a few million reactions. Since the gain is likely to be only marginal, and since we want to keep our own programme relatively lightweight, we have decided to keep the original mapping in the ORD dataset (Indigo in the case of USPTO data).

@Joearrowsmith Joearrowsmith added documentation Improvements or additions to documentation wontfix This will not be worked on labels Apr 11, 2023
dswigh commented Apr 19, 2023

Here's an example of where the atom mapping fails:
Br[CH2:2][C:3]1[CH:4]=[CH:5][C:6]2[O:15][C:10]3=[N:11][CH:12]=[CH:13][CH:14]=[C:9]3C:8[C:7]=2[CH:17]=1.[CH3:18]N:19C=O.[C-]#N.[Na+]>O>C:18#[N:19]
We would expect the triple-bonded N in the product to come from the triple-bonded N in the reactant ([C-]#N). There is nothing we can do about this; we are at the mercy of the existing atom mapping in ORD.
From: uspto-grants-1976_01.parquet ("ord-cc0d0a952867484fa3eb43ab33c5c8dd") index 412
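
A rough way to spot this kind of failure is to look up which reactant fragment each mapped product atom is attributed to. The sketch below is only an illustration: `atom_origins` is a hypothetical helper, and the reaction string is a simplified, made-up stand-in (benzyl bromide, DMF, sodium cyanide) that reproduces the same pattern, not the ORD record itself.

```python
import re

# Matches the atom-map number inside a bracket atom, e.g. "[N:19]" -> "19".
MAP_RE = re.compile(r"\[[^\[\]]*:(\d+)\]")


def atom_origins(rxn):
    """Map each product atom-map number to the reactant fragment carrying it."""
    reactants, _agents, products = rxn.split(">")
    product_maps = {int(n) for n in MAP_RE.findall(products)}

    origins = {}
    for mol in reactants.split("."):
        for n in MAP_RE.findall(mol):
            if int(n) in product_maps:
                origins[int(n)] = mol
    return origins


# Simplified, hypothetical reaction with the same mapping pattern as above:
# the nitrile N (map 19) is attributed to DMF rather than to cyanide.
rxn = (
    "Br[CH2:2][c:3]1ccccc1.[CH3:18][N:19](C)C=O.[C-]#N.[Na+]"
    ">O>"
    "[c:3]1([CH2:2][C:18]#[N:19])ccccc1"
)
origins = atom_origins(rxn)
for number in sorted(origins):
    print(number, origins[number])
```

In this stand-in, map numbers 18 and 19 both resolve to the DMF fragment, and the cyanide fragment contributes no mapped atoms at all, which is the signature of the mis-mapping described above.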
