Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_deterministic_sssr is not really deterministic #2562

Open
xiaoruiDong opened this issue Oct 24, 2023 · 0 comments
Open

get_deterministic_sssr is not really deterministic #2562

xiaoruiDong opened this issue Oct 24, 2023 · 0 comments
Labels
bug bug which will never be closed by the actions bot Complexity: High Priority: Low Topic: Thermo

Comments

@xiaoruiDong
Copy link
Contributor

xiaoruiDong commented Oct 24, 2023

A comment was added after creating this issue: There was a relevant issue before. See #1027.

1. Bug Description

The bug was identified by the witness of @jonwzheng and @hwpang when searching thermo for C1=CC2C=CC=1C=C2, the result was non-deterministic (try refreshing the page). The issue was first mentioned in #2490. However, this is not actually a decision tree issue, so I decided to make it a separate PR.

2. Reproducing the issue with RMG APIs

import os.path as osp
import rmgpy
from rmgpy.molecule.molecule import Molecule
from rmgpy.data.thermo import ThermoDatabase

tmdb = ThermoDatabase()
tmdb.load(osp.join(rmgpy.settings["database.directory"], "thermo"))

smi = "C1=CC2C=CC=1C=C2"
mol = Molecule().from_smiles(smi)
print(tmdb.compute_group_additivity_thermo(spc.molecule[0]).comment.split(":")[2])

The above block should be the minimal code example to reproduce the issue. You may see different thermo estimation comments from the execution. I found that after import is done, the result is deterministic. So In order to get different results, one needs to restart the kernel and import the module before actually calling the function.

3. Investigate get_deterministic_sssr

Then, I would like to briefly explain the results and the potential cause of its undeterminism.

I trace down the function calls and believe the real cause is that get_deterministic_sssr is not deterministic, at least for the investigating molecule.

The call stack is:

within ThermoDatabase
get_all_thermo_data > get_thermo_data_from_groups > estimate_thermo_via_group_additivity > compute_group_additivity_thermo > _add_polycyclic_correction_thermo_data > get_bicyclic_correction_thermo_data_from_heuristic > split_bicyclic_into_single_rings > bicyclic_submol.get_deterministic_sssr()

I typically see two kinds of results out of get_deterministic_sssr:

SSSR result 1:
[[0, 4, 5, 7, 6, 2], [2, 1, 3, 5, 7, 6]]
The corresponding molecule graph:
image
(note, atom index is 1-indexed in the graph, while SSSR result is 0-indexed)

So both of the two rings identified involve three double bonds ([0,4],[5,7], [7,6]) and ([1,3],[5,7],[7,6])

SSSR result 2:
[[2, 1, 4, 3, 0, 5], [0, 3, 4, 1, 7, 6]]
The corresponding molecule graph:
image
(note, atom index is 1-indexed in the graph, while SSSR result is 0-indexed)

Only the first ring has three double bonds ([2,1],[4,3], [5,2]). The second ring only has two double bonds ([3,4],[7,6]).

Within get_deterministic_sssr, the different results are due to nondeterminism when choosing the so-called root_vertex:

root_candidates_tups = []
for vertex in graph0.vertices:
tup = (vertex, get_vertex_connectivity_value(vertex), -origin_conn_dict[vertex])
root_candidates_tups.append(tup)
root_vertex = sorted(root_candidates_tups, key=lambda tup0: tup0[1:], reverse=True)[0][0]

The returned value by sorted(root_candidates_tups, key=lambda tup0: tup0[1:], reverse=True)

[(<Atom 'C'>, -603, 603),
 (<Atom 'C'>, -603, 603),
 (<Atom 'C'>, -603, 603),
 (<Atom 'C'>, -603, 603),
 (<Atom 'C'>, -603, 603),
 (<Atom 'C'>, -603, 603),
 (<Atom 'C'>, -879, 879),
 (<Atom 'C'>, -879, 879)]

Basically, the root vertexes are iteratively picked according to the connectivity value (get_vertex_connectivity_value), and connectivity values do not take bond orders into account (basically convolutionally counting the number of neighbors). Like the molecule in this example, the atoms on the bridge share the same values for a symmetric skeleton. Therefore, the sorting does not distinguish the atom with/without double bonds...

Then, In the case where the first picked root vertex is on the bridge involving two double bonds (e.g., carbon 7 or 8, see figure below), the smallest ring involving this bridge has 3 double bonds. However, this bridge will be removed after finding the smallest ring, causing the second ring to only have 2 double bonds instead of 3; instead, if the first picked root vertex is on the bridge involving only one double bond (e.g., carbon 1,2,4,5), then both smallest ring will have 3 double bonds, as the algorithm will maximize the total bond order of the ring

# Keep the smallest of the cycles found above
cycle_candidate_tups = []
for cycle0 in cycles:
tup = (cycle0, len(cycle0), -sum([origin_conn_dict[v] for v in cycle0]),
-sum([v.element.number for v in cycle0]),
-sum([v.get_total_bond_order() for v in cycle0]))
cycle_candidate_tups.append(tup)
cycle = sorted(cycle_candidate_tups, key=lambda tup0: tup0[1:])[0][0]
cycle_list.append(cycle)

image

4. Workaround

In the docstring, get_smallest_set_of_smallest_rings is recommended in "new" applications. However, a preliminary test shows it still suffers from the same issue.

I've also considered adding more sorting keys when choosing root vertex, e.g., considering bond orders. But couldn't figure out a good way to really avoid determinism...

It is possible to use get_relevant_cycles, which is deterministic, and choose a set of rings from the relevant cycles. It involves a more complicated rewrite and possibly redesigning the algorithm, and we still need to be cautious when selecting the rings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bug which will never be closed by the actions bot Complexity: High Priority: Low Topic: Thermo
Projects
None yet
Development

No branches or pull requests

1 participant