New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Soundness bug with connected components #29
Comments
I agree, that looks like a bug. Thanks for including an example script
demonstrating it!
I’ll see if I can take a look next chance I get.
I’m curious about your use case. The connected component theory solver
should work (this bug aside), but it hasn’t been as carefully optimized or
as well tested as the maximum flow, reachability, or shortest path theory
solvers. If you are planning to use the connected component theory solver
at scale, it might need additional work.
…On Thu, Oct 29, 2020 at 9:18 PM Jeremy Salwen ***@***.***> wrote:
I am seeing a soundness bug with connected compontents, where it marks the
connected components expression and its negation as both true.
Example:
from monosat import *
from functools import reduce
import sys
import matplotlib.image as mpimg
import numpy as np
def print_districts(districts):
print("districts:")
for row in range(districts.shape[0]):
for col in range(districts.shape[1]):
for d in range(districts.shape[2]):
if districts[row, col, d].value():
print(f"{d:02}", end=" ")
print()
def solve_gerrymander(grid, num_districts):
district_size = grid.shape[0] * grid.shape[1] // num_districts
if district_size * num_districts != grid.shape[0] * grid.shape[1]:
print("Error: Need evenly divisible districts")
return
exact_victory = np.sum(grid) == (district_size // 2 + 1) * (num_districts // 2 + 1)
if not exact_victory:
print(
"Warning: this solver may be more efficient if the gerrymandering must be solved exactly."
)
districts = np.fromfunction(
np.vectorize(lambda r, c, d: Var()),
(grid.shape[0], grid.shape[1], num_districts),
dtype=object)
graphs = [Graph() for _ in range(num_districts)]
nodes = np.fromfunction(
np.vectorize(lambda r, c, d: graphs[d].addNode()),
(grid.shape[0], grid.shape[1], num_districts),
dtype=object)
# Each square is in exactly one district
for row in range(grid.shape[0]):
for col in range(grid.shape[1]):
AssertEqualPB([districts[row, col, d] for d in range(num_districts)], 1)
# Each district is of the right size.
for d in range(num_districts):
squares_contained = [
districts[row, col, d]
for row in range(grid.shape[0])
for col in range(grid.shape[1])
]
AssertEqualPB(squares_contained, district_size)
for row in range(grid.shape[0]):
for col in range(grid.shape[1]):
for dr, dc in [[1, 0], [0, 1]]:
other_r = row + dr
other_c = col + dc
if other_c < 0 or other_r < 0 or other_r >= grid.shape[
0] or other_c >= grid.shape[1]:
continue
for d in range(num_districts):
edge = graphs[d].addUndirectedEdge(nodes[row, col, d], nodes[other_r, other_c, d])
AssertEq(edge, And(districts[row, col, d], districts[other_r, other_c, d]))
ccs = []
ccns = []
# All graphs must have exactly N-district size+1 connected components
for d in range(num_districts):
ccs.append(graphs[d].connectedComponentsGreaterThan((num_districts-1) * district_size))
ccns.append(Not(graphs[d].connectedComponentsGreaterThan((num_districts-1) * district_size)))
Assert(ccs[-1])
Assert(ccns[-1])
# A majority of districts must be won.
districts_won = []
for d in range(num_districts):
votes = []
for row in range(grid.shape[0]):
for col in range(grid.shape[1]):
if grid[row, col]:
votes.append(districts[row, col, d])
if exact_victory:
# everything must be gerrymandered perfectly to solve.
AssertOr(EqualPB(votes, 0),
EqualPB(votes, district_size // 2 + 1))
else:
districts_won.append(GreaterEqPB(votes, district_size // 2 + 1))
if exact_victory:
# everything must be gerrymandered perfectly to solve.
AssertEqualPB(districts_won, num_districts // 2 + 1)
else:
AssertGreaterEqPB(districts_won, num_districts // 2 + 1)
result = Solve()
if result:
print("SAT")
for d in range(num_districts):
print(d, ccs[d].value())
print(d, ccns[d].value())
print_districts(districts)
else:
print("UNSAT")
def broken_example():
grid = np.array([[False, False, False, False, True],
[False, False, True, False, True],
[True, False, True, False, True],
[False, False, False, False, True],
[True, False, True, True, True]])
solve_gerrymander(grid, 5)
Output:
SAT
0 True
0 True
1 True
1 True
2 True
2 True
3 True
3 True
4 True
4 True
districts:
01 00 04 01 04
01 01 03 02 04
03 00 03 01 02
04 00 00 03 04
02 02 02 00 03
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#29>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABOECW32NGQP7KTC2MIQKDSNI5CPANCNFSM4TEQ3U4A>
.
|
I am trying to see if SMT solvers can solve a gerrymandering puzzle. (Basically this example is the entire problem I am trying to solve, just for a larger grid). I originally encoded my puzzle using Z3, using a variant or the integer node-distance encoding you mentioned for reachability in your thesis (the constraint is slightly more relaxed, so that the distance must be one more than any neighbor rather than the minimum neighbor). I found that Z3 could solve it for small grids (using QF_FD), but once the grids got big, it struggled to even produce any partitioning of the grid into connected districts, even ignoring the gerrymandering constraints. I then tried to research how other people were encoding connectness constraints into SMT solvers and found monosat. The case I care about (some dynamically selected set of vertices is connected) doesn't map exactly to the predicates provided by monosat, but I thought that using connected components was the most direct implementation (using the fact that non-selected nodes will not be connected at all, thus always forming individual connected components of size 1.) I could also imagine encoding a whole bunch of reachability constraints from every vertex of a district to every other, but that would have a quadratic blowup. Another idea is I could create a graph of size K where K is the size of the district, dynamically assigning each vertex to a given square in the grid based on which squares are contained in the district, and compute reachability between these nodes by adding a bunch of conditions like "If node i is in position (x,y) and node j is in position (x+1,y), they are connected", but this seems like it would have a quadratic blowup as well. I am open to other suggestions for how to encode this problem. I also would be interested in other suggestions on how to encode connectedness for a general purpose SMT solver, besides the method I am using of encoding "steps from origin" as an integer variable. |
What is the maximum number of disconnected components you expect the graph to be partitioned into? Is it 2? I would consider reachability constraints (which are much more optimized in MonoSAT than the connected component constraints). Reachability predicates sharing the same source node can re-use most of their computations inside the solver, so try to arrange to have as few distinct source nodes in the reachability constraints as possible. One way I have done this in the past is to add one (or a small number) of extra 'source' nodes, and then use constraints to control which edges from that source node are enabled. This transformation can often allow many distinct source nodes to be replaced with one source node. Even then, you also want to try to have as few distinct reachability predicates as possible (as a secondary optimization, after minimizing the number of separate source nodes). However, I think that instead of using reachability constraints, you might want to use maxflow constraints. Maxflow constraints are well optimized in MonoSAT and often better suited for constraints involving many nodes (as opposed to having lots of separate reach predicates). I haven't tested it, but you may be able to use maxflow to enforce connectivity in the graph, by adding an extra destination node, with edges from each node that should be connected with capacity I'm not completely sure that this will work in your scenario; if you want to ensure multiple connected components you would potentially need one maxflow predicate per connected component. |
Oh wow I just realized that actually my constraint could be encoded as just a single connected component count constraint on a single graph. Really I am just trying to partition a graph into N connected subgraphs of equal size. Here is an implementation encoding it that way:
unfortunately this also results in an unsound solution:
I'm not sure exactly how I would encode my constraint using maxflow or reachability constraints. The problem is that there is no specific node that is guaranteed to be in any specific district. So for reachability, I just see there being a large number of reachability constraints, however I encode it. For max flow, I would need not only one constraint for each connected component, but also one constraint for each possible location for my source node, again, resulting in a large number of constraints. The fact that this encodes so nicely as a connected components query makes me think that there probably isn't a better way to encode it. Is there something I can do to debug this failure in monosat, or would it be deep in the internals beyond my comprehension? |
If you can get it down to a single connected component count constraint,
there is a good chance that is the best option.
I would probably need to debug it (and I should - I just might not have
time to in the immediate future).
But if you only need one connected component predicate, then maybe the bug
won't actually impact you?
If you create the duplicate predicates, monosat attempts to deduplicate
them into a single literal. My guess right now (not yet verified) is that
there is a sign error in that code impacting connected components.
If you just manually cache the literal returned by
"graph.connectedComponents" and reuse it, rather than calling the method
twice, the error might go away.
…On Mon., Nov. 2, 2020, 3:29 p.m. Jeremy Salwen, ***@***.***> wrote:
Oh wow I just realized that actually my constraint could be encoded as
just a single connected component count constraint on a single graph.
Really I am just trying to partition a graph into N connected subgraphs of
equal size. Here is an implementation encoding it that way:
from monosat import *
import numpy as np
def print_districts(districts):
print("districts:")
for row in range(districts.shape[0]):
for col in range(districts.shape[1]):
for d in range(districts.shape[2]):
if districts[row, col, d].value():
print(f"{d:02}", end=" ")
print()
def solve_gerrymander(grid, num_districts):
district_size = grid.shape[0] * grid.shape[1] // num_districts
if district_size * num_districts != grid.shape[0] * grid.shape[1]:
print("Error: Need evenly divisible districts")
return
exact_victory = np.sum(grid) == (district_size // 2 + 1) * (num_districts // 2 + 1)
if not exact_victory:
print(
"Warning: this solver may be more efficient if the gerrymandering must be solved exactly."
)
districts = np.fromfunction(
np.vectorize(lambda r, c, d: Var()),
(grid.shape[0], grid.shape[1], num_districts),
dtype=object)
graph = Graph()
nodes = np.fromfunction(
np.vectorize(lambda r, c: graph.addNode()),
(grid.shape[0], grid.shape[1]),
dtype=object)
# Each square is in exactly one district
for row in range(grid.shape[0]):
for col in range(grid.shape[1]):
AssertEqualPB([districts[row, col, d] for d in range(num_districts)], 1)
# Each district is of the right size.
for d in range(num_districts):
squares_contained = [
districts[row, col, d]
for row in range(grid.shape[0])
for col in range(grid.shape[1])
]
AssertEqualPB(squares_contained, district_size)
for row in range(grid.shape[0]):
for col in range(grid.shape[1]):
for dr, dc in [[1, 0], [0, 1]]:
other_r = row + dr
other_c = col + dc
if other_c < 0 or other_r < 0 or other_r >= grid.shape[
0] or other_c >= grid.shape[1]:
continue
edge = graph.addUndirectedEdge(nodes[row, col], nodes[other_r, other_c])
AssertEq(edge, Or(*[And(districts[row, col, d], districts[other_r, other_c, d]) for d in range(num_districts)]))
# All num_district districts must be connected.
Assert(graph.connectedComponentsGreaterThan(num_districts-1))
Assert(Not(graph.connectedComponentsGreaterThan(num_districts)))
# A majority of districts must be won.
districts_won = []
for d in range(num_districts):
votes = []
for row in range(grid.shape[0]):
for col in range(grid.shape[1]):
if grid[row, col]:
votes.append(districts[row, col, d])
if exact_victory:
# everything must be gerrymandered perfectly to solve.
AssertOr(EqualPB(votes, 0),
EqualPB(votes, district_size // 2 + 1))
else:
districts_won.append(GreaterEqPB(votes, district_size // 2 + 1))
if exact_victory:
# everything must be gerrymandered perfectly to solve.
AssertEqualPB(districts_won, num_districts // 2 + 1)
else:
AssertGreaterEqPB(districts_won, num_districts // 2 + 1)
result = Solve()
if result:
print("SAT")
print_districts(districts)
else:
print("UNSAT")
def broken_example():
grid = np.array([[False, False, False, False, True],
[False, False, True, False, True],
[True, False, True, False, True],
[False, False, False, False, True],
[True, False, True, True, True]])
solve_gerrymander(grid, 5)
unfortunately this also results in an unsound solution:
SAT
districts:
01 00 01 02 04
04 00 03 01 04
03 04 03 01 02
01 00 00 00 04
02 03 02 02 03
I'm not sure exactly how I would encode my constraint using maxflow or
reachability constraints. The problem is that there is no specific node
that is guaranteed to be in any specific district.
So for reachability, I just see there being a large number of reachability
constraints, however I encode it.
For max flow, I would need not only one constraint for each connected
component, but also one constraint for each possible location for my source
node, again, resulting in a large number of constraints.
The fact that this encodes so nicely as a connected components query makes
me think that there probably isn't a better way to encode it. Is there
something I can do to debug this failure in monosat, or would it be deep in
the internals beyond my comprehension?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#29 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABOECQ3V253IVIDCPOTCITSN46HHANCNFSM4TEQ3U4A>
.
|
Thank you! The code from my previous comment has reduced it down to a single connected components call:
but unfortunately that was still unsound. |
I have a potential bug fix, but its not well tested enough yet to be merged into master. For now, I'm keeping these changes in a separate If you have a chance, could you check out the new 'connected_components' branch, recompile, and (critically) reinstall the python bindings with:
(the -f will ensure that the previous python API will be replaced with the new one). Both of your test examples are UNSAT in the new branch. |
Thanks, this seems to work. I think there is an off by one issue though.
results in UNSAT, but
results in a solution that has only num_districts connected components. |
So I am finding it is struggling (>24 h not solved) to partition a 7x7 grid into 7 equally sized connected components, even if I remove any of the vote counting constraints. (Note that this is relatively easy to solve by splitting into rows or columns). Is this surprising to you?
|
It's not surprising. The connected component solver currently has no theory
specific decision heuristics, and so if the only thing it is doing is
trying to carve up a space into equally sized components, it is basically
doing that by trial and error.
You might need to help the solver here, by adding additional constraints,
for example to remove symmetric solutions ("symmetry breaking constraints").
In your grid example, it looks to me like there are many equivalent
solutions to any satisfying solution, where the only difference would be
that the index assigned to a given connected component is swapped with a
different component.
One option would be to impose an ordering on these, for example such that
the left most component must be partition "0", and perhaps continuing on
such that the index of each component is ordered relative to the components
up and left of the top-left most node in the component.
Is every node part of exactly one district? In that case, you can
arbitrarily assign the top left node to district 0.
If you can statically know in advance that some nodes must belong in
certain connected components, that information could also help the solver.
From your constraints:
# All num_district districts must be connected.
Assert(graph.connectedComponentsLessOrEqualTo(num_districts + 1))
Is the +1 to account for the off by one error that you mentioned? If so, I
should be able to fix that.
…On Sun., Nov. 8, 2020, 4:12 p.m. Jeremy Salwen, ***@***.***> wrote:
So I am finding it is struggling (>24 h not solved) to partition a 7x7
grid into 7 equally sized connected components, even if I remove any of the
vote counting constraints. (Note that this is relatively easy to solve by
splitting into rows or columns). Is this surprising to you?
from monosat import *
import matplotlib.image as mpimg
import numpy as np
def print_districts(districts):
print("districts:")
for row in range(districts.shape[0]):
for col in range(districts.shape[1]):
for d in range(districts.shape[2]):
if districts[row, col, d].value():
print(f"{d:02}", end=" ")
print()
def solve_gerrymander(grid, num_districts):
district_size = grid.shape[0] * grid.shape[1] // num_districts
if district_size * num_districts != grid.shape[0] * grid.shape[1]:
print("Error: Need evenly divisible districts")
return
exact_victory = np.sum(grid) == (district_size // 2 + 1) * (num_districts // 2 + 1)
if not exact_victory:
print(
"Warning: this solver may be more efficient if the gerrymandering must be solved exactly."
)
districts = np.fromfunction(
np.vectorize(lambda r, c, d: Var()),
(grid.shape[0], grid.shape[1], num_districts),
dtype=object)
graph = Graph()
nodes = np.fromfunction(
np.vectorize(lambda r, c: graph.addNode()),
(grid.shape[0], grid.shape[1]),
dtype=object)
# Each square is in exactly one district
for row in range(grid.shape[0]):
for col in range(grid.shape[1]):
AssertEqualPB([districts[row, col, d] for d in range(num_districts)], 1)
# Each district is of the right size.
for d in range(num_districts):
squares_contained = [
districts[row, col, d]
for row in range(grid.shape[0])
for col in range(grid.shape[1])
]
AssertEqualPB(squares_contained, district_size)
for row in range(grid.shape[0]):
for col in range(grid.shape[1]):
for dr, dc in [[1, 0], [0, 1]]:
other_r = row + dr
other_c = col + dc
if other_c < 0 or other_r < 0 or other_r >= grid.shape[
0] or other_c >= grid.shape[1]:
continue
edge = graph.addUndirectedEdge(nodes[row, col], nodes[other_r, other_c])
AssertEq(edge, Or(*[And(districts[row, col, d], districts[other_r, other_c, d]) for d in range(num_districts)]))
# All num_district districts must be connected.
Assert(graph.connectedComponentsLessOrEqualTo(num_districts + 1))
result = Solve()
if result:
print("SAT")
print_districts(districts)
else:
print("UNSAT")
def broken_example():
grid = np.array([[True, True, False, False, False, False, False],
[True, False, False, False, True, False, True],
[False, False, False, False, False, False, True],
[False, False, False, False, True, False, True],
[False, False, True, False, True, False, True],
[True, False, False, False, False, False, True],
[False, False, True, False, True, True, True]])
solve_gerrymander(grid, 7)
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#29 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABOECSWOFU7YYIACURAZEDSO4XYJANCNFSM4TEQ3U4A>
.
|
Thanks for the suggestions. The +1 is indeed to account for the off-by-one error. I created an encoding that doesn't have all this extra symmetry by using a large number of max flow constraints. However, it seems to me there is another soundness error I am hitting again (maybe related to an off by one error, but with max flow constraints?) The idea I am using is to create a single source node with a directed connection of weight 1 to every grid node. Then I put infinite weight on any of the undirected edges between the grid nodes, and add a constraint to every node that the max flow to that node from the source node is exactly the district size. This seems to somewhat work, but I am seeing unsound solutions. (max flow of 4 when the constraint requires it to be >=5) Full example below:
Output:
Note that the upper left component of size four despite a constraint of max flow >=5. (Also note that I am adding the constraint that max flow <7 instead of <6, because it is harder to solve with the constraint that maxflow exactly equals 5, and the solution is still unsound with this constraint). |
About the off by one error in the connected component constraints: If you have a chance, could you create a small test demonstrating it? |
Hi Sam, thank you for the helpful debug options! It turns out the problem was with numpy vectorize adding extra edges to my graph: numpy/numpy#8758 (you can see it in the above graph,with two connections from node 26 to node 1) I believe this was also causing the "off by one" issue I was seeing. I am able to solve the problem for some graph sizes now, but I am hitting an internal crash sometimes for specific problem sizes. I am using a large number of maxflow constraints inside a pseudoboolean constraint Here is the smallest size I was able to reproduce the crash with:
Output:
|
Thank you for continuing to provide small test cases, they are incredibly helpful. I've reproduced this crash locally, and then reduced this down to the attached 'GNF' format constraints, which are a subset of the concrete constraints generated in MonoSAT by running your script that are sufficient to crash the solver. Is it ok with you if I include these and similar GNF format constraints generated from your examples as part of the monosat regression test suite, |
I have a candidate fix that you can try (in the |
Yes, I release this code and any derived files under the MIT license (copyright assigned to Google, not me though). |
Tried out the new changes, it seems like it's working successfully now with the max flow constraints! I also found that adding an additional constraint (that if two adjacent nodes are reachable from each other, then they must be connected directly) helped to speed up the solving. It was able to solve a 7x7 grid in a minute or less, but it still hasn't been able to solve any of the larger grid sizes yet. I think an interesting thing about this encoding (using max flow to encode district size and vote counts, with a "fully connected districts" constraint) is that there are no unnecessary degrees of freedom, i.e. each satisfying assignment corresponds to exactly one possible districting. However, I am still not so happy that there are such a huge number of max flow constraints, and I wonder if there is a better way to encode it with a smaller number of constraints (although maybe that doesn't really matter for solving performance?) Thanks again for all the help! |
Each graph predicate is relatively expensive. In general, you want to minimize the number of graph predciates wherever possible; and after having removed as many as you can, try to force as many of the remaining ones to constant 'true' or 'false' as you can get away with. MonoSAT is pretty heavily optimized for the case where there is exactly one graph predicate, and it is assigned to constant 'true' or 'false'. Do you expect your constraints to be SAT or UNSAT? Sounds like 'SAT' is the expected outcome. In that case, try setting the '-decide-theories' option, which enables theory specific decision heuristics. These can cause a huge speed up on satisfiable instances (for reachability and maxflow predicates), but can also cause pathologically bad behavior for some common use cases, so they aren't on by default. Have you considered trying a CEGAR loop ("counter-example guided abstraction refinement"). CEGAR is the goto technique for improving scalability in constraint solvers. When you have a 'large' constraint system that you want to solve, but which is too expensive to solve all at once, then you can start with an initially underconstrained (and easy to solve) version of the constraints, and then, each time the solver produces a solution, bring in additional constraints as needed to refute the SAT solvers solution (if that solution does not actually satisfy your requirements). MonoSAT is an incremental solver, which allows it to implement CEGAR loops efficiently. From python, what you can do is (roughly):
That |
Thanks for the suggestions! I am definitely expecting "SAT" as the outcome. I have tried using -decide theories in a few different contexts, but it doesn't seem to speed things up in general. I have also tried a few different variations of a CEGAR loop, without much success. In the process, I also learned that even solving a single graph constraint (that is trivially solvable by hand, like max flow to some node == 7) can be slow for monosat, probably related to the limitations of the structure of the constraint you mentioned above. Also is it possible that graph constraints are not solved incrementally? it seems like adding a constraint that is already satisfied causes monosat to generate a completely new solution usually. |
Incremental solving supports graph constraints; but there is in general no
guarantee that you will get the same solution after adding additional
constraints (graph or otherwise) and re-solving.
I'm very surprised that a single satisfiable maxflow constraints would be
slow, especially if theory decisions are enabled. I've solved challenging
constraints with large (100k+ node) graphs, in which a single maxflow
constraint was enforced, using the theory decision heuristics.
I am going to be very preoccupied for at least the next two weeks, but I'm
happy to keep brainstorming ways to attack this problem afterward. If you
attach a slow maxflow constraint example I'll take a look and see what
might be going on.
…On Mon., Dec. 7, 2020, 10:10 p.m. Jeremy Salwen, ***@***.***> wrote:
Thanks for the suggestions!
I am definitely expecting "SAT" as the outcome. I have tried using -decide
theories in a few different contexts, but it doesn't seem to speed things
up in general.
I have also tried a few different variations of a CEGAR loop, without much
success. In the process, I also learned that even solving a single graph
constraint (that is trivially solvable by hand, like max flow to some node
== 7) can be slow for monosat, probably related to the limitations of the
structure of the constraint you mentioned above.
Also is it possible that graph constraints are not solved incrementally?
it seems like adding a constraint that is already satisfied causes monosat
to generate a completely new solution usually.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#29 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABOECWJ43NENV73TFEN7ZDSTW7ONANCNFSM4TEQ3U4A>
.
|
I'm going to close this issue since we addressed the original correctness problem, but I'd love to keep brainstorming approaches to modelling this gerrymandering problem. If you want to chat more about it, could you open a new issue? |
I am seeing a soundness bug with connected compontents, where it marks the connected components expression and its negation as both true.
Example:
Output:
Note how I assert a constraint and its negation, but it still says SAT, and it evaluates both the constraint and its negation as true (all the "True"s printed out)
The text was updated successfully, but these errors were encountered: