
pgsql_big_dedupe_example fails #129

Open
wilko77 opened this issue Jul 11, 2022 · 3 comments

wilko77 commented Jul 11, 2022

I ran the postgres example as-is against a Postgres 14.2 database with dedupe version 2.0.17.
After training and clustering, it eventually fails during 'writing results' with the following error:

writing results
WARNING:dedupe.clustering:A component contained 656982 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.8445158759995937
Traceback (most recent call last):
  File "/Users/******/Code/dedupe-examples/pgsql_big_dedupe_example/pgsql_big_dedupe_example.py", line 304, in <module>
    write_cur.copy_expert('COPY entity_map FROM STDIN WITH CSV',
psycopg2.errors.QueryCanceled: COPY from stdin failed: error in .read() call: ValueError Iteration of zero-sized operands is not enabled
CONTEXT:  COPY entity_map, line 1
@evanmuller

I'm experiencing a similar issue with the mysql_example:

creating entity_map database
A component contained 56250 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.9027395568206275
Traceback (most recent call last):
  File "mysql_example.py", line 277, in <module>
    write_cur.executemany('INSERT INTO entity_map VALUES (%s, %s, %s)',
  File "/home/ubuntu/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 230, in executemany
    return self._do_execute_many(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/MySQLdb/cursors.py", line 258, in _do_execute_many
    for arg in args:
  File "mysql_example.py", line 50, in cluster_ids
    for cluster, scores in clustered_dupes:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/api.py", line 341, in cluster
    yield from clustering.cluster(scores, threshold)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 238, in cluster
    for sub_graph in dupe_sub_graphs:
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 51, in connected_components
    yield from _connected_components(edgelist, max_components)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 99, in _connected_components
    for sub_graph in _connected_components(filtered_sub_graph, max_components):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 59, in _connected_components
    component_stops = union_find(edgelist)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/dedupe/clustering.py", line 114, in union_find
    it = numpy.nditer(edgelist, ["external_loop"])
ValueError: Iteration of zero-sized operands is not enabled

@evanmuller

I got the mysql example to work by adding the "zerosize_ok" flag to the numpy.nditer call in clustering.py. I imagine this would also resolve the OP's postgres example. I'm not a Python developer, so I don't want to open a PR until I have a better understanding of what's going on. In the union_find function in clustering.py, I changed...

it = numpy.nditer(edgelist, ["external_loop"])

to...

it = numpy.nditer(edgelist, ["external_loop", "zerosize_ok"])
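The underlying behavior is easy to reproduce outside dedupe: numpy.nditer raises at construction time when given a zero-sized operand unless the "zerosize_ok" flag is set. A minimal sketch (the plain empty array here is a stand-in, not dedupe's actual edgelist dtype):

```python
import numpy as np

empty = np.empty(0)  # stand-in for an edgelist with no rows left after re-filtering

# Without "zerosize_ok", nditer raises the error seen in the tracebacks above:
try:
    np.nditer(empty, ["external_loop"])
except ValueError as e:
    print(e)  # "Iteration of zero-sized operands is not enabled"

# With "zerosize_ok", the iterator is created and simply yields nothing:
it = np.nditer(empty, ["external_loop", "zerosize_ok"])
assert sum(1 for _ in it) == 0
```

This suggests the re-filtering of large components can hand union_find an empty edgelist, and the one-line flag change makes nditer treat that as zero iterations rather than an error.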

lemig added a commit to lemig/dedupe that referenced this issue Oct 7, 2022
@twright8

twright8 commented Feb 7, 2023

This still doesn't work for me, even with the fix above. Any new solutions?
